Abstract:
The size distributions of all known coding and noncoding DNA sequences are studied in all human chromosomes. In a unified approach, both introns and intergenic regions are treated as noncoding regions. The distributions of noncoding segments $P_{nc}(S)$ of size $S$ present long tails, $P_{nc}(S) \sim S^{-1-\mu_{nc}}$, with exponents $\mu_{nc}$ ranging between 0.71 (for chromosome 13) and 1.2 (for chromosome 19). In contrast, exponential, short-range decay terms dominate in the distributions of coding (exon) segments $P_{c}(S)$ in all chromosomes. Aiming to address the emergence of these statistical features, minimal, stochastic, mean-field models are
proposed, based on randomly aggregating DNA strings with duplication, influx, and outflux of genomic segments. These minimal models reproduce both the short-range statistics of the coding DNA and the observed power-law and fractal statistics of the noncoding DNA. The minimal models also demonstrate that although the two
systems (coding and noncoding) coexist, alternating on the same linear chain, they act independently: the coding as a closed, equilibrium system and the noncoding as an open, out-of-equilibrium one.
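
The contrast between a long-tailed and an exponentially decaying size distribution can be illustrated numerically. The sketch below is not the analysis used in this work: it applies a standard maximum-likelihood (Hill) estimator to synthetic data. The function names, the sample sizes, the lower cutoff of 100, the exponent value 1.0 (chosen only because it lies inside the reported range 0.71 to 1.2), and the exponential mean of 150 are all assumptions made for illustration. For sizes drawn from a tail of the form S^(-1-mu), the estimate stays close to mu when the cutoff is raised, whereas for exponentially distributed sizes it keeps growing with the cutoff, so no single exponent describes the tail.

```python
import math
import random


def hill_exponent(sizes, s_min):
    """Hill (maximum-likelihood) estimate of mu for a tail P(S) ~ S^(-1-mu), S >= s_min."""
    tail = [s for s in sizes if s >= s_min]
    if not tail:
        raise ValueError("no segment sizes at or above s_min")
    log_sum = sum(math.log(s / s_min) for s in tail)
    return len(tail) / log_sum


def sample_power_law(mu, s_min, n, rng):
    """Inverse-transform sampling with survival function (S / s_min)^(-mu)."""
    return [s_min * (1.0 - rng.random()) ** (-1.0 / mu) for _ in range(n)]


def sample_exponential(mean, n, rng):
    """Exponentially distributed sizes with the given mean."""
    return [rng.expovariate(1.0 / mean) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins (assumed values, not genomic data).
noncoding_like = sample_power_law(mu=1.0, s_min=100.0, n=50_000, rng=rng)
coding_like = sample_exponential(mean=150.0, n=50_000, rng=rng)

# Power-law data: the estimate is stable under a change of cutoff.
mu_low_cut = hill_exponent(noncoding_like, s_min=100.0)
mu_high_cut = hill_exponent(noncoding_like, s_min=400.0)

# Exponential data: the apparent exponent drifts upward with the cutoff.
exp_low_cut = hill_exponent(coding_like, s_min=150.0)
exp_high_cut = hill_exponent(coding_like, s_min=600.0)

print("power-law sample:  ", round(mu_low_cut, 3), round(mu_high_cut, 3))
print("exponential sample:", round(exp_low_cut, 3), round(exp_high_cut, 3))
```

Comparing the estimate at two cutoffs is a deliberately simple diagnostic; on real segment-size data the choice of lower cutoff would need its own justification.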