A linguistic complexity measure was applied to the complete genomes of HIV-1, Escherichia coli, Bacillus subtilis, Haemophilus influenzae, and Mycoplasma genitalium, and to long human and yeast genomic fragments. Complexity values averaged over entire genomic sequences were compared, as were predicted average values of intrinsic DNA curvature. We found that both the most curved and the least complex fragments are located preferentially in non-coding parts of the genome. Analysis of the location of the most curved and the simplest regions in bacteria showed that the low-complexity segments lie preferentially in close proximity to the highly curved sequences, which are, in turn, located 100 to 200 bases upstream of the start of the nearest coding sequence. We conclude that the parallel analysis of sequence complexity and DNA curvature might provide important information about sequence-structure-function relationships in genomes.
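The abstract does not define the complexity measure, so the following is only a minimal sketch of one common formulation of linguistic complexity, not necessarily the exact procedure used in the paper. It assumes that complexity is the product of "vocabulary usage" values over word lengths 1 to `max_k`, where vocabulary usage is the number of distinct k-mers observed divided by the maximum number possible in a sequence of that length. The function names, the window size, and the choice of `max_k` are illustrative assumptions.

```python
def vocabulary_usage(seq: str, k: int, alphabet_size: int = 4) -> float:
    """Fraction of the possible k-mers that actually occur in seq.

    Assumption: the denominator is min(alphabet_size**k, number of k-mer
    positions), i.e. the largest vocabulary a sequence of this length allows.
    """
    n_positions = len(seq) - k + 1
    if n_positions <= 0:
        raise ValueError("sequence is shorter than the word length k")
    observed = len({seq[i:i + k] for i in range(n_positions)})
    possible = min(alphabet_size ** k, n_positions)
    return observed / possible


def linguistic_complexity(seq: str, max_k: int = 3) -> float:
    """Product of vocabulary usage over word lengths 1..max_k (range 0 to 1)."""
    complexity = 1.0
    for k in range(1, max_k + 1):
        complexity *= vocabulary_usage(seq, k)
    return complexity


def sliding_complexity(seq: str, window: int = 20, step: int = 1, max_k: int = 3):
    """Complexity profile: (start position, complexity) for each window."""
    return [
        (start, linguistic_complexity(seq[start:start + window], max_k))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: a varied stretch followed by a poly-A run.
    toy = "ACGTTGCA" + "A" * 8
    for start, value in sliding_complexity(toy, window=8, max_k=2):
        print(start, round(value, 3))
```

Under these assumptions, windows covering the poly-A run score far lower than windows covering the varied stretch, which is the kind of low-complexity signal the abstract relates to curved, non-coding regions.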
Bibliographical note
Funding Information:
The authors would like to express their gratitude to Drs. A. Konopka, E.N. Trifonov, and D. Landsman for their invaluable suggestions during the preparation of the manuscript. Dr. Birgit An der Lan provided excellent editorial assistance. A.B. was supported by the Danish National Research Foundation and NCBI Scientific Visitors Program at the National Library of Medicine, NIH.
Keywords
- DNA curvature
- Linguistic analysis
- Nucleotide composition
ASJC Scopus subject areas
- Applied Microbiology and Biotechnology
- General Chemical Engineering