Abstract
We introduce a novel, linguistic-like method of genome analysis. We propose a natural approach to characterizing genomic sequences based on occurrences of fixed-length words from a predefined, sufficiently large set of words (strings over the alphabet {A, C, G, T}). The measure based on this approach is called the compositional spectrum; it is a histogram of imperfect word occurrences. Our results show that the compositional spectrum is an overall characteristic of a long sequence, i.e., a complete genome or an uninterrupted part of a chromosome. This property is manifested in the similarity of spectra obtained from different stretches of the same genome and, simultaneously, in a broad range of dissimilarities between the spectra of different genomes. Imperfect matching makes the approach highly flexible, so sets of relatively long words can be considered. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.
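To make the idea of a histogram of imperfect word occurrences concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that an imperfect occurrence means a window of the sequence that differs from a word by at most `max_mismatches` substitutions, and that two spectra are compared with a Spearman rank correlation (suggested only by the "Rank correlation" keyword). The word length, word-set size, mismatch threshold, random choice of words, and all function names are hypothetical choices made for illustration.

```python
import random

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(sequence, words, max_mismatches):
    """For each word, count the windows of `sequence` that match it with
    at most `max_mismatches` substitutions (assumed matching rule)."""
    length = len(words[0])
    counts = [0] * len(words)
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        for i, word in enumerate(words):
            if hamming(window, word) <= max_mismatches:
                counts[i] += 1
    return counts


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for a constant spectrum")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical setup: 50 random words of length 10, up to 5 mismatches.
    words = ["".join(rng.choice(ALPHABET) for _ in range(10)) for _ in range(50)]
    # Synthetic stand-in for a genome; real use would read a sequence file.
    genome = "".join(rng.choice(ALPHABET) for _ in range(2000))
    first = compositional_spectrum(genome[:1000], words, max_mismatches=5)
    second = compositional_spectrum(genome[1000:], words, max_mismatches=5)
    print(spearman(first, second))
```

The demo uses a uniformly random synthetic sequence, so its two halves carry no shared compositional signal; the similarity between stretches of the same genome described in the abstract would only be expected on real genomic data. The brute-force loop costs time proportional to sequence length times word count times word length, which is adequate for a sketch but not for whole genomes.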
Original language | English |
---|---|
Pages (from-to) | 73-89 |
Number of pages | 17 |
Journal | Acta Biotheoretica |
Volume | 51 |
Issue number | 2 |
State | Published - 2003 |
Keywords
- DNA linguistics
- Rank correlation
- Sequence analysis
- Statistical geometry
ASJC Scopus subject areas
- General Biochemistry, Genetics and Molecular Biology
- Philosophy
- General Environmental Science
- General Agricultural and Biological Sciences
- Applied Mathematics