Abstract
In this paper we propose a natural approach to characterizing genomic sequences, based on occurrences of fixed-length words (strings over the alphabet {A,C,G,T}) drawn from a sufficiently large set W of words that are, in the general case, arbitrary. Under this approach, any genomic sequence can be characterized by a histogram of the frequencies with which words from W match it imperfectly; this histogram is called a compositional spectrum (CS). The specificity of CSs is manifest in a reasonable similarity of spectra obtained from different stretches of the same genome and, simultaneously, in a broad range of dissimilarities between the spectra of different genomes. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.
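To make the idea of a compositional spectrum concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how imperfect matching is defined, so the sketch assumes a match means a Hamming distance of at most `max_mismatches` between a word and a sequence window of the same length. The function names `hamming` and `compositional_spectrum`, and the normalization of counts to relative frequencies, are likewise illustrative assumptions.

```python
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(
    sequence: str,
    words: Sequence[str],
    max_mismatches: int = 1,  # assumed matching tolerance, not from the paper
) -> List[float]:
    """Return one relative frequency per word in `words`.

    A word is counted at a position when it differs from the window
    starting there in at most `max_mismatches` letters.
    """
    if not words:
        return []
    k = len(words[0])
    if any(len(w) != k for w in words):
        raise ValueError("all words must have the same fixed length")

    counts = [0] * len(words)
    # Slide a window of length k along the sequence.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        for j, word in enumerate(words):
            if hamming(window, word) <= max_mismatches:
                counts[j] += 1

    # Normalize so that spectra of sequences of different length are comparable.
    total = sum(counts)
    if total == 0:
        return [0.0] * len(words)
    return [c / total for c in counts]
```

For example, `compositional_spectrum("ACGTACGT", ["ACG", "CGT"], max_mismatches=0)` returns `[0.5, 0.5]`, since each word occurs exactly twice. This brute-force scan costs time proportional to the sequence length times the size of W times the word length, so it is only meant to illustrate the definition.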
Original language | English |
---|---|
Pages (from-to) | 447-457 |
Number of pages | 11 |
Journal | Physica A: Statistical Mechanics and its Applications |
Volume | 312 |
Issue number | 3-4 |
State | Published - 15 Sep 2002 |
Keywords
- Compositional spectra
- DNA sequences
- Imperfect matching
- Sequence comparisons
- Set of words
ASJC Scopus subject areas
- Statistics and Probability
- Condensed Matter Physics