A large-scale comparison of genomic sequences: One promising approach

Research output: Contribution to journal › Review article › peer-review


We introduce a novel, linguistic-like method of genome analysis. We propose a natural approach to characterizing genomic sequences based on occurrences of fixed-length words from a predefined, sufficiently large set of words (strings over the alphabet {A, C, G, T}). The measure based on this approach is called the compositional spectrum; it is a histogram of imperfect word occurrences. Our results indicate that the compositional spectrum is an overall characteristic of a long sequence, i.e., a complete genome or an uninterrupted part of a chromosome. This property is manifested in the similarity of spectra obtained from different stretches of the same genome and, simultaneously, in a broad range of dissimilarities between the spectral representations of different genomes. Imperfect matching makes the approach highly flexible, so sets of relatively long words can be considered. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.
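The idea of a compositional spectrum can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Hamming distance as the notion of "imperfect" matching, the `max_mismatch` threshold, and the normalization to relative frequencies are all assumptions made for illustration. The keyword "Rank correlation" suggests comparing spectra by rank, so a plain Spearman coefficient is used here as one possible choice.

```python
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(seq: str, words: Sequence[str],
                           max_mismatch: int = 1) -> List[float]:
    """Histogram of imperfect occurrences of each word in `seq`.

    Assumption: an occurrence is any window within `max_mismatch`
    substitutions of the word; counts are normalized to sum to 1.
    """
    length = len(words[0])
    counts = [0] * len(words)
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for j, word in enumerate(words):
            if hamming(window, word) <= max_mismatch:
                counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(words)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for constant input; 0.0 is a convention here
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Toy word set and sequences, purely hypothetical.
    words = ["ACGT", "TTTT", "GGCC", "ATAT"]
    s1 = compositional_spectrum("ACGTACGTTTTTGGCCATATACGA", words)
    s2 = compositional_spectrum("ACGAACGTTTATGGCCATTTACGT", words)
    print(s1, s2, spearman(s1, s2))
```

Under these assumptions, two stretches of the same genome would be expected to give a coefficient closer to 1 than stretches from different genomes. Realistic use would need far longer sequences and a much larger word set than this toy example.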

Original language: English
Pages (from-to): 73-89
Number of pages: 17
Journal: Acta Biotheoretica
Issue number: 2
State: Published - 2003


Keywords

  • DNA linguistics
  • Rank correlation
  • Sequence analysis
  • Statistical geometry

ASJC Scopus subject areas

  • General Agricultural and Biological Sciences
  • General Environmental Science
  • Applied Mathematics
  • Philosophy
  • General Biochemistry, Genetics and Molecular Biology

