Abstract
A growing number of measures of sequence similarity are being based on some underlying notion of relative compressibility. Within this paradigm, similar sequences are expected to share a large number of common substrings, or subsequences, or more complex patterns or motifs, and so on. In this paper, measures of sequence similarity are introduced and studied in which patterns in a pair are considered similar if they coincide up to a preset number of mismatches, that is, within a bounded Hamming distance. It is shown here that for some such measures bounds are achievable that are slightly better than O(n^2). Preliminary experiments demonstrate the potential applicability to phylogeny and classification of these similarity measures.
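To make the idea of matching "within a bounded Hamming distance" concrete, here is a minimal sketch of one quantity of this kind: the length of the longest pair of equal-length substrings, one from each sequence, that differ in at most k positions. This is a straightforward quadratic-time baseline, not the algorithm of the paper; the function name and the diagonal-by-diagonal sliding-window strategy are assumptions made for illustration only.

```python
def longest_common_substring_k_mismatches(x: str, y: str, k: int) -> int:
    """Length of the longest pair of equal-length substrings of x and y
    whose Hamming distance is at most k.

    Illustrative baseline only: runs in O(len(x) * len(y)) time.
    """
    n, m = len(x), len(y)
    best = 0
    # Each diagonal fixes an offset d between positions in x and in y.
    for d in range(-(m - 1), n):
        i = max(d, 0)   # first position on this diagonal in x
        j = i - d       # first position on this diagonal in y
        length = min(n - i, m - j)
        left = 0        # left edge of the sliding window
        mismatches = 0  # mismatches inside the current window
        for right in range(length):
            if x[i + right] != y[j + right]:
                mismatches += 1
            # Shrink the window until it holds at most k mismatches.
            while mismatches > k:
                if x[i + left] != y[j + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best
```

With k = 0 this reduces to the ordinary longest common substring length. A similarity score could then be derived from such lengths, but how the paper normalizes or aggregates them is not stated in the abstract, so that step is left out here.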
Original language | English |
---|---|
Pages (from-to) | 76-90 |
Number of pages | 15 |
Journal | Theoretical Computer Science |
Volume | 638 |
State | Published - 25 Jul 2016 |
Bibliographical note
Funding Information: A. Apostolico was supported in part by United States–Israel Binational Science Foundation (BSF) Grants Nos. 2008217 and 2014028. C. Guerra was supported in part by United States–Israel Binational Science Foundation (BSF) Grant No. 2014028. G. Landau was partially supported by the National Science Foundation Award 0904246, Israel Science Foundation grants 347/09 and 571/14, Grants Nos. 2008217 and 2014028 from the United States–Israel Binational Science Foundation (BSF) and DFG. C. Pizzi was partially supported by PRIN No. 20122F87B2, financed by the Italian MIUR.
Publisher Copyright:
© 2016 Elsevier B.V.
Keywords
- Alignment free distances
- Binary string
- Longest common substring
- Mismatches
- Pattern matching
- String comparison
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science (all)