A subquadratic sequence alignment algorithm for unrestricted scoring matrices

Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson

Research output: Contribution to journal › Article › peer-review

Abstract

Given two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison-Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195-197] uses a dynamic programming matrix and compares the two strings in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics that use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(h n^2 / log n), where h ≤ 1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m′ and n′ runs, respectively, in O(m′n + n′m) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty.
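
For orientation, the classical quadratic baseline that the abstract refers to can be sketched as follows. This is only the textbook dynamic programming recurrence for global and local similarity, not the subquadratic Lempel-Ziv block algorithm of the paper. The function names, the `simple_score` scoring function, and the gap value of -1 are illustrative assumptions; any scoring function with an additive (per-symbol) gap penalty could be substituted.

```python
from typing import Callable

Score = Callable[[str, str], float]


def simple_score(x: str, y: str) -> float:
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def global_similarity(a: str, b: str, score: Score, gap: float) -> float:
    """Global similarity via the classical O(|a| * |b|) dynamic program.

    `gap` is the additive penalty (usually negative) per inserted or
    deleted symbol. Only two rows of the matrix are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # a[i-1] against a gap
                cur[j - 1] + gap,                         # b[j-1] against a gap
            )
        prev = cur
    return prev[-1]


def local_similarity(a: str, b: str, score: Score, gap: float) -> float:
    """Local similarity: same recurrence, but every cell is floored at 0
    and the answer is the maximum over all cells."""
    best = 0.0
    prev = [0.0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                0.0,
                prev[j - 1] + score(a[i - 1], b[j - 1]),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Both functions fill every cell of the dynamic programming matrix exactly once, which is the O(n^2) cost that the block decomposition described in the abstract is designed to reduce.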

Original language: English
Pages (from-to): 1654-1673
Number of pages: 20
Journal: SIAM Journal on Computing
Volume: 32
Issue number: 6
State: Published - Sep 2003

Keywords

  • Alignment
  • Dynamic programming
  • Run length
  • Text compression

ASJC Scopus subject areas

  • General Computer Science
  • General Mathematics
