An algorithm for approximate tandem repeats

Gad M. Landau, Jeanette P. Schmidt

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A perfect tandem repeat within a string S is a substring r = r_1 … r_{2l} of S for which r_1 … r_l = r_{l+1} … r_{2l}. An approximate tandem repeat is a substring r = r_1 … r_{l1} … r_l for which r_1 … r_{l1} and r_{l1+1} … r_l are similar. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k, our algorithm reports all locally optimal approximate repeats r = ūȗ for which the Hamming distance of ū and ȗ is at most k in O(nk log(n/k)) time, or all those for which the edit distance of ū and ȗ is at most k in O(nk log k log n) time.
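
As a point of reference for the k-mismatch definition in the abstract, the following is a minimal brute-force sketch in Python. It is not the authors' O(nk log(n/k)) algorithm and it does not filter for locally optimal repeats; it simply enumerates every substring whose two equal-length halves differ in at most k positions, in O(n^2) time. The function name `hamming_tandem_repeats` and the `(start, half_length)` output format are assumptions made for illustration.

```python
def hamming_tandem_repeats(s: str, k: int) -> list[tuple[int, int]]:
    """Brute-force baseline (hypothetical helper, not the paper's method).

    Returns (start, half_length) for every substring
    s[start : start + 2 * half_length] whose first and second halves
    differ in at most k positions.
    """
    n = len(s)
    results = []
    for half in range(1, n // 2 + 1):
        # Mismatches between s[i] and s[i + half] for the window at start 0.
        mism = sum(s[i] != s[i + half] for i in range(half))
        if mism <= k:
            results.append((0, half))
        for start in range(1, n - 2 * half + 1):
            # Slide the window one step right: drop the comparison that
            # leaves on the left, add the one that enters on the right.
            mism -= s[start - 1] != s[start - 1 + half]
            mism += s[start + half - 1] != s[start + 2 * half - 1]
            if mism <= k:
                results.append((start, half))
    return results
```

For example, `hamming_tandem_repeats("abcabd", 1)` includes `(0, 3)`, since the halves "abc" and "abd" differ in one position, while with k = 0 the same string yields no repeats.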

Original language: English
Title of host publication: Combinatorial Pattern Matching - 4th Annual Symposium, CPM 1993, Proceedings
Editors: Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber
Publisher: Springer Verlag
Pages: 120-133
Number of pages: 14
ISBN (Print): 9783540567646
DOIs
State: Published - 1993
Externally published: Yes
Event: Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2017 and 16th International Workshop on Intuitionistic Fuzzy Sets and Generalized Nets, IWIFSGN 2017 - Warsaw, Poland
Duration: 11 Sep 2017 to 15 Sep 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 684 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2017 and 16th International Workshop on Intuitionistic Fuzzy Sets and Generalized Nets, IWIFSGN 2017
Country/Territory: Poland
City: Warsaw
Period: 11/09/17 to 15/09/17

Bibliographical note

Funding Information:
The perfect tandem repeat problem is a well-studied problem. Main and Lorentz [ML-84] present an O(n log n) algorithm which reports all perfect tandem repeats, and Apostolico [Ap-92] describes an optimal speed-up parallel algorithm for the problem. Motivations for the exact repeat problem can be found in research in formal languages (see a survey in [ML-85]).

Important motivations for the approximate tandem repeat problem are found in different areas. In molecular biology, tandem repeats play an important role in both DNA and protein sequences. At the DNA level they act as "hot spots" that enable these regions to conform more rapidly to environmental changes. Such repeats are also frequent in bacterial proteins, where their function is less understood. The repeats in these applications are not exact.

One can use different criteria to measure the similarity of the repeats. In this paper we consider two simple measures of similarity. While these measures are suitable for several of the above applications, they also lend themselves to the design of fast algorithms. Given a string S and an integer k, the algorithm finds all nonempty substrings r = ūȗ for which: (i) the Hamming distance of ū and ȗ is at most k; or (ii) the edit distance of ū and ȗ is at most k. The Hamming distance of u and v is defined as the number of substitutions necessary to get v from u (u and v must be of the same length). The edit distance, as defined by Levenshtein [L-66], is the minimum number of deletions in u, substitutions, or insertions in v necessary to get v from u. In the case of the Hamming distance

* e-mail: landau@pucs2.poly.edu. Partially supported by the New York State Science and Technology Foundation Center for Advanced Technology.
** e-mail: jps@pucs4.poly.edu. Partially supported by NSF grant CCR-9110255 and the New York State Science and Technology Foundation Center for Advanced Technology.
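
The two similarity measures defined in the excerpt above can be sketched in Python as follows. This is a minimal illustration using the textbook dynamic program for the edit distance, not code from the paper; the function names `hamming_distance` and `edit_distance` are assumptions chosen for this example.

```python
def hamming_distance(u: str, v: str) -> int:
    """Number of substitutions needed to turn u into v (equal lengths only)."""
    if len(u) != len(v):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def edit_distance(u: str, v: str) -> int:
    """Minimum number of deletions, insertions, and substitutions turning u into v.

    Standard O(|u| * |v|) dynamic program, keeping one row at a time.
    """
    # prev[j] = edit distance between the empty prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]  # distance between u[:i] and the empty prefix of v
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from u
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute a by b, or match
            ))
        prev = curr
    return prev[-1]
```

For instance, `hamming_distance("karolin", "kathrin")` and `edit_distance("kitten", "sitting")` both evaluate to 3.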

Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 1993.

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
