A perfect tandem repeat within a string $S$ is a substring $r = r_1 \ldots r_{2l}$ of $S$ for which $r_1 \ldots r_l = r_{l+1} \ldots r_{2l}$. An approximate tandem repeat is a substring $r = r_1 \ldots r_{l_1} \ldots r_l$ for which $r_1 \ldots r_{l_1}$ and $r_{l_1+1} \ldots r_l$ are similar. In this paper we consider two criteria of similarity: the Hamming distance ($k$ mismatches) and the edit distance ($k$ differences). For a string $S$ of length $n$ and an integer $k$, our algorithm reports all locally optimal approximate repeats r = ūȗ for which the Hamming distance of ū and ȗ is at most $k$ in $O(nk \log(n/k))$ time, or all those for which the edit distance of ū and ȗ is at most $k$ in $O(nk \log k \log n)$ time.
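To make the k-mismatch definition concrete, here is a minimal brute-force sketch. It is not the algorithm of the paper: it runs in roughly cubic time and reports every qualifying repeat rather than only the locally optimal ones. The function names `hamming` and `k_mismatch_tandem_repeats` are hypothetical, chosen for illustration only.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which the equal-length strings u and v differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def k_mismatch_tandem_repeats(s: str, k: int) -> list:
    """Return (start, half_length) for every substring s[start:start + 2*l]
    whose two halves differ in at most k positions.

    Naive enumeration for illustration; k = 0 yields the perfect repeats.
    """
    n = len(s)
    hits = []
    for l in range(1, n // 2 + 1):           # length of each half
        for i in range(n - 2 * l + 1):       # start of the candidate repeat
            if hamming(s[i:i + l], s[i + l:i + 2 * l]) <= k:
                hits.append((i, l))
    return hits
```

For example, `k_mismatch_tandem_repeats("abab", 0)` finds only the perfect repeat `abab`, reported as `(0, 2)`, while allowing one mismatch also admits every pair of adjacent characters.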
|Title of host publication||Combinatorial Pattern Matching - 4th Annual Symposium, CPM 1993, Proceedings|
|Editors||Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber|
|Number of pages||14|
|State||Published - 1993|
|Event||Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2017 and 16th International Workshop on Intuitionistic Fuzzy Sets and Generalized Nets, IWIFSGN 2017 - Warsaw, Poland|
Duration: 11 Sep 2017 → 15 Sep 2017
|Name||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
Bibliographical note
Funding Information:
The perfect tandem repeat problem is well studied. Main and Lorentz [ML-84] present an $O(n \log n)$ algorithm that reports all perfect tandem repeats, and Apostolico [Ap-92] describes an optimal speed-up parallel algorithm for the problem. Motivations for the exact repeat problem can be found in research in formal languages (see a survey in [ML-85]).

Important motivations for the approximate tandem repeat problem are found in different areas. In molecular biology, tandem repeats play an important role in both DNA and protein sequences. At the DNA level they act as "hot spots" that enable these regions to more rapidly conform to environmental changes. Such repeats are also frequent in bacterial proteins, where their function is less understood. The repeats in these applications are not exact.

One can use different criteria to measure the similarity of the repeats. In this paper we consider two simple measures of similarity. While these measures are suitable for several of the above applications, they also lend themselves to the design of fast algorithms. Given a string $S$ and an integer $k$, the algorithm finds all non-empty substrings r = ūȗ for which: (i) the Hamming distance of ū and ȗ is at most $k$; or (ii) the edit distance of ū and ȗ is at most $k$. The Hamming distance of $u$ and $v$ is defined as the number of substitutions necessary to get $v$ from $u$ ($u$ and $v$ must be of the same length). The edit distance, as defined by Levenshtein [L-66], is the minimum number of deletions in $u$, substitutions, or insertions in $v$ necessary to get $v$ from $u$. In the case of the Hamming distance …

* e-mail: firstname.lastname@example.org. Partially supported by the New York State Science and Technology Foundation Center for Advanced Technology.
** e-mail: email@example.com. Partially supported by NSF grant CCR-9110255 and the New York State Science and Technology Foundation Center for Advanced Technology.
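The edit distance defined in the introduction above can be computed with the standard quadratic dynamic program, sketched below. This is a textbook illustration under the unit-cost assumption (each deletion, insertion, or substitution costs 1), not the faster method the paper develops. The names `edit_distance` and `is_k_difference_repeat` are hypothetical.

```python
def edit_distance(u: str, v: str) -> int:
    """Minimum number of deletions, insertions and substitutions
    (unit cost each) needed to turn u into v."""
    # prev[j] holds the distance between the processed prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]                                 # u[:i] versus the empty string
        for j, b in enumerate(v, start=1):
            cur.append(min(
                prev[j] + 1,                      # delete a from u
                cur[j - 1] + 1,                   # insert b
                prev[j - 1] + (a != b),           # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def is_k_difference_repeat(r: str, split: int, k: int) -> bool:
    """True if the two parts r[:split] and r[split:] are within edit
    distance k, i.e. r is an approximate tandem repeat for that split."""
    return edit_distance(r[:split], r[split:]) <= k
```

Unlike the Hamming case, the two parts may have different lengths: `is_k_difference_repeat("abcac", 3, 1)` holds because deleting `b` from `abc` gives `ac`.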
© Springer-Verlag Berlin Heidelberg 1993.
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science (all)