The method of N-grams in large-scale clustering of DNA texts

Z. Volkovich, V. Kirzhner, A. Bolshoy, E. Nevo, A. Korol

Research output: Contribution to journal › Article › peer-review

Abstract

This paper addresses the clustering of texts based on comparing their vocabularies of N-grams. In contrast to the standard N-gram approach, the proposed method counts imperfect occurrences of N-grams in a text, that is, occurrences that match up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving power of the N-gram method for DNA texts. Additionally, we discuss a scheme that combines different types of clustering techniques to verify partition quality.
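The idea of counting imperfect occurrences can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`hamming`, `exact_spectrum`, `mismatch_spectrum`), the use of Hamming distance as the mismatch measure, and the toy sequence are all assumptions made for illustration, since the abstract does not specify these details.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_spectrum(text, n):
    """Count exact occurrences of every N-gram (sliding window of length n)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def mismatch_spectrum(text, vocabulary, max_mismatches):
    """For each vocabulary word, count the windows of the text that match it
    with at most max_mismatches substitutions (assumed mismatch model)."""
    n = len(vocabulary[0])
    windows = exact_spectrum(text, n)
    return {
        word: sum(count for window, count in windows.items()
                  if hamming(word, window) <= max_mismatches)
        for word in vocabulary
    }


# Hypothetical toy input, not data from the paper.
sequence = "ACGTACGA"
words = ["ACGT", "TTTT"]
print(mismatch_spectrum(sequence, words, 0))  # exact matching only
print(mismatch_spectrum(sequence, words, 1))  # one mismatch allowed
```

With zero mismatches allowed, the sketch reduces to ordinary N-gram counting. Allowing one mismatch lets the window `ACGA` also count toward `ACGT`, which is the kind of relaxed matching the abstract describes.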

Original language: English
Pages (from-to): 1902-1912
Number of pages: 11
Journal: Pattern Recognition
Volume: 38
Issue number: 11
State: Published - Nov 2005

Keywords

  • Clustering
  • Compositional spectra
  • Genome comparisons
  • N-grams
  • Strings mismatching

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
