Abstract
This paper addresses techniques for clustering texts by comparing their N-gram vocabularies. In contrast to the regular N-gram approach, the proposed method counts imperfect occurrences of N-grams in a text, allowing up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving capacity of the N-gram method for DNA texts. Additionally, we discuss a scheme for the joint use of different types of clustering techniques to verify partition quality.
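The idea of counting imperfect occurrences can be illustrated with a minimal Python sketch. It assumes that a mismatch means a character substitution (Hamming distance) and that all N-grams in the vocabulary have the same length; the abstract does not specify either point. The function names `hamming` and `approximate_ngram_counts` are hypothetical, and this is not the authors' implementation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_ngram_counts(text, vocabulary, max_mismatches):
    """Count, for each N-gram in `vocabulary`, the windows of `text`
    that match it with at most `max_mismatches` substitutions.

    Assumption: all N-grams in `vocabulary` share the same length N.
    With max_mismatches = 0 this reduces to ordinary N-gram counting.
    """
    n = len(vocabulary[0])
    counts = {word: 0 for word in vocabulary}
    # Slide a window of length N over the text and compare it
    # against every N-gram in the vocabulary.
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        for word in vocabulary:
            if hamming(window, word) <= max_mismatches:
                counts[word] += 1
    return counts


exact = approximate_ngram_counts("ACGTACGA", ["ACGT"], 0)
relaxed = approximate_ngram_counts("ACGTACGA", ["ACGT"], 1)
```

In this toy example, the exact count of `ACGT` is 1, while allowing one mismatch also admits the final window `ACGA`, giving a count of 2. The resulting count vectors for different texts could then be compared with any distance measure and passed to a clustering routine; that step is left out here because the abstract does not describe it.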
| Original language | English |
|---|---|
| Pages (from-to) | 1902-1912 |
| Number of pages | 11 |
| Journal | Pattern Recognition |
| Volume | 38 |
| Issue number | 11 |
| State | Published - Nov 2005 |
Keywords
- Clustering
- Compositional spectra
- Genome comparisons
- N-grams
- Strings mismatching
ASJC Scopus subject areas
- Software
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence