Functionally related genes often appear in each other's neighborhood on the genome, although the order of the genes may not be the same. These groups or clusters of genes may have an ancient evolutionary origin or may signify some other critical phenomenon, and they may also aid in predicting gene function. Such gene clusters also help in solving the problem of local alignment of genes. Similarly, clusters of protein domains, albeit appearing in different orders in the protein sequence, suggest common functionality in spite of being nonhomologous. In this paper we address the problem of automatically discovering clusters of entities, be they genes or domains: we formalize the abstract problem as a discovery problem called the πpattern problem and give an algorithm that automatically discovers the clusters of patterns in multiple data sequences. We take a model-less approach and introduce a notation for maximal patterns that drastically reduces the number of valid cluster patterns without any loss of information. We demonstrate the automatic pattern discovery tool on motifs in E. coli protein sequences.
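To make the discovery problem concrete, the following is a minimal illustrative sketch in Python, not the algorithm given in the paper. It assumes a simplified setting: a cluster is a set of distinct entities that occurs as a contiguous window, in any order, in at least a given number of sequences. The function name `find_permutation_patterns` and its parameters `size` and `min_support` are hypothetical, and the sketch handles neither repeated entities within a cluster nor the maximal-pattern notation.

```python
from collections import defaultdict


def find_permutation_patterns(sequences, size, min_support):
    """Naive sketch (assumed simplification, not the paper's method).

    Returns a dict mapping each set of `size` distinct entities to the
    sorted indices of the sequences in which those entities occur
    contiguously, in any order. Only sets found in at least
    `min_support` sequences are kept.
    """
    occurrences = defaultdict(set)
    for idx, seq in enumerate(sequences):
        # Slide a window of length `size` over the sequence.
        for start in range(len(seq) - size + 1):
            members = frozenset(seq[start:start + size])
            # Skip windows containing a repeated entity.
            if len(members) == size:
                occurrences[members].add(idx)
    return {
        pattern: sorted(ids)
        for pattern, ids in occurrences.items()
        if len(ids) >= min_support
    }


# Toy data: the entities a, b, c are adjacent in all three sequences,
# but appear in a different order each time.
toy_sequences = [list("abcde"), list("xcbay"), list("bacz")]
clusters = find_permutation_patterns(toy_sequences, size=3, min_support=3)
print(clusters)
```

On the toy data, the only reported cluster is the set {a, b, c}. Enumerating windows for every size in this way is quadratic per sequence and reports many overlapping clusters, which is the kind of redundancy a notation for maximal patterns is meant to reduce.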
Bibliographical note
Funding Information:
This project was supported by the Vice-Chancellor for Research & Technology of Iran University of Medical Sciences (grant No. 33171-02-97), Tehran, Iran. The authors would also like to thank the Neurobiology Research Center (grant No. 97-6-NBC) and the Neuroscience Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran, for their valuable cooperation.
Keywords
- Combinatorial algorithms on words
- Data mining
- Design and analysis of algorithms
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science (all)