Abstract
Functionally related genes often appear in each other's neighborhood on the genome; however, the order of the genes may not be the same. These groups or clusters of genes may have an ancient evolutionary origin or may signify some other critical phenomenon, and may also aid in function prediction of genes. Such gene clusters also help solve the problem of local alignment of genes. Similarly, clusters of protein domains, albeit appearing in different orders in the protein sequence, suggest common functionality in spite of being nonhomologous. In this paper, we address the problem of automatically discovering clusters of entities, be they genes or domains: we formalize the abstract problem as a discovery problem called the π pattern problem and give an algorithm that automatically discovers the clusters of patterns in multiple data sequences. We take a model-less approach and introduce a notation for maximal patterns that drastically reduces the number of valid cluster patterns without any loss of information. We demonstrate the automatic pattern discovery tool on motifs in E. coli protein sequences.
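To make the notion of an order-independent cluster concrete, the following is a minimal brute-force sketch, not the algorithm of the paper. It assumes a simplified reading of the problem: a pattern is a multiset of symbols of fixed size that occurs as a contiguous window, in any order, in at least a given number of sequences. The function name `naive_pi_patterns` and the parameters `size` and `quorum` are hypothetical, and the sketch does not attempt the maximal-pattern notation mentioned in the abstract.

```python
from collections import defaultdict


def naive_pi_patterns(sequences, size, quorum):
    """Brute-force illustration of order-independent pattern discovery.

    Assumption (not from the paper): a pattern is a multiset of `size`
    symbols that appears as a contiguous window, in any order, in at
    least `quorum` distinct sequences.
    Returns {sorted symbol tuple: [(sequence index, start offset), ...]}.
    """
    occurrences = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - size + 1):
            # Sorting the window discards the order of its symbols, so
            # every permutation of the same multiset shares one key.
            key = tuple(sorted(seq[start:start + size]))
            occurrences[key].add((seq_id, start))

    result = {}
    for key, locations in occurrences.items():
        distinct_sequences = {seq_id for seq_id, _ in locations}
        if len(distinct_sequences) >= quorum:
            result[key] = sorted(locations)
    return result


# Toy input: each string stands for a sequence of genes or domains.
toy = ["abcxd", "ycbaz", "qbacw"]
found = naive_pi_patterns(toy, size=3, quorum=3)
print(found)
```

On the toy input, the only cluster of size 3 that appears in all three sequences is {a, b, c}, which occurs as `abc`, `cba`, and `bac`. The sketch enumerates every window of one fixed size, so it reports many overlapping patterns as the size varies; reducing that redundancy is what a maximality notation is for.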
Original language | English |
---|---|
Pages (from-to) | 1050-1060 |
Number of pages | 11 |
Journal | Journal of Computational Biology |
Volume | 11 |
Issue number | 6 |
State | Published - 2004 |
Keywords
- Clusters
- Combinatorial algorithms on words
- Data mining
- Design and analysis of algorithms
- Discovery
- Motifs
- Patterns
ASJC Scopus subject areas
- Modeling and Simulation
- Molecular Biology
- Genetics
- Computational Mathematics
- Computational Theory and Mathematics