Incorporating homologues into sequence embeddings for protein analysis

Eleazar Eskin, Sagi Snir

Research output: Contribution to journal › Article › peer-review

Abstract

Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.

Original language: English
Pages (from-to): 717-738
Number of pages: 22
Journal: Journal of Bioinformatics and Computational Biology
Volume: 5
Issue number: 3
State: Published - Jun 2007
Externally published: Yes

Bibliographical note

Funding Information:
E. Eskin is partially supported by National Science Foundation Grant No. 0513612 and National Institutes of Health Grant No. 1K25HL080079.

Keywords

  • Kernel methods
  • Protein classification
  • Sequence alignment

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
