Hierarchical decision tree induction in distributed genomic databases

Amir Bar-Or, Daniel Keren, Assaf Schuster, Ran Wolff

Research output: Contribution to journal › Article › peer-review


Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and bioinformatics, by the emergence of systems that automatically analyze these databases, and by the expectation that these databases will soon contain large amounts of high-dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree that a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm exploits the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with the size of the data set, the dimensionality of the data, and the size of the distributed system.
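The abstract's starting point, that a decision tree can be induced from statistical data rather than from raw records, can be illustrated with a minimal sketch. It is not the paper's algorithm: here every site sends its full table of (attribute value, class label) counts for every attribute, which is the communication-heavy baseline the paper sets out to improve on by sending only a fraction of such statistics. The site data, attribute names (`snp1`, `snp2`), and function names are all hypothetical, and information gain is assumed as the split criterion.

```python
from collections import Counter
from math import log2

ATTRS = ["snp1", "snp2"]  # hypothetical SNP-like attributes


def local_counts(rows, attr):
    """Count (attribute value, class label) pairs held at one site."""
    return Counter((row[attr], row["label"]) for row in rows)


def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count table."""
    total = sum(class_counts.values())
    return -sum(c / total * log2(c / total) for c in class_counts.values() if c)


def info_gain(counts):
    """Information gain of a split, computed from the count table alone."""
    by_class, by_value = Counter(), {}
    for (value, label), c in counts.items():
        by_class[label] += c
        by_value.setdefault(value, Counter())[label] += c
    total = sum(by_class.values())
    remainder = sum(sum(cc.values()) / total * entropy(cc)
                    for cc in by_value.values())
    return entropy(by_class) - remainder


def best_attr_distributed(sites):
    """Each site ships only its count tables; the tables are summed."""
    gains = {}
    for attr in ATTRS:
        merged = Counter()
        for rows in sites:
            merged += local_counts(rows, attr)
        gains[attr] = info_gain(merged)
    return max(gains, key=gains.get), gains


def best_attr_centralized(rows):
    """Reference: a sequential learner that sees all records in one place."""
    gains = {attr: info_gain(local_counts(rows, attr)) for attr in ATTRS}
    return max(gains, key=gains.get), gains


# Toy data for two hypothetical sites.
site_a = [
    {"snp1": 0, "snp2": 0, "label": "case"},
    {"snp1": 0, "snp2": 1, "label": "case"},
    {"snp1": 1, "snp2": 0, "label": "control"},
]
site_b = [
    {"snp1": 1, "snp2": 1, "label": "control"},
    {"snp1": 0, "snp2": 1, "label": "case"},
    {"snp1": 1, "snp2": 0, "label": "case"},
]

dist_best, dist_gains = best_attr_distributed([site_a, site_b])
cent_best, cent_gains = best_attr_centralized(site_a + site_b)
print(dist_best, cent_best)
```

Because the summed count tables are identical to the tables computed on the pooled records, both learners choose the same split attribute at this node. The cost of this baseline grows with the number of attributes, which is why it becomes expensive on high-dimensional genomic data.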

Original language: English
Pages (from-to): 1138-1151
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Issue number: 8
State: Published - Aug 2005


Keywords

  • Classification
  • Data mining
  • Decision trees
  • Distributed algorithms

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

