Decision tree induction in high dimensional, hierarchically distributed databases

Amir Bar-Or, Assaf Schuster, Ran Wolff, Daniel Keren

Research output: Contribution to conference > Paper > peer-review

Abstract

Classification based on decision trees is an important problem in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms such as federated and peer-to-peer databases are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of high dimensional databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, and by the vision that these databases are soon to contain large amounts of genomic data, characterized by its high dimensionality. Current decision tree algorithms would require high communication bandwidth when executed on such data, and that bandwidth is not likely to exist in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree that a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm utilizes the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99%. Scalability tests show that the algorithm scales well with the size of the dataset, the dimensionality of the data, and the size of the distributed system.
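To make the communication problem concrete, the following is a minimal sketch of the naive baseline the abstract contrasts against, not the paper's algorithm. It assumes each site ships a full table of (attribute, value, class) counts for every attribute and that a coordinator sums these tables and picks the split by information gain. The function names, the toy data, and the choice of information gain as the split criterion are all illustrative assumptions.

```python
from collections import Counter
from math import log2


def local_counts(rows, labels):
    """Per-site statistics: how often each (attribute, value, class) occurs."""
    counts = Counter()
    for row, label in zip(rows, labels):
        for attr, value in enumerate(row):
            counts[(attr, value, label)] += 1
    return counts


def entropy(class_counts):
    """Shannon entropy (bits) of a collection of class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c)


def best_split(merged, n_attrs):
    """Return the attribute index with the highest information gain."""
    # Every record contributes exactly one count to attribute 0,
    # so its entries give the overall class distribution.
    class_totals = Counter()
    for (attr, _value, label), c in merged.items():
        if attr == 0:
            class_totals[label] += c
    n = sum(class_totals.values())
    base = entropy(class_totals.values())

    gains = []
    for a in range(n_attrs):
        by_value = {}
        for (attr, value, label), c in merged.items():
            if attr == a:
                by_value.setdefault(value, Counter())[label] += c
        remainder = sum(
            sum(cc.values()) / n * entropy(cc.values())
            for cc in by_value.values()
        )
        gains.append(base - remainder)
    return max(range(n_attrs), key=gains.__getitem__)


# Hypothetical toy data: two sites, two binary attributes, binary class.
site_a = ([(0, 1), (1, 1), (0, 0)], [0, 1, 0])
site_b = ([(1, 0), (1, 1), (0, 1)], [1, 1, 0])

# Each site sends its whole count table; the coordinator adds them up.
merged = local_counts(*site_a) + local_counts(*site_b)
chosen = best_split(merged, n_attrs=2)
```

Because the summed counts equal the counts computed over the pooled data, the coordinator chooses the same split a sequential learner would. The cost is that every site sends one table per attribute at every tree node, so traffic grows linearly with dimensionality, which is the overhead the abstract says the proposed algorithm avoids by sending only a fraction of these statistics.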

Original language: English
Pages: 466-470
Number of pages: 5
State: Published - 2005
Event: 5th SIAM International Conference on Data Mining, SDM 2005 - Newport Beach, CA, United States
Duration: 21 Apr 2005 to 23 Apr 2005

Conference

Conference: 5th SIAM International Conference on Data Mining, SDM 2005
Country/Territory: United States
City: Newport Beach, CA
Period: 21/04/05 to 23/04/05

Keywords

  • Classification
  • Data mining
  • Decision trees
  • Distributed algorithms
  • High dimension data

ASJC Scopus subject areas

  • Software
