Abstract
Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms such as federated and peer-to-peer databases are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of high-dimensional databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, and by the vision that these databases are soon to contain large amounts of genomic data, characterized by its high dimensionality. Current decision tree algorithms would require high communication bandwidth when executed on such data, and such bandwidth is not likely to exist in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree that a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm utilizes the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99%. Scalability tests show that the algorithm scales well with the size of the dataset, the dimensionality of the data, and the size of the distributed system.
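To make the communication problem concrete, the following is a minimal sketch of the naive baseline that the abstract argues against, not the paper's algorithm. Each site sends its full (attribute value, class label) counts for every attribute, and a coordinator sums them to pick the split a sequential learner would pick on the pooled data. The function names, the use of information gain as the split criterion, and the toy SNP records are all assumptions made for illustration; the abstract does not say how the algorithm decides which statistics can be withheld, so that step is not sketched.

```python
from collections import Counter
from math import log2

def local_counts(rows, attr):
    """Per-site statistics: counts of (attribute value, class label) pairs."""
    return Counter((row[attr], row["label"]) for row in rows)

def merge_counts(count_list):
    """Coordinator step: sum the count tables received from all sites."""
    total = Counter()
    for counts in count_list:
        total.update(counts)
    return total

def entropy(label_counts):
    n = sum(label_counts.values())
    return -sum((c / n) * log2(c / n) for c in label_counts.values() if c)

def info_gain(counts):
    """Information gain of splitting on an attribute, from merged counts only."""
    labels = Counter()
    by_value = {}
    for (value, label), c in counts.items():
        labels[label] += c
        by_value.setdefault(value, Counter())[label] += c
    n = sum(labels.values())
    remainder = sum(sum(vc.values()) / n * entropy(vc) for vc in by_value.values())
    return entropy(labels) - remainder

def best_attribute(sites, attrs):
    """Naive baseline: every site ships counts for every attribute."""
    gains = {a: info_gain(merge_counts(local_counts(s, a) for s in sites))
             for a in attrs}
    return max(gains, key=gains.get), gains

# Hypothetical toy data: two sites holding records with two SNP attributes.
site_a = [
    {"snp1": 0, "snp2": 0, "label": "control"},
    {"snp1": 0, "snp2": 1, "label": "control"},
    {"snp1": 1, "snp2": 0, "label": "case"},
]
site_b = [
    {"snp1": 1, "snp2": 1, "label": "case"},
    {"snp1": 0, "snp2": 0, "label": "control"},
    {"snp1": 1, "snp2": 1, "label": "case"},
]
best, gains = best_attribute([site_a, site_b], ["snp1", "snp2"])
print(best, gains)
```

Because the merged counts equal the counts of the pooled data, the chosen split matches the centralized one. The cost is that the number of messages grows with the number of attributes, which is the overhead the abstract reports reducing for high-dimensional data.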
Original language | English |
---|---|
Pages | 466-470 |
Number of pages | 5 |
State | Published - 2005 |
Event | 5th SIAM International Conference on Data Mining, SDM 2005 - Newport Beach, CA, United States; Duration: 21 Apr 2005 → 23 Apr 2005 |
Conference
Conference | 5th SIAM International Conference on Data Mining, SDM 2005 |
---|---|
Country/Territory | United States |
City | Newport Beach, CA |
Period | 21/04/05 → 23/04/05 |
Keywords
- Classification
- Data mining
- Decision trees
- Distributed algorithms
- High dimension data
ASJC Scopus subject areas
- Software