Abstract
In many scientific tasks we are interested in finding correlations in our data. This raises several questions: how to reliably and interpretably measure correlation within a multivariate set of attributes, how to do so without making assumptions about the data distribution or the type of correlation, and how to search efficiently for the most correlated attribute sets. We answer these questions for discovery tasks with categorical data. In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, in order to obtain a reliable, interpretable, and non-parametric measure of correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that, already for small sample sizes, the estimator leads to low-regret optimization outcomes, while the algorithms are shown to be highly effective for both large and high-dimensional data. Through a case study we confirm that our discovery framework identifies interesting and meaningful correlations.
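To make the quantity in the abstract concrete, the following is a minimal sketch of a plug-in estimate of total correlation for categorical data. It is not the estimator proposed in the paper, whose exact form the abstract does not give. Two choices here are assumptions made purely for illustration: the normalization (sum of marginal entropies minus the largest one) and the chance correction (subtracting a Monte Carlo permutation baseline). All function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Plug-in (empirical) Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(columns):
    """Plug-in total correlation: sum of marginal entropies minus the joint entropy."""
    joint = list(zip(*columns))
    return sum(entropy(col) for col in columns) - entropy(joint)


def normalized_total_correlation(columns):
    """Total correlation scaled to [0, 1].

    ASSUMPTION: the normalizer is sum(H(X_i)) - max(H(X_i)), an upper bound
    on total correlation; the paper may normalize differently.
    """
    marginals = [entropy(col) for col in columns]
    upper = sum(marginals) - max(marginals)
    if upper <= 0:
        return 0.0  # at most one non-constant attribute: nothing to correlate
    return total_correlation(columns) / upper


def chance_corrected(columns, n_permutations=200, seed=0):
    """Observed score minus its average under independently shuffled columns.

    ASSUMPTION: a Monte Carlo permutation baseline stands in for the
    correction term; it is illustrative, not the paper's estimator.
    """
    rng = random.Random(seed)
    observed = normalized_total_correlation(columns)
    baseline = 0.0
    for _ in range(n_permutations):
        # Keep the first column fixed and permute the others to break dependence.
        shuffled = [list(columns[0])] + [
            rng.sample(list(col), len(col)) for col in columns[1:]
        ]
        baseline += normalized_total_correlation(shuffled)
    return observed - baseline / n_permutations
```

With two identical binary columns the normalized score is 1, and with two columns that form a balanced 2×2 design it is 0. The correction matters mainly for small samples or attributes with many categories, where the plug-in score of truly independent attributes is noticeably above zero.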
Original language | English |
---|---|
Title of host publication | Proceedings - 19th IEEE International Conference on Data Mining, ICDM 2019 |
Editors | Jianyong Wang, Kyuseok Shim, Xindong Wu |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1252-1257 |
Number of pages | 6 |
ISBN (Electronic) | 9781728146034 |
State | Published - Nov 2019 |
Externally published | Yes |
Event | 19th IEEE International Conference on Data Mining, ICDM 2019 - Beijing, China. Duration: 8 Nov 2019 → 11 Nov 2019 |
Publication series
Name | Proceedings - IEEE International Conference on Data Mining, ICDM |
---|---|
Volume | 2019-November |
ISSN (Print) | 1550-4786 |
Conference
Conference | 19th IEEE International Conference on Data Mining, ICDM 2019 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 8 Nov 2019 → 11 Nov 2019 |
Bibliographical note
Publisher Copyright: © 2019 IEEE.
Keywords
- Branch-and-bound
- Information theory
- Knowledge discovery
- Optimization
- Total correlation
ASJC Scopus subject areas
- General Engineering