TY - GEN
T1 - Efficient discovery of interesting patterns based on strong closedness
AU - Boley, Mario
AU - Horváth, Tamás
AU - Wrobel, Stefan
PY - 2009
Y1 - 2009
N2 - Finding patterns that are interesting to a user in a certain application context is one of the central goals of Data Mining research. Regarding all patterns above a certain frequency threshold as interesting is one way of defining interestingness. In this paper, however, we argue that in many applications, a different notion of interestingness is required in order to be able to capture "long", and thus particularly informative, patterns that are correspondingly of low frequency. To identify such patterns, our proposed measure of interestingness is based on the degree or strength of closedness of the patterns. We show that (a) indeed this definition selects long interesting patterns that are difficult to identify with frequency-based approaches, and (b) that it selects patterns that are robust against noise and/or dynamic changes. We prove that the family of interesting patterns proposed here forms a closure system and use the corresponding closure operator to design a mining algorithm listing these patterns in amortized quadratic time. In particular, for non-sparse datasets its time complexity is O(nm) per pattern, where n denotes the number of items and m the size of the database. This is equal to the best known time bound for listing ordinary closed frequent sets, which is a special case of our problem. We also report empirical results with real-world datasets.
UR - http://www.scopus.com/inward/record.url?scp=72749092432&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:72749092432
SN - 9781615671090
T3 - Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics
SP - 997
EP - 1008
BT - Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics 133
T2 - 9th SIAM International Conference on Data Mining 2009, SDM 2009
Y2 - 30 April 2009 through 2 May 2009
ER -