The goal of a threshold query is to detect all objects whose score exceeds a given threshold. This type of query is used in many settings, such as data mining, event triggering, and top-k selection. Often, threshold queries are performed over distributed data. Given database relations that are distributed over many nodes, an object's score is computed by aggregating the value of each attribute, applying a given scoring function over the aggregation, and thresholding the function's value. However, joining all the distributed relations to a central database might incur prohibitive overheads in bandwidth, CPU, and storage accesses. Efficient algorithms required to reduce these costs exist only for monotonic aggregation threshold queries and certain specific scoring functions. We present a novel approach for efficiently performing general distributed threshold queries. To the best of our knowledge, this is the first solution to the problem of performing such queries with general scoring functions. We first present a solution for monotonic functions, and then introduce a technique to solve for other functions by representing them as a difference of monotonic functions. Experiments with real-world data demonstrate the method's effectiveness in achieving low communication and access costs.
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Computer Science (all)