Abstract
Social network researchers have been tackling community detection and community search for over a decade. Detecting communities (small groups of people who know and interact with each other) has numerous applications, from marketing and computational advertising to homeland security. By now, the problem can be considered mostly solved, in either its unsupervised form (community detection) or its semi-supervised form (community search). In our quest to answer general, and very exciting, questions such as "What are people up to?", "What do they care about?", and "What are they discussing?", we move beyond detecting communities to circumscribing subpopulations: large groups of people who share some common characteristics, for example activists, students, engineers, New Yorkers, or football fans. We want to know what < · · · > are talking about on Twitter, where < · · · > is any subpopulation. Initially, the subpopulation is characterized by a few representative members, who are treated as seeds in an iterative Personalized PageRank (PPR) framework that enlarges the subpopulation at each iteration. This approach immediately hits a scalability limitation, which we overcome by proposing the Splash PPR algorithm, inspired by Splash Belief Propagation. We implement Splash PPR on Apache Spark and show its efficiency and effectiveness in extracting the Twitter stream of a subpopulation of machine learning practitioners, paving the way to distilling valuable signal out of the sea of Twitter noise.
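The seed-expansion idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the Splash PPR algorithm and not the Spark implementation: it is plain power-iteration Personalized PageRank, and the function names, the restart probability `alpha`, the number of rounds, and the toy follower graph are all assumptions made for illustration.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iterations=50):
    """Plain power-iteration PPR (illustrative, not Splash PPR).

    graph maps each node to a list of out-neighbours; seeds is a set of nodes.
    alpha is the (assumed) probability of restarting at a seed.
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs} | set(seeds)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            nbrs = graph.get(u, [])
            if nbrs:
                # Spread the non-restart mass evenly over out-neighbours.
                share = (1 - alpha) * rank[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[u] / len(seeds)
        rank = nxt
    return rank


def expand_subpopulation(graph, seeds, rounds=2, add_per_round=2):
    """Hypothetical iterative loop: top-ranked non-members join the seed set."""
    members = set(seeds)
    for _ in range(rounds):
        rank = personalized_pagerank(graph, members)
        candidates = sorted(
            (n for n in rank if n not in members),
            key=lambda n: (-rank[n], n),  # highest rank first, ties by name
        )
        members.update(candidates[:add_per_round])
    return members


# Toy graph (made up): one connected cluster around "a", one unrelated pair.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
    "x": ["y"],
    "y": ["x"],
}
toy_rank = personalized_pagerank(toy_graph, {"a"})
toy_members = expand_subpopulation(toy_graph, {"a"}, rounds=1, add_per_round=2)
```

In this sketch each round recomputes PPR over the whole graph, which is the kind of cost that becomes prohibitive at Twitter scale. That is the scalability limitation the abstract says Splash PPR is designed to overcome.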
Original language | English |
---|---|
Title of host publication | The International Workshop on Mining and Learning with Graphs (MLG), collocated with KDD |
State | Published - 2018 |