TY - GEN
T1 - Intra-firm information flow
T2 - 10th International Symposium on Intelligent Data Analysis, IDA 2011
AU - Berchenko, Yakir
AU - Daliot, Or
AU - Brueller, Nir N.
PY - 2011
Y1 - 2011
N2 - This paper endeavors to bring together two largely disparate areas of research. On one hand, text mining methods treat each document as an independent instance despite the fact that in many text domains, documents are linked and their topics are correlated. For example, web pages of related topics are often connected by hyperlinks and scientific papers from related fields are typically linked by citations. On the other hand, Social Network Analysis (SNA) typically treats edges between nodes according to "flat" attributes in binary form alone. This paper proposes a simple approach that addresses both these issues in data mining scenarios involving corpora of linked documents. According to this approach, after assigning weights to the edges between documents, based on the content of the documents associated with each edge, we apply standard SNA and network theory tools to the network. The method is tested on the Enron email corpus and successfully discovers the central people in the organization and the relevant communications between them. Furthermore, our findings suggest that due to the non-conservative nature of information, conservative centrality measures (such as PageRank) are less adequate here than non-conservative centrality measures (such as eigenvector centrality).
AB - This paper endeavors to bring together two largely disparate areas of research. On one hand, text mining methods treat each document as an independent instance despite the fact that in many text domains, documents are linked and their topics are correlated. For example, web pages of related topics are often connected by hyperlinks and scientific papers from related fields are typically linked by citations. On the other hand, Social Network Analysis (SNA) typically treats edges between nodes according to "flat" attributes in binary form alone. This paper proposes a simple approach that addresses both these issues in data mining scenarios involving corpora of linked documents. According to this approach, after assigning weights to the edges between documents, based on the content of the documents associated with each edge, we apply standard SNA and network theory tools to the network. The method is tested on the Enron email corpus and successfully discovers the central people in the organization and the relevant communications between them. Furthermore, our findings suggest that due to the non-conservative nature of information, conservative centrality measures (such as PageRank) are less adequate here than non-conservative centrality measures (such as eigenvector centrality).
KW - Natural language processing
KW - information flow
KW - social network analysis
UR - http://www.scopus.com/inward/record.url?scp=80455129903&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-24800-9_6
DO - 10.1007/978-3-642-24800-9_6
M3 - Conference contribution
AN - SCOPUS:80455129903
SN - 9783642247996
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 34
EP - 42
BT - Advances in Intelligent Data Analysis X - 10th International Symposium, IDA 2011, Proceedings
Y2 - 29 October 2011 through 31 October 2011
ER -