In this work we describe an entity resolution project performed at Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution in that it is both massively multi-source and requires multi-level entity resolution. With today's abundance of information sources, this project sets an example for multi-source resolution on a big-data scale. We discuss a set of requirements that led us to choose the MFIBlocks entity resolution algorithm to achieve the goals of the application. We also provide a machine learning approach, based on decision trees, to transform soft clusters into a ranked clustering of records representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset, highlighting the shortcomings of current methods and proposing avenues for future research in this realm.
|Title of host publication||SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data|
|Publisher||Association for Computing Machinery|
|Number of pages||13|
|State||Published - 26 Jun 2016|
|Event||2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 - San Francisco, United States|
Duration: 26 Jun 2016 → 1 Jul 2016
|Name||Proceedings of the ACM SIGMOD International Conference on Management of Data|
Bibliographical note
Publisher Copyright:
© 2016 ACM.
ASJC Scopus subject areas
- Information Systems