A method for cross-document narrative alignment of a two-hundred-sixty-million word corpus

Ben Miller, Jennifer Olive, Shakthidhar Gopavaram, Yanjun Zhao, Ayush Shrestha, Cynthia Berger

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Identifying similar narrative sections across longer documents would help identify key events within a corpus, enrich understanding of those events, provide a mechanism for organizing corpora according to their event content, and allow for bottom-up testing of theories of narrative. This paper proposes an automated method for narrative alignment across large textual corpora using techniques from natural language processing and similarity-based image segmentation. The method segments each document into a series of events, constructs sequences of abstracted representations of those events, compares pairs of sequences to generate image matrices, segments the images, identifies similar segments to discover commonly occurring narrative units, and, finally, returns the source sentences to make the clusters of narrative similarity readable. Preliminary tests of elements of this method were conducted on a small heterogeneous corpus (< 100 documents) and a moderate heterogeneous corpus (10k documents). Further implementation, as described in this position paper, is necessary to scale to the full 251k-document corpus from which the moderate corpus was drawn.
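
The comparison step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: events are modeled as sets of abstract labels, pairs of events are scored with Jaccard similarity, and the image-segmentation step is replaced by a much simpler search for diagonal runs of high similarity in the pairwise matrix. The names `Event`, `jaccard`, `similarity_matrix`, and `diagonal_runs` are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical abstracted event: a set of labels (e.g. actor type, action, place).
Event = Set[str]


def jaccard(a: Event, b: Event) -> float:
    """Jaccard similarity of two label sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(seq_a: List[Event], seq_b: List[Event]) -> List[List[float]]:
    """Pairwise similarities; rows index events in seq_a, columns events in seq_b."""
    return [[jaccard(a, b) for b in seq_b] for a in seq_a]


def diagonal_runs(
    matrix: List[List[float]], threshold: float = 0.6, min_len: int = 2
) -> List[Tuple[int, int, int]]:
    """Return (start_row, start_col, length) for each maximal diagonal run
    of cells whose similarity is at least `threshold`.

    A diagonal run means consecutive events in one document match
    consecutive events in the other, a simple stand-in for a shared
    narrative unit.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    runs = []
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j] < threshold:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and matrix[i - 1][j - 1] >= threshold:
                continue
            length = 0
            while (
                i + length < rows
                and j + length < cols
                and matrix[i + length][j + length] >= threshold
            ):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


# Toy input: two invented "documents" that share a two-event sequence.
doc_a = [
    {"person", "move", "city"},
    {"person", "attack", "building"},
    {"person", "flee", "city"},
    {"person", "speak", "crowd"},
]
doc_b = [
    {"group", "meet", "room"},
    {"person", "attack", "building"},
    {"person", "flee", "city"},
]

runs = diagonal_runs(similarity_matrix(doc_a, doc_b))
print(runs)  # [(1, 1, 2)]
```

Each returned triple gives the positions of a matched stretch in both documents, so the corresponding source sentences could be looked up by index, in the spirit of the final step the abstract describes.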

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
Editors: Feng Luo, Kemafor Ogan, Mohammed J. Zaki, Laura Haas, Beng Chin Ooi, Vipin Kumar, Sudarsan Rachuri, Saumyadipta Pyne, Howard Ho, Xiaohua Hu, Shipeng Yu, Morris Hui-I Hsiao, Jian Li
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1673-1677
Number of pages: 5
ISBN (Electronic): 9781479999255
State: Published - 22 Dec 2015
Externally published: Yes
Event: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015 - Santa Clara, United States
Duration: 29 Oct 2015 to 1 Nov 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

Conference

Conference: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015
Country/Territory: United States
City: Santa Clara
Period: 29/10/15 to 1/11/15

Bibliographical note

Publisher Copyright:
© 2015 IEEE.

Keywords

  • Computational models of narrative
  • big data
  • computational linguistics
  • text mining

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Software
