Text Alignment in the Service of Text Reuse Detection

Research output: Contribution to journal › Article › peer-review

Abstract

This study introduces a novel approach to text alignment tailored for ancient languages, with a focus on Hebrew and Aramaic, aimed at enhancing text reuse detection. Unlike previous methods, our approach integrates multiple NLP components into a specialized comparison pipeline, which is then incorporated into the Smith–Waterman algorithm. This integration enables improved alignment accuracy, particularly for historical texts characterized by fluctuations, orthographic changes, transcription variations, and word transpositions. Our key contributions include (1) a refined distance function that integrates fastText embeddings, allowing robust handling of out-of-vocabulary words; (2) a typological correction mechanism that can be integrated into automatic transcription pipelines to enhance text normalization; and (3) an evaluation of historical Hebrew texts, demonstrating an 11% improvement in the F1 score over existing approaches. These findings underscore the importance of computational methodologies in digital humanities and lay the groundwork for future multilingual extensions.
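To make the idea of a word-level Smith–Waterman alignment with a soft comparison function concrete, the following is a minimal sketch and not the authors' implementation. It assumes a character-trigram Jaccard overlap as a dependency-free stand-in for the fastText-based distance the abstract describes. Like fastText subword vectors, it gives a non-zero similarity to unseen spelling variants. The names `ngram_similarity` and `smith_waterman` and the parameters `threshold` and `gap` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

STOP, DIAG, UP, LEFT = 0, 1, 2, 3  # traceback pointers


def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams, in [0, 1].

    Assumption: a simple stand-in for an embedding-based similarity.
    """
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def smith_waterman(
    s: Sequence[str],
    t: Sequence[str],
    sim: Callable[[str, str], float] = ngram_similarity,
    threshold: float = 0.5,
    gap: float = 1.0,
) -> Tuple[float, List[Tuple[Optional[str], Optional[str]]]]:
    """Local alignment of two token sequences with a soft match score."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0.0] * cols for _ in range(rows)]   # score matrix
    P = [[STOP] * cols for _ in range(rows)]  # pointer matrix
    best, best_pos = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            # Similarity above the threshold rewards, below it penalizes.
            match = 2.0 * (sim(s[i - 1], t[j - 1]) - threshold)
            candidates = [
                (0.0, STOP),
                (H[i - 1][j - 1] + match, DIAG),
                (H[i - 1][j] - gap, UP),
                (H[i][j - 1] - gap, LEFT),
            ]
            H[i][j], P[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a STOP pointer is reached.
    i, j = best_pos
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    while i > 0 and j > 0 and P[i][j] != STOP:
        if P[i][j] == DIAG:
            pairs.append((s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif P[i][j] == UP:
            pairs.append((s[i - 1], None))
            i -= 1
        else:
            pairs.append((None, t[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    source = ["a1", "shalom", "aleichem", "b2"]
    target = ["c3", "shalom", "aleichem", "d4"]
    print(smith_waterman(source, target))
```

In this sketch the comparison function is a parameter, so an embedding-based similarity (for example, cosine similarity between word vectors) could be passed as `sim` without changing the alignment logic.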

Original language: English
Article number: 3395
Journal: Applied Sciences (Switzerland)
Volume: 15
Issue number: 6
State: Published - Mar 2025

Bibliographical note

Publisher Copyright:
© 2025 by the authors.

Keywords

  • ancient languages
  • natural language processing
  • Smith–Waterman algorithm
  • text alignment
  • text reuse detection
  • word embeddings

ASJC Scopus subject areas

  • General Materials Science
  • Instrumentation
  • General Engineering
  • Process Chemistry and Technology
  • Computer Science Applications
  • Fluid Flow and Transfer Processes
