Abstract
This study introduces a novel approach to text alignment tailored for ancient languages, with a focus on Hebrew and Aramaic, aimed at enhancing text reuse detection. Unlike previous methods, our approach integrates multiple NLP components into a specialized comparison pipeline, which is then incorporated into the Smith–Waterman algorithm. This integration improves alignment accuracy, particularly for historical texts characterized by textual fluctuations, orthographic changes, transcription variations, and word transpositions. Our key contributions are (1) a refined distance function that integrates fastText embeddings, allowing robust handling of out-of-vocabulary words; (2) a typological correction mechanism that can be integrated into automatic transcription pipelines to enhance text normalization; and (3) an evaluation on historical Hebrew texts, demonstrating an 11% improvement in F1 score over existing approaches. These findings underscore the importance of computational methodologies in digital humanities and lay the groundwork for future multilingual extensions.
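As a rough illustration of the general idea in the abstract (a word-level Smith–Waterman alignment whose substitution score comes from a soft word similarity rather than exact matching), the following minimal Python sketch may help. It is not the authors' implementation. The character-trigram Jaccard similarity is only a stand-in for the fastText-based distance function (fastText builds word vectors from character n-grams, which is what lets it score out-of-vocabulary spellings). The names `char_ngrams`, `ngram_similarity`, and `smith_waterman`, the `threshold` and `gap` values, and the sample tokens are all assumptions chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams, in [0, 1].

    Assumption: a simple stand-in for an embedding-based similarity.
    """
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def smith_waterman(
    source: List[str],
    target: List[str],
    similarity: Callable[[str, str], float] = ngram_similarity,
    threshold: float = 0.3,  # hypothetical: similarity above this counts as a match
    gap: float = 0.5,        # hypothetical gap penalty
) -> Tuple[float, List[Pair]]:
    """Best local alignment of two token lists; gaps are reported as None."""
    rows, cols = len(source) + 1, len(target) + 1
    score = [[0.0] * cols for _ in range(rows)]
    move = [[""] * cols for _ in range(rows)]  # "" marks the start of an alignment
    best, best_pos = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            # Soft substitution score: positive for similar words, negative otherwise.
            sub = similarity(source[i - 1], target[j - 1]) - threshold
            candidates = [
                (score[i - 1][j - 1] + sub, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            value, step = max(candidates, key=lambda c: c[0])
            if value > 0:  # local alignment: scores never drop below zero
                score[i][j], move[i][j] = value, step
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the alignment start is reached.
    pairs: List[Pair] = []
    i, j = best_pos
    while i > 0 and j > 0 and move[i][j]:
        if move[i][j] == "diag":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


# Illustrative tokens: the third word differs only by plene vs. defective spelling.
source = ["בראשית", "ברא", "אלהים", "את", "השמים"]
target = ["ויאמר", "בראשית", "ברא", "אלוהים", "את", "השמים", "ואת"]
best_score, aligned = smith_waterman(source, target)
for src_word, tgt_word in aligned:
    print(src_word, tgt_word)
```

With exact matching, the spelling variant in the third position would count as a mismatch. Here its trigram overlap is high enough to keep the alignment on the diagonal. Swapping `ngram_similarity` for a cosine similarity over real word embeddings would only require passing a different `similarity` callable.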
| Original language | English |
|---|---|
| Article number | 3395 |
| Journal | Applied Sciences (Switzerland) |
| Volume | 15 |
| Issue number | 6 |
| DOIs | |
| State | Published - Mar 2025 |
Bibliographical note
Publisher Copyright: © 2025 by the authors.
Keywords
- ancient languages
- natural language processing
- Smith–Waterman algorithm
- text alignment
- text reuse detection
- word embeddings
ASJC Scopus subject areas
- General Materials Science
- Instrumentation
- General Engineering
- Process Chemistry and Technology
- Computer Science Applications
- Fluid Flow and Transfer Processes