Abstract
Making ancient handwritten manuscripts accessible to the general public is challenging for several reasons. Foremost, they are handwritten. Each one is unique, so manual transcription is needed to provide enough examples for training a machine-learning-based algorithm to transcribe the handwritten text automatically. Moreover, the quality of the text varies: over time the ink faded, pages were damaged, and so forth. Furthermore, the boundaries of the textual regions on a page and the lines of text are not standard. Sometimes there are corrections above the lines, the lines are curved, there are comments and annotations in the margins, and more. A possible solution to these challenges is having a "person in the loop." However, manual correction brings with it another challenge: how to resolve disagreement between annotations (usually several corrections are considered before a decision is taken about the correct transcription). Tikkoun-Sofrim is a system that integrates automatic handwritten text recognition with manual, crowdsourced error correction. It introduces an automatic decision process that determines when to stop asking for additional transcriptions and selects the best transcription, declaring it the recommended agreed reading. The system was applied to several manuscripts of "Midrash Tanhuma," a medieval Hebrew rabbinic homiletic text, achieving a high level of success.
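The abstract does not spell out the decision rule, so the following is only a minimal sketch of what a "stop asking and pick a reading" step could look like, assuming a simple agreement threshold over the crowd's corrected transcriptions of one line. The function name `decide_reading` and the parameters `min_votes`, `min_agreement`, and `max_votes` are hypothetical and are not taken from Tikkoun-Sofrim itself.

```python
from collections import Counter


def decide_reading(transcriptions, min_votes=3, min_agreement=0.6, max_votes=7):
    """Pick a recommended reading for one line and say whether to stop asking.

    Hypothetical rule (an assumption, not the published method):
    stop once at least `min_votes` transcriptions exist and the most common
    one reaches the `min_agreement` share, or once `max_votes` is reached.
    Returns a tuple (best_reading_or_None, done).
    """
    if not transcriptions:
        return None, False

    # Count identical transcriptions after trimming surrounding whitespace.
    counts = Counter(t.strip() for t in transcriptions)
    best, best_count = counts.most_common(1)[0]

    total = len(transcriptions)
    agreed = total >= min_votes and best_count / total >= min_agreement
    exhausted = total >= max_votes
    return best, agreed or exhausted


# Two of three volunteers agree, so the line is closed with their reading.
reading, done = decide_reading(["amar rabbi", "amar rabbi", "amar rav"])
```

A real system would likely compare transcriptions at the character level and weigh the automatic recognizer's output as well; exact-match voting is used here only to keep the idea visible.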
Original language | English |
---|---|
Article number | 20 |
Journal | Journal on Computing and Cultural Heritage |
Volume | 15 |
Issue number | 2 |
DOIs | https://doi.org/10.1145/3476776 |
State | Published - Jun 2022 |
Bibliographical note
Funding Information: This project has received funding from the Israeli Ministry of Sciences and Technology, from the French Ministry of Sciences, Higher Education and Innovation, and the French Ministry of European and Foreign Affairs in the frame of the PHC-Maimonide 41146YC, and from the European Union’s Horizon 2020 Research and Innovation Program under Grant Agreement No. 871127.

Authors’ addresses: A. J. Wecker, V. Raziel-Kretzmer, M. Lavee, T. Kuflik, D. Elovits, M. Schorr, and U. Schor, University of Haifa, Haifa, Israel; emails: {ajwecker, veredrazielk}@gmail.com, mlavee@research.haifa.ac.il, tsvikak@is.haifa.ac.il, dror.elovits@gmail.com, moshe@schorr.org, uschor@gmail.com; B. Kiessling, D. S. B. Ezra, and P. Jablonski, AOrOc (UMR 8546), EPHE, PSL, Paris, France; emails: {benjamin.kiessling, daniel.stoekl}@ephe.psl.eu, pauljablonski1989@gmail.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1556-4673/2022/04-ART20 $15.00 https://doi.org/10.1145/3476776
Publisher Copyright:
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Keywords
- CATTI
- HTR
- crowd-sourcing
- handwritten text recognition
- transcription
ASJC Scopus subject areas
- Conservation
- Information Systems
- Computer Science Applications
- Computer Graphics and Computer-Aided Design