Abstract
State-of-the-art handwritten text recognition models make frequent use of deep neural networks, with recurrent and connectionist temporal classification layers, which perform recognition over sequences of characters. This architecture may lead to the model learning statistical linguistic features of the training corpus, over and above graphic features. This in turn could lead to degraded performance if the evaluation dataset language differs from the training corpus language. We present a fundamental study aiming to understand the inner workings of OCR models and further our understanding of the use of RNNs as decoders. We examine a real-world example of two graphically similar medieval documents but in different languages: rabbinical Hebrew and Judeo-Arabic. We analyze, computationally and linguistically, the cross-language performance of the models over these documents, so as to gain some insight into the implicit language knowledge the models may have acquired. We find that the implicit language model impacts the final word error by around 10%. A combined qualitative and quantitative analysis allows us to isolate manifest linguistic hallucinations. However, we show that leveraging a pretrained (Hebrew, in our case) model allows one to boost the OCR accuracy for a resource-scarce language (such as Judeo-Arabic). All our data, code, and models are openly available at https://github.com/anutkk/ilmja.
Original language | English |
---|---|
Title of host publication | Document Analysis and Recognition – ICDAR 2023 - 17th International Conference, Proceedings |
Editors | Gernot A. Fink, Rajiv Jain, Koichi Kise, Richard Zanibbi |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 147-164 |
Number of pages | 18 |
ISBN (Print) | 9783031416842 |
State | Published - 2023 |
Event | 17th International Conference on Document Analysis and Recognition, ICDAR 2023 - San José, United States. Duration: 21 Aug 2023 → 26 Aug 2023 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 14190 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 17th International Conference on Document Analysis and Recognition, ICDAR 2023 |
---|---|
Country/Territory | United States |
City | San José |
Period | 21/08/23 → 26/08/23 |
Bibliographical note
Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords
- Handwritten text recognition
- Hebrew manuscripts
- Language model
- Optical character recognition
- Transfer learning
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science