What is a text? Starting to understand the theory behind Automated Text Recognition

What is a text? A simple question with a not-so-simple answer. Coming from the scholarly editing tradition, Patrick Sahle, Professor at Albertus Magnus University of Cologne, has demonstrated in detail how much the perception, or rather the understanding, of text can differ: from a string of signs on paper to a work by a literate individual that has to be (re)constructed from several versions and prints.

To systematically analyze the different aspects of a text, Sahle started drawing the so-called ‘text-wheel’ (there is a chapter about this in his third volume on scholarly digital editions, p. 45-55; see also Sahle, Patrick: What is a Scholarly Digital Edition?, in: Matthew James Driscoll and Elena Pierazzo (eds.), Digital Scholarly Editing: Theories and Practices. Cambridge, UK: Open Book Publishers, 2016, OBP.0095, p. 20-39).

The result is a range of different entities that a text can be understood as; some of these understandings oppose each other, while others barely differ.

In order to start understanding Automated Text Recognition from a theoretical standpoint, we began discussing with Professor Sahle how, and in what form, ‘text’ is recognized in Transkribus (and more generally by recognition tools such as OCR engines). The result is our own ‘text-wheel’, drawn by Julia Sorouri.

Most importantly, text in Transkribus is understood as signs on a surface: you need facsimiles, or rather digitized images of documents, in order to perform Automated Text Recognition. Through interpretation via machine learning (or typing by a human), it is possible to produce text as it exists as a document (separated into text regions and line regions, and possibly word regions too in the future). From this point you can go on to extract text as a linguistic entity or as a work (for example by using Document Understanding technology to identify titles or marginalia), or even build upon entities in the text, understanding text as a carrier of information.
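To make these layers a little more tangible, here is a minimal sketch in Python of how one might model them. It is not the Transkribus data model or API: every class, field and function name (`Line`, `TextRegion`, `Page`, `linguistic_text`, `work_structure`), the role labels and the sample page are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    """One line region: where it sits on the image and what was read there."""
    baseline: list[tuple[int, int]]  # pixel coordinates on the facsimile (assumed format)
    text: str                        # transcription, produced by a model or typed by a human


@dataclass
class TextRegion:
    """A block of lines, optionally tagged with a structural role."""
    lines: list[Line] = field(default_factory=list)
    role: str = "paragraph"          # hypothetical labels such as "title" or "marginalia"


@dataclass
class Page:
    """Text as document: a digitized image plus the regions found on it."""
    image_file: str
    regions: list[TextRegion] = field(default_factory=list)


def linguistic_text(page: Page) -> str:
    """Text as linguistic entity: the reading text with the layout stripped away."""
    return "\n".join(line.text for region in page.regions for line in region.lines)


def work_structure(page: Page) -> dict[str, list[str]]:
    """Text as work: group the text of each region by its structural role."""
    grouped: dict[str, list[str]] = {}
    for region in page.regions:
        content = " ".join(line.text for line in region.lines)
        grouped.setdefault(region.role, []).append(content)
    return grouped


# Invented sample page, for illustration only.
page = Page("folio_001.jpg", [
    TextRegion([Line([(10, 40), (300, 42)], "Protocol of the council")], role="title"),
    TextRegion([
        Line([(10, 90), (300, 92)], "It was decided that"),
        Line([(10, 130), (300, 132)], "the bridge be repaired."),
    ]),
])

print(linguistic_text(page))
print(work_structure(page))
```

The point of the sketch is that the same recognized lines can be read in several ways: `Page` keeps the link to the surface, `linguistic_text` keeps only the wording, and `work_structure` starts to treat the text as a structured work.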

The wheel demonstrates which aspects of a text can be identified and the direction we are aiming at with the READ project. We want to provide high-quality Automated Text Recognition, but we are also thinking about how to ensure the validity and plausibility of text.

Let’s start a discussion that goes beyond the quality of text recognition and aims at a theory of Automated Text Recognition.

——–

By Dr Tobias Hodel, University of Zurich and State Archives of Zurich.
