What is a text? Starting to understand the theory behind Automated Text Recognition

November 20, 2017
HTR models, News, Transkribus

What is a text? A simple question with a not-so-simple answer. Coming from the scholarly editing tradition, Patrick Sahle, Professor at Albertus Magnus University of Cologne, has demonstrated in detail how differently text can be perceived, or rather understood: from a string of signs on paper to a work by a literate individual that has to be (re)constructed from several versions and prints.

To systematically analyze different aspects of a text, Sahle started drawing the so-called ‘text-wheel’ (there’s a chapter about this in his third volume on scholarly digital editions, pp. 45-55; see also Sahle, Patrick: What is a Scholarly Digital Edition?, in: Matthew James Driscoll and Elena Pierazzo (eds.), Digital Scholarly Editing: Theories and Practices. Cambridge, UK: Open Book Publishers, 2016. OBP.0095, pp. 20-39).

The result is a range of different entities that a text can be understood as; some of the meanings oppose each other, others do not differ much.

In order to start understanding Automated Text Recognition from a theoretical standpoint, we began discussing with Professor Sahle how, and in what form, ‘text’ is recognized in Transkribus (and in recognition tools in general, such as OCR engines). The result is our own ‘text-wheel’, drawn by Julia Sorouri.

Most importantly, text in Transkribus is understood as signs on a surface: you need facsimiles, or rather digitized images of documents, in order to perform Automated Text Recognition. Through interpretation via machine learning (or typing by a human), it is possible to produce text as it exists as a document (separated into text regions and line regions, and possibly word regions too in the future). From this point you can go on to extract text as a linguistic entity or as a work (for example by using Document Understanding technology to identify titles or marginalia), or even build upon entities in the text, understanding text as a carrier of information.
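The layering described above, from signs on a surface to text as a document and then to text as a linguistic entity, can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Page`, `TextRegion` and `TextLine`, the `kind` labels, the file name and the function `linguistic_text` are all hypothetical and do not represent the actual Transkribus data model or API.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # pixel coordinates on the digitized image


@dataclass
class TextLine:
    """Text as signs on a surface: where a line sits and what it says."""
    polygon: List[Point]
    text: str = ""  # filled in by recognition or by a human transcriber


@dataclass
class TextRegion:
    """Text as a document: lines grouped into a region of the page."""
    polygon: List[Point]
    kind: str = "paragraph"  # hypothetical labels, e.g. "heading", "marginalia"
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class Page:
    image_file: str  # the facsimile the recognition starts from
    regions: List[TextRegion] = field(default_factory=list)


def linguistic_text(page: Page, skip_kinds: Sequence[str] = ("marginalia",)) -> str:
    """Text as a linguistic entity: running text without layout information."""
    parts = []
    for region in page.regions:
        if region.kind in skip_kinds:
            continue
        parts.append(" ".join(line.text for line in region.lines if line.text))
    return "\n\n".join(part for part in parts if part)


box = [(0, 0), (100, 0), (100, 20), (0, 20)]  # placeholder outline
page = Page(
    image_file="folio_001.jpg",  # hypothetical file name
    regions=[
        TextRegion(box, "heading", [TextLine(box, "A Letter")]),
        TextRegion(box, "paragraph", [TextLine(box, "Dear Sir,"),
                                      TextLine(box, "I write to you")]),
        TextRegion(box, "marginalia", [TextLine(box, "nota bene")]),
    ],
)
print(linguistic_text(page))
```

Dropping the coordinates and the marginalia is itself an interpretive decision: the same page yields a different ‘text’ depending on which understanding of text the extraction step assumes.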

The wheel demonstrates which aspects of a text can be identified and the direction we are aiming at with the READ project. We want to provide high-quality Automated Text Recognition, but we are also thinking about how to ensure the validity and plausibility of text.

Let’s start a discussion that goes beyond the quality of text recognition and aims at a theory of Automated Text Recognition.

——–

By Dr Tobias Hodel, University of Zurich and State Archives of Zurich.



