Mastering Latin Abbreviations and Hyphenations – The Bentham and DEEDS Projects

July 1, 2021
HTR models

A collaboration between the Bentham Project at University College London and the DEEDS (Documents of Early England Data Set) Project at the University of Toronto uses Transkribus to transcribe an immense corpus of medieval charters from the 12th to the 15th century. The handwritten Latin of this period is highly distinctive and confronted the team with two interesting questions:

Could Transkribus be trained to process abbreviated Latin words consistently? Such words can represent up to half the vocabulary of medieval legal texts and hence feature in a substantial proportion of the DEEDS corpus.
Could Transkribus be made to recognise consistently the hyphenated words which span multiple lines of text (insofar as they are both in Latin and abbreviated)?

To find answers, the team first created their own dictionary of over a hundred abbreviated Latin words, each recorded in both its abbreviated and its expanded form. This was done with the help of the independent programmer Ismail Prada from Switzerland, who coded abbrevSolver-master, a Python script. Each contracted form was represented by compatible special characters that best reflect how it appears in type. The abbreviations were also categorized as prefixes, suffixes, or standalone abbreviations, which determined how the algorithm processed them.

However, the method proved problematic: several versions of the tab-separated Excel file containing the abbreviated words, with several varieties of special characters, had to be created in an attempt to get the script to function as intended. The only workaround at that point was to find and replace the abbreviated words manually, without the script. This was very time-intensive and not viable in the long run.

With Prada's help, however, the script was fixed, and a superior API script was developed that connects directly to Transkribus once it is given the collection editor's username and password and the collection ID. The new script is quicker and simpler to use. After running a basic command, the script communicates with Transkribus and applies its find-and-replace algorithm to each subcollection, replacing each term it finds from the abbreviation dictionary with its shorter equivalent and tagging it as abbreviated.

At this stage of the project, five new HTR models were created. Over the course of the project, both the WER (word error rate) and the CER (character error rate) declined in a very promising way, and the models generated after the new script was created are extremely good.
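The dictionary-driven replacement described above can be illustrated with a minimal Python sketch. It is not the abbrevSolver code and does not call Transkribus: the `Entry` type, the `abbreviate` function and the four sample entries are all hypothetical, chosen only to show how prefix, suffix and standalone abbreviations might each need a different matching rule.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    expanded: str     # full spelling as it appears in an expanded transcription
    abbreviated: str  # special character(s) standing in for the contraction
    kind: str         # "prefix", "suffix" or "standalone"


# Hypothetical sample entries; the project's real dictionary is not reproduced here.
ENTRIES = [
    Entry("per", "ꝑ", "prefix"),
    Entry("pro", "ꝓ", "prefix"),
    Entry("us", "ꝰ", "suffix"),
    Entry("et", "⁊", "standalone"),
]


def build_pattern(entry: Entry) -> re.Pattern:
    """Build a regex whose word-boundary rules depend on the entry's category."""
    core = re.escape(entry.expanded)
    if entry.kind == "prefix":
        # Start of a word, and the word must continue afterwards.
        return re.compile(rf"\b{core}(?=\w)")
    if entry.kind == "suffix":
        # End of a word, and the word must have started earlier.
        return re.compile(rf"(?<=\w){core}\b")
    # Standalone: the whole word and nothing else.
    return re.compile(rf"\b{core}\b")


def abbreviate(text: str, entries: list[Entry]) -> tuple[str, list[tuple[str, str, int]]]:
    """Replace expanded forms with abbreviated ones and log what was replaced."""
    log = []
    for entry in entries:
        text, count = build_pattern(entry).subn(entry.abbreviated, text)
        if count:
            log.append((entry.expanded, entry.abbreviated, count))
    return text, log


if __name__ == "__main__":
    result, log = abbreviate("et ego permitto omnibus", ENTRIES)
    print(result)
    print(log)
```

The returned log is only a stand-in for the tagging step; in the project itself, the tagging of abbreviations happens inside Transkribus.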
Additionally, the research team used material from Oxford University and Christ Church to further expand the ground truth and created two more models, which improved the test results on the DEEDS corpus. Along the way, obstacles such as poor image quality and the brevity of the images made development more difficult. Nevertheless, model #7 is now freely available to everyone. It was trained on more than 140,000 words, and its CER on the validation set is 0.8%. For more details about the project and the developed models, visit their website: https://blogs.ucl.ac.uk/transcribe-bentham/2021/04/20/ucl-university-of-toronto-transkribus-htr-and-medieval-latin-abbreviations/
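For readers unfamiliar with the error rates quoted in this post, the sketch below shows the usual textbook definition: the edit (Levenshtein) distance between a reference transcription and the model output, divided by the length of the reference, counted in characters for CER and in words for WER. The post does not say how Transkribus computes its figures, so the function names and the sample strings here are illustrative assumptions only.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate relative to the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("carta", "carte"))
    print(wer("sciant presentes et futuri", "sciant presentes et futuro"))
```

Under this definition, a CER of 0.8% corresponds to roughly eight character edits per thousand reference characters.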


