Unleashing the Transkribus API

September 7, 2018
HTR models, News, Success stories, Transkribus

by David Brown and Stephen Crane, Trinity College Dublin

On 30 June 1922, at the outset of the Irish Civil War, a cataclysmic explosion and fire destroyed the Public Record Office of Ireland at the Four Courts, Dublin. Flames and heat consumed seven centuries of Ireland’s recorded history, stored in a magnificent six-storey Victorian repository known as the Record Treasury. On the centenary of the 1922 blaze, the Beyond 2022 project at Trinity College Dublin will unveil Ireland’s Virtual Record Treasury—a digital reconstruction of the Public Record Office of Ireland building and its collections.

Large parts of these collections were copied prior to the fire: the work of antiquarians, historians and publicly funded projects that intended to publish the most historically significant parts of the collection as printed source material for scholars. For various reasons, only a small proportion of these huge transcription projects was ever published, but copies survive in manuscript, running to millions of pages of handwritten text. The transcriptions were made between the seventeenth and nineteenth centuries in the trained secretarial hand of the times. Most projects were entrusted to a single transcriber, usually an expert in a particular field, and some individuals transcribed up to 25,000 pages over a period of many years. With so many examples of very large quantities of text produced by a single hand, the Irish Record Office transcriptions might as well have been prepared with Transkribus in mind.

Nineteenth-century transcription of a late sixteenth-century patent roll by the Irish Record Commission for the unpublished ‘Acta Regia’. Courtesy of the Russell Library, Maynooth University: Renehan Collection, Vol. 3, p. 14.

The collections reflect the cataloguing arrangements in the original record office, and the largest sets of copies deal with topics central to the study of Irish history: the Elizabethan conquest and administration, the Plantation of Ulster, the Cromwellian occupation of Ireland, the Williamite wars and the breaking up of the great landed estates in the nineteenth century. All areas of history are covered in these transcripts, however, and the material includes early census-type records, trade records, legal judgements and a wide range of smaller thematic collections related to specific towns and cities. The digitisation is most advanced for the Cromwellian period, 1650-1659, and the scale of documents recovered surpasses that which has survived for most parts of England.

Transkribus works very well on large, relatively uniform collections such as these. Several HTR models have been prepared, each based on 15,000 words, beginning with the nineteenth-century hands and achieving, in some cases, a Character Error Rate (CER) of less than 2%! As the number of trained models increased, a separate project emerged to investigate whether the existing models could partially recognise a sample from the next set of documents, and so speed up the process of creating each subsequent set of ground truth. We decided to create a single page of ground truth for each new example and to compare it with the text generated automatically by each model in the project, in order to find the best one to work with.
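For readers unfamiliar with the measure, CER is conventionally the character-level edit distance between the ground truth and the recognised text, divided by the length of the ground truth. The sketch below, written in Python rather than the project's own Clojure, shows how one page of ground truth could be scored against the output of several models. The function names, model names and sample strings are all invented for illustration; this is not the eval-models script and it does not call the Transkribus API.

```python
from typing import Dict, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ground_truth: str, recognised: str) -> float:
    """Character Error Rate of recognised text against the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, recognised) / len(ground_truth)


def best_model(ground_truth: str,
               outputs: Dict[str, str]) -> Tuple[str, Dict[str, float]]:
    """Score every model's output and return the lowest-CER model name
    together with all the scores."""
    scores = {name: cer(ground_truth, text) for name, text in outputs.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented sample data: one line of ground truth and two model outputs.
    truth = "letters patent granted"
    outputs = {
        "model_a": "lettres patant granted",
        "model_b": "letters patent granted",
    }
    name, scores = best_model(truth, outputs)
    for model, score in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{model}: CER {score:.1%}")
    print("best:", name)
```

In practice the ground truth would be the full transcribed page and each entry in `outputs` the text one model produced for that page; whitespace and punctuation would need to be normalised the same way on both sides before scoring.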

Transkribus comprises a cross-platform client GUI, which users download and run on their local machines (Windows, Mac or Linux). The GUI communicates with a remote server over the Web. The server manages collections of documents, trains HTR models and runs models against document collections, all in response to user requests made through the GUI.

Unusually, the Transkribus project has separately published an open-source client library which the GUI uses to make requests to the server. As part of a summer project we decided to use this library as the basis for a scripting language, allowing us to write mini-programs (scripts) automating common tasks separately from the GUI, but using the same back-end services as it.

The client library as shipped is written in the Java programming language, which runs on a virtual machine known as the JVM, and which enables the client to be cross-platform. We decided to base our scripting language on Clojure, an idiomatic modern Lisp which also runs in the JVM and provides excellent Java interoperability.

Our scripting language, which we call Transkript, is also published as open source on GitHub. It does not implement all of the underlying API, just enough to enable a couple of small scripting applications: eval-models and run-ocr.

The first script compares multiple trained models associated with a collection, using the first page of a specified document. Using the GUI this would be a laborious affair since running each model takes some time. A user can run our script and return later to browse the results.

The second script is used to upload a folder of images representing pages of a typewritten document, run OCR on it, and download the text output of the OCR process.

The power of our approach is that each of these scripts took only a couple of hours to write and test, and the core of each is about a dozen lines of fluent code, which is quite comprehensible even to relatively non-technical users. The scripting language does not add any new functionality to Transkribus, but it enables dramatically increased productivity through the batch processing of large numbers of jobs. Multiple additional scripts can be employed, for example to run HTR on documents automatically once the most appropriate model has been identified by the eval-models script.

For more information on the Beyond 2022 project contact David Brown, brownd4@tcd.ie
For more information on Transkript contact Stephen Crane, jscrane@gmail.com
Transkript can be found at: https://github.com/jscrane/transkript
