Success Story
Published: 3 years ago

How to make a complete collection accessible with Transkribus. A best-practice example from the Tyrolean State Archives

From the digitization with the ScanTent, to the use of artificial intelligence for the recognition of the handwritten forms through to the publication of the documents using read&search – with the project “Military Personnel Master Data Sheets” of the Tyrolean State Archives, READ-COOP SCE pulled out all the stops with its Transkribus platform. Now, more than 98,000 files are available to the public. They reflect the fates of around 90,000 young soldiers from the historical region of Tyrol who did their military service in the Austro-Hungarian army in the late 19th century. This unique collection is of great importance for scholars and many family historians and has already been in great demand.

Christoph Haidacher, Director of the Tyrolean State Archives, comments: “For the Tyrolean State Archives, the cooperation with Dr. Günter Mühlberger and READ-COOP SCE offered a unique opportunity to make a collection that is in great demand throughout the European region of Tyrol accessible online, beyond the confines of our reading room, and thus to all interested parties. The availability of selected archive holdings on the worldwide web in conjunction with artificial handwriting recognition means pioneering a path that the Tyrolean State Archives are happy to pursue with its partners.

The project was carried out in several stages: In January and February 2020, the data sheets from South Tyrol and Trentino were scanned, and in February and March 2021, the sheets from North Tyrol were scanned – together, a little more than 98,000 individual scans. Each scan represents the first page of a personnel file.

Registration sheet of the Imperial and Royal Army.

The file usually consists of several sheets, which were bound with staples or adhesive tape. Due to an otherwise much-increased workload, only the file’s first page was digitized in each case. Scans of the remaining pages can be ordered for a small fee within the framework of an on-demand procedure.

Several ScanTents were used for the digitization, usually five to seven students were working at the same time.

Digitisation using ScanTent and smartphone

Using the DocScan app, the pages were scanned and cropped, and immediately uploaded into the Transkribus platform. The naming of the files was done according to the labeling of the original boxes. The entire collection could thus be completely digitized in less than 20 working days.

In a further step, a special model was trained to recognize the layout. Here, the P2PaLa tool was used. Since the military personnel master data sheets, “Grundbuchsblätter” in German, are complex forms that have changed over the years and are sometimes very inconsistent, handwriting recognition was deliberately limited to the name, birthplace, and year of birth. These three fields, present in all form types, were trained on several hundred examples. The fields are identified by the software quite reliably, as the following example shows (“Name” = name, “Ort” = (birth-)place, “Jahrgang” = year of birth).

Text and structure recognition with Transkribus

In parallel to creating the layout model, a special handwriting model was trained to recognize the actual family and first names, i.e., the writing within the previously identified “name” fields. For this purpose, the first 10 pages of each document in the South Tyrolean collection were transcribed manually and some documents in their entirety, as training data. Furthermore, name data from other projects, such as the one on the Franciscan Cadastre, were also added, and the general model “Transkribus German Kurrent M2” was used as a base model. Only the name field was recognized with this special model – a feature that is probably still far too little known.

Recognition for a structural field

Transkribus allows different models to be applied to individual structural fields. The two remaining fields, i.e., place of birth and year of birth, were recognized with the standard “German Kurrent M2” model. The recognition quality of about 89% is satisfactory given the rather challenging form fields, the many different font types and writers, and the relatively small set of training data. The automated processing, which was carried out directly in the platform, took several weeks, including waiting times due to the many individual jobs.

The last part then consisted of making the data available using the “read&search” web interface. This process could be completed relatively quickly, as the documents were already in Transkribus and what was left was setting up the according website with the required background images, description texts, fonts, colors etc.

The military personnel master data sheets of the Tyrolean State Archives can now be accessed online at: http://transkribus.eu/r/grundbuchblaetter/.

A crowd-sourcing component is to be added to the project in the coming months. Through the “citizen@science” tool, which is currently being developed, interested users will be allowed to volunteer to help correct the recognized text as well as complete the remaining form data, thus significantly enriching this valuable historical resource further.

Overview