A Short History of Transkribus with Günter Mühlberger

February 22, 2023
Transkribus

Transkribus wasn’t developed overnight. In fact, it was the result of decades of hard work. And although different people have contributed to Transkribus’ development over the years, there is one man who has been there since the very beginning: Günter Mühlberger. Originally a researcher in German language and literature, Günter first became interested in digital humanities in the late 1990s, when the internet was still in its infancy and the idea that a computer programme could automatically transcribe thousands of handwritten documents at the click of a button seemed like just a dream.

Fast forward two decades and that exact programme is now used by people around the world to conduct meaningful research into historical documents. As the chair of READ-COOP, Günter is responsible for making sure that Transkribus continues to develop and help these researchers with their work. We sat down with Günter to find out more about the history of Transkribus and discover what is next for the platform.

Günter Mühlberger played a crucial role in the development of Transkribus.

It all began with a Christmas party

The late 1990s were a very different time. The internet and email had only just arrived, radically changing the way universities functioned and opening up a myriad of opportunities for researchers.

One of those researchers was Günter Mühlberger at the University of Innsbruck. The German language scholar already had some experience in the growing field of digital humanities but a discovery one Christmas proved to be the spark for a new kind of project.

“We had the work Christmas party at 7 pm, and I had an hour before at the office. I was searching a bit on the internet and found out that the EU had a programme called ‘Telematics for Libraries’,” Günter explained. “I immediately thought of the newspaper clipping service at our department, which used to cut out interesting newspaper articles on literary issues from a number of German newspapers and store them in a large archive.”

The library at the University of Innsbruck, where Günter’s first OCR project took place. © University of Innsbruck

“So at the Christmas party, I approached the head of this archive and said I had an idea for digitising this kind of collection and that we might get some money from the EU. And we decided to go for it.” The team put in an application and while they weren’t granted the full amount, the EU gave them enough funding to get this first large OCR project off the ground. They managed to create a system that could digitise the newspaper cuttings and store them digitally, as opposed to physically. “Everyone wanted to have the project, and it was clear that this was the start of something bigger.”

Creating the ALTO format

That something bigger came in the form of a second OCR project, called the Metadata Engine. In English-speaking countries, libraries had been using OCR to digitise books for some time. But in German-speaking countries, most books until 1942 were printed in the Fraktur script, and there weren’t yet any OCR engines capable of recognising Fraktur. So Günter and his team set out to solve this problem with the Metadata Engine.

“There was no solution already around so we invited the ABBYY company to develop the first OCR engine for Fraktur. At that time, the digital data that came out of the engine was mainly full text but did not contain all the internal information, such as the coordinates of the words. We were of the opinion that we needed an open format containing all this data too, so that we could work with it later on.”

© User:Berteun / Wikimedia Commons / CC-BY-SA-3.0

The team put their heads together and came up with the Analysed Layout and Text Object (ALTO) format, which allowed text and layout to be stored in such a way that it could serve several use cases, such as displaying text and image together, just like Transkribus does today.
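To make the idea of storing text and layout together more concrete, here is a minimal Python sketch that reads word text and coordinates from an ALTO-style file. The sample XML is a simplified, hand-written fragment (not schema-validated and not taken from any real project), the namespace shown is the one commonly used for ALTO version 4, and the function name `words_with_boxes` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for ALTO v4 files (assumption: other versions differ).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Simplified, hand-written sample; real ALTO files carry more metadata.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="100" VPOS="200" WIDTH="420" HEIGHT="40">
            <String CONTENT="Innsbrucker" HPOS="100" VPOS="200" WIDTH="250" HEIGHT="40"/>
            <SP WIDTH="20" HPOS="350" VPOS="200"/>
            <String CONTENT="Nachrichten" HPOS="370" VPOS="200" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_with_boxes(alto_xml):
    """Return (word, (x, y, width, height)) for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for string in root.iterfind(".//alto:String", ALTO_NS):
        box = tuple(
            int(float(string.get(key)))
            for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        words.append((string.get("CONTENT"), box))
    return words


for word, box in words_with_boxes(SAMPLE):
    print(word, box)
```

Because each word carries its own position, a viewer can highlight the matching region of the page image when a word in the transcription is selected.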

To publicise the work done in the project, the team went on a tour of libraries in the US. “We went to Harvard, Stanford, the New York Public Library and even to the Library of Congress in Washington DC, where we had an audience of nearly 450 people.”

“It didn’t start well. We were stuck in a traffic jam and were more than an hour late getting to the venue. Then the projector didn’t work and people had to wait for another half an hour. But despite all that, everyone listened so carefully and it was really great just to talk to everyone, to explain what we were doing. And soon afterwards, the Library of Congress decided to implement the ALTO format into their systems, which was a really big achievement.”

Turning OCR into HTR

On the back of the success of the Metadata Engine project, Günter then took part as a sub-project leader in another large OCR project, coordinated by the Royal Library of the Netherlands. The IMPACT project focused on recognising text in old books and newspapers. “It was a really large project with 12 million euros,” Günter explained. “But it more or less failed completely, because it was too focused on trying to improve the old technology.”

Unlike today’s HTR technology, traditional OCR technology worked by using a series of templates for each character. If the OCR system was presented with an image of a new character, it compared the shape of the character against all the different templates and chose the one it was most similar to.
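The template approach described above can be sketched in a few lines of Python. This is a toy illustration only: the 3x3 "glyphs", the pixel-agreement score and the function names are all invented for this example and do not reproduce any particular OCR engine.

```python
# Toy 3x3 binary glyphs: 1 = ink, 0 = background (made up for illustration).
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}


def similarity(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def classify(glyph):
    """Return the label of the template most similar to the glyph."""
    return max(TEMPLATES, key=lambda label: similarity(glyph, TEMPLATES[label]))


# A "T" with one pixel of ink missing still matches the "T" template best.
noisy_t = ((1, 1, 1), (0, 1, 0), (0, 0, 0))
print(classify(noisy_t))
```

A printed character with a little noise still lands on the right template. A handwritten character can differ from every template by so much that the best match is effectively arbitrary.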

“But with complicated characters, such as handwritten ones, this technology doesn’t work. The characters are just so different from the templates that the system can’t identify them. And this makes recognising handwritten documents very difficult.”

Thankfully, a team from IBM was also involved in the project and they came up with an intriguing solution. “They had the idea of isolating single words and then presenting the digital version of the word to the user. The user can then correct any mistakes in the transcription and this information goes back to the engine to improve the whole thing. This is the very idea that Transkribus is based on and you could say it was the beginning of the platform.”

Transkribus was created to make archives more accessible for everyone. © University of Innsbruck

A winning collaboration

The IBM team weren’t the only ones working on this kind of technology. The Technical University of Valencia were also conducting research into new text recognition systems, and they approached the Innsbruck team about a collaboration. “We had a good standing at the EU, and there was a new call for the digitisation of cultural heritage. Valencia drafted a proposal, it was accepted, and together with several partners such as University College London, the TranScriptorium project started at the beginning of 2013.”

TranScriptorium was the first real project on handwriting recognition. Back then, the technology was a lot slower: it took roughly 20 minutes to recognise just a single page. But the biggest difference between then and now was that all the Ground Truth was generated in-house by the team. There was no way for users to input their own Ground Truth data or train their own models.

“I realised from the very beginning that it would be a lot of work to generate Ground Truth for the learning algorithm. Also, that we would need a user tool for this so that Ground Truth could be easily created, as well as gathered in a standardized format and a central place. Sebastian Colutto created a Java tool for the Ground Truth creation which was then connected to a central server, where all the Ground Truth could be stored.”

This rudimentary tool was effectively the first Transkribus user interface and set the groundwork for the platform to come. “The very first version went online in February 2015. In the following summer, we made it public and people liked it. They liked that you could have an automatic transcription but without losing that connection to the image.”

Creating a virtual research environment

While the TranScriptorium project was taking place, another interesting EU project call appeared. “They were providing funding for the creation of virtual research environments and that was exactly what we were doing. So we drafted a proposal and it was the only proposal out of about 70 or 80 to receive the maximum score of 15 points. This gave us the chance to realise our idea on the basis of a public investment of 8.2 million euros.”

This idea was to create a platform that would allow users to get automatic transcriptions of handwritten documents and train AI models that could read specific types of handwriting. In other words, the team wanted to make Transkribus a reality.

“We had promised to get the platform up and running from the very first day of the project which was the 1st January 2016.” From this point on, Transkribus only grew in popularity. At the first-ever Transkribus user conference in 2017, the CITlab team from the University of Rostock together with the Planet AI company demonstrated the new baseline recognition technology, which would greatly improve layout analysis and went down very well with the 120 conference attendees.

“Soon afterwards the CITlab team also introduced the new HTR+ engine, which was 40-50% better than the previous one. Before, you had about a 15% character error rate. But with the same training data and the new engine, you got a 7-8% error rate. And the response was overwhelming. With the new and improved error rate, Transkribus suddenly became a viable option for the majority of researchers. Then the team from Valencia introduced PyLaia – an open-source HTR engine which is now the core engine in Transkribus.”
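For readers who want to check these figures, the short Python sketch below uses the usual definition of character error rate (the edit distance between the reference and the automatic transcription, divided by the length of the reference) and computes the relative improvement between two error rates. The function names and example strings are made up for illustration and are not part of Transkribus.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(old_rate, new_rate):
    """Fraction by which the error rate dropped, relative to the old rate."""
    return (old_rate - new_rate) / old_rate


# One wrong character in an 11-character word.
print(character_error_rate("Transkribus", "Transkribvs"))

# Going from 15% to 8% and from 15% to 7% character error rate.
print(round(relative_reduction(0.15, 0.08), 3))
print(round(relative_reduction(0.15, 0.07), 3))
```

By this measure, a drop from 15% to 7-8% is a relative reduction of roughly 47-53%, in line with the "40-50% better" quoted above.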

The founding of READ-COOP

By this point, it was clear that Transkribus was here to stay. But the question arose: who would be responsible for the platform? Who would fix bugs and maintenance issues and develop the platform further? Back then, everything was based at the University of Innsbruck. However, as only a small percentage of users were from Austria, the university was unlikely to want to host it forever.

It was also important to make sure that all the project partners had their say in the management of the platform they had all worked so hard to create. The solution was to create a cooperative so that ownership could be shared among stakeholders. “The idea was that it could be a kind of shared service but with a commercial impact, so we could pay for the maintenance and development of the platform. However, back then, none of us really knew any details about cooperatives.”

And setting it up proved to be harder than the team had imagined. “We were pretty much the first European cooperative to be set up in Austria, so there was a lot of bureaucracy to deal with.” Then there was the question of money. The team needed to raise a certain amount of money to set up the cooperative, and project partners were asked to become “founding members” for a modest fee. “Finding enough founding members to do this wasn’t too difficult. What was more difficult was getting them all in the same room at the same time to sign the papers.”

In the end though, thanks to a lot of patience, hard work, and bureaucratic know-how, the Austrian courts finally signed off on the cooperative. In July 2019 — over 20 years after Günter first had the idea for his “telematics” project — READ-COOP became the official guardians of the Transkribus platform.

20 years of digitisation success

The last two decades have been an exciting time for handwriting recognition, and Günter Mühlberger’s projects have been at the forefront of that technology. We asked Günter what he is most proud of during that time.

“I’m proud of two things. Firstly, that we have such a great team working on this. Secondly, that so many people today use Transkribus for their research. My role in this whole thing was to have the feeling that this is the right moment, that there are people out there with the right technology, and that we can combine skills and create a tool that will help a lot of people not only in the academic and archival sphere but also with their family documents.”

“For the future, I hope we will continue to be able to support people in this way. Only a very small percentage of the documents in the world are digitised and there are still so many interesting documents out there waiting to be discovered. Exploring them with HTR will be a big boost to historical research.”

Thank you, Günter, for talking to us!

Transkribus would be nothing without its community. Transkribus User Conference 2022
