Transkribus wasn’t developed overnight. In fact, it was the result of decades of hard work. And although different people have contributed to Transkribus’ development over the years, there is one man who has been there since the very beginning: Günter Mühlberger. Originally a researcher in German language and literature, Günter first became interested in digital humanities in the late 1990s, when the internet was still in its infancy and the idea that a computer programme could automatically transcribe thousands of handwritten documents at the click of a button seemed like just a dream.
Fast forward two decades and that exact programme is now used by people around the world to conduct meaningful research into historical documents. As the chair of READ-COOP, Günter is responsible for making sure that Transkribus continues to develop and help these researchers with their work. We sat down with Günter to find out more about the history of Transkribus and discover what is next for the platform.
It all began with a Christmas party
The late 1990s were a very different time. The internet and email had only just arrived, radically changing the way universities functioned and opening up a myriad of opportunities for researchers.
One of those researchers was Günter Mühlberger at the University of Innsbruck. The German language scholar already had some experience in the growing field of digital humanities but a discovery one Christmas proved to be the spark for a new kind of project.
“We had the work Christmas party at 7 pm, and I had an hour before at the office. I was searching a bit on the internet and found out that the EU had a programme called ‘Telematics for Libraries’,” Günter explained. “I immediately thought of the newspaper clipping service at our department, which used to cut out interesting newspaper articles on literary issues from a number of German newspapers and store them in a large archive.”
“So at the Christmas party, I approached the head of this archive and said I had an idea for digitising this kind of collection and that we might get some money from the EU. And we decided to go for it.” The team put in an application and while they weren’t granted the full amount, the EU gave them enough funding to get this first large OCR project off the ground. They managed to create a system that could digitise the newspaper cuttings and store them digitally, as opposed to physically. “Everyone wanted to have the project, and it was clear that this was the start of something bigger.”
Creating the ALTO format
That something bigger came in the form of a second OCR project, called the Metadata Engine. In English-speaking countries, libraries had been using OCR to digitise books for some time. But in German-speaking countries, most books until 1942 were printed in the Fraktur script, and there weren’t yet any OCR engines capable of recognising Fraktur. So Günter and his team set out to solve this problem with the Metadata Engine.
“There was no solution already around so we invited the ABBYY company to develop the first OCR engine for Fraktur. At that time, the digital data that came out of the engine was mainly full text but did not contain all the internal information, such as the coordinates of the words. We were of the opinion that we needed an open format containing all this data too, so that we could work with it later on.”
The team put their heads together and came up with the Analysed Layout and Text Object (ALTO) format, which allowed text and layout to be stored in such a way that it could serve several use cases, such as displaying text and image together, just like Transkribus does today.
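To make the idea of storing text and layout together more concrete, here is a minimal Python sketch that reads word content and coordinates from a small ALTO-style fragment. The sample XML is hand-written for illustration and does not come from the Metadata Engine project, and the function name `words_with_boxes` is our own. The element and attribute names (`String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) follow the published ALTO schema, here assumed to be version 3.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO-style fragment (illustrative only).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Guten" HPOS="120" VPOS="200" WIDTH="210" HEIGHT="48"/>
            <String CONTENT="Morgen" HPOS="350" VPOS="200" WIDTH="260" HEIGHT="48"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

# Namespace of the ALTO schema version assumed in this sketch.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}


def words_with_boxes(alto_xml):
    """Return (text, (x, y, width, height)) for every word in the document."""
    root = ET.fromstring(alto_xml)
    words = []
    for string in root.iterfind(".//alto:String", NS):
        box = tuple(
            int(float(string.get(key)))
            for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        words.append((string.get("CONTENT"), box))
    return words


for text, box in words_with_boxes(ALTO_SAMPLE):
    print(text, box)
```

Because every word carries its own position on the page, a viewer can highlight the matching region of the image when a reader clicks on a word in the text, which is one of the use cases described above.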
To publicise the work done in the project, the team went on a tour of libraries in the US. “We went to Harvard, Stanford, the New York Public Library and even to the Library of Congress in Washington DC, where we had an audience of nearly 450 people.”
“It didn’t start well. We were stuck in a traffic jam and were more than an hour late getting to the venue. Then the projector didn’t work and people had to wait for another half an hour. But despite all that, everyone listened so carefully and it was really great just to talk to everyone, to explain what we were doing. And soon afterwards, the Library of Congress decided to implement the ALTO format into their systems, which was a really big achievement.”
Turning OCR into HTR
On the back of the success of the Metadata Engine project, Günter then took part as a sub-project leader in another large OCR project, coordinated by the Royal Library of the Netherlands. The IMPACT project focused on recognising historical books and newspapers. “It was a really large project with 12 million euros,” Günter explained. “But it more or less failed completely, because it was too focused on trying to improve the old technology.”
Unlike today’s HTR technology, traditional OCR technology worked by using a series of templates for each character. If the OCR system was presented with an image of a new character, it compared the shape of the character against all the different templates and chose the one it was most similar to.
“But with complicated characters, such as handwritten ones, this technology doesn’t work. The characters are just so different from the templates that the system can’t identify them. And this makes recognising handwritten documents very difficult.”
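The template approach described above can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not the algorithm of any particular OCR engine: the 3×5 bitmaps, the pixel-agreement score and the names `TEMPLATES`, `similarity` and `classify` are all invented for the example.

```python
# Each glyph is a small binary bitmap: '#' is ink, '.' is background.
# These three templates are made up for the example.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Fraction of pixels on which two equally sized bitmaps agree."""
    pairs = [
        (g, t)
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def classify(glyph, templates=TEMPLATES):
    """Pick the label whose template is most similar to the glyph."""
    return max(templates, key=lambda label: similarity(glyph, templates[label]))


# A printed "T" with one stray pixel is still closest to the "T" template.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
print(classify(noisy_t))
```

A clean printed character differs from its template by only a pixel or two, so the best match is usually correct. A handwritten character can differ from every template by a similar, large amount, and the best match then says very little.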
Thankfully, a team from IBM was also involved in the project and they came up with an intriguing solution. “They had the idea of isolating single words and then presenting the digital version of the word to the user. The user can then correct any mistakes in the transcription and this information goes back to the engine to improve the whole thing. This is the very idea that Transkribus is based on and you could say it was the beginning of the platform.”
A winning collaboration
The IBM team weren’t the only ones working on this kind of technology. The Technical University of Valencia were also conducting research into new text recognition systems, and they approached the Innsbruck team about a collaboration. “We had a good standing at the EU, and there was a new call for the digitisation of cultural heritage. Valencia drafted a proposal, it was accepted, and together with several partners such as University College London, the TranScriptorium project started at the beginning of 2013.”
TranScriptorium was the first real project dedicated to handwriting recognition. Back then, the technology was a lot slower: it took roughly 20 minutes to recognise a single page. But the biggest difference between then and now was that all the Ground Truth, the manually verified transcriptions used to train the recognition models, was generated in-house by the team. There was no way for users to contribute their own Ground Truth or train their own models.
“I realised from the very beginning that it would be a lot of work to generate Ground Truth for the learning algorithm. Also, that we would need a user tool for this so that Ground Truth could be easily created, as well as gathered in a standardised format and a central place. Sebastian Colutto created a Java tool for the Ground Truth creation which was then connected to a central server, where all the Ground Truth could be stored.”
This rudimentary tool was effectively the first Transkribus user interface and set the groundwork for the platform to come. “The very first version went online in February 2015. In the following summer, we made it public and people liked it. They liked that you could have an automatic transcription but without losing that connection to the image.”
Creating a virtual research environment
While the TranScriptorium project was taking place, another interesting EU project call appeared. “They were providing funding for the creation of virtual research environments and that was exactly what we were doing. So we drafted a proposal and it was the only proposal out of about 70 or 80 to receive the maximum score of 15 points. This gave us the chance to realise our idea on the basis of a public investment of 8.2 million euros.”
This idea was to create a platform that would allow users to get automatic transcriptions of handwritten documents and train AI models that could read specific types of handwriting. In other words, the team wanted to make Transkribus a reality.
“We had promised to get the platform up and running from the very first day of the project, which was 1 January 2016.” From this point on, Transkribus only grew in popularity. At the first-ever Transkribus user conference in 2017, the CITlab team from the University of Rostock together with the Planet AI company demonstrated the new baseline recognition technology, which would greatly improve layout analysis and went down very well with the 120 conference attendees.
“Soon afterwards the CITlab team also introduced the new HTR+ engine, which was 40-50% better than the previous one. Before, you had about a 15% character error rate. But with the same training data and the new engine, you got a 7-8% error rate. And the response was overwhelming.” With the new and improved error rate, Transkribus suddenly became a viable option for the majority of researchers. Then the team from Valencia introduced PyLaia, an open-source HTR engine which is now the core engine in Transkribus.
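For readers unfamiliar with the metric, the character error rate is commonly computed as the edit distance between the automatic transcription and the correct one, divided by the length of the correct one. The Python sketch below uses that common definition, which we assume here rather than take from the Transkribus code, and also checks the arithmetic behind the figures quoted above, on the assumption that "better" refers to the relative reduction in error rate. All function names are our own.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(old_rate, new_rate):
    """Share of the old error rate that the new engine removes."""
    return (old_rate - new_rate) / old_rate


# One wrong character in an eleven-character word.
print(f"{cer('Transkribus', 'Transcribus'):.1%}")

# Going from 15% to 8% or 7% character error rate.
print(f"{relative_reduction(0.15, 0.08):.0%}")
print(f"{relative_reduction(0.15, 0.07):.0%}")
```

Under that reading, dropping from 15% to between 7% and 8% removes roughly half of the errors, which is broadly in line with the improvement described in the quote.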
The founding of READ-COOP
By this point, it was clear that Transkribus was here to stay. But the question arose: who would be responsible for the platform? Who would fix bugs, handle maintenance and develop the platform further? Back then, everything was based at the University of Innsbruck. However, as only a small percentage of users were from Austria, the university was unlikely to want to host it forever.
It was also important to make sure that all the project partners had their say in the management of the platform they had all worked so hard to create. The solution was to create a cooperative so that ownership could be shared among stakeholders. “The idea was that it could be a kind of shared service but with a commercial impact, so we could pay for the maintenance and development of the platform. However, back then, none of us really knew any details about cooperatives.”
And setting it up proved to be harder than the team had imagined. “We were pretty much the first European cooperative to be set up in Austria, so there was a lot of bureaucracy to deal with.” Then there was the question of money. The team needed to raise a certain amount of money to set up the cooperative, and project partners were asked to become “founding members” for a modest fee. “Finding enough founding members to do this wasn’t too difficult. What was more difficult was getting them all in the same room at the same time to sign the papers.”
In the end though, thanks to a lot of patience, hard work, and bureaucratic know-how, the Austrian courts finally signed off on the cooperative. In July 2019, over 20 years after Günter first had the idea for his “telematics” project, READ-COOP became the official guardian of the Transkribus platform.
20 years of digitisation success
The last two decades have been an exciting time for handwriting recognition, and Günter Mühlberger’s projects have been at the forefront of that technology. We asked Günter what he is most proud of during that time.
“I’m proud of two things. Firstly, that we have such a great team working on this. Secondly, that so many people today use Transkribus for their research. My role in this whole thing was to have the feeling that this is the right moment, that there are people out there with the right technology, and that we can combine skills and create a tool that will help a lot of people not only in the academic and archival sphere but also with their family documents.”
“For the future, I hope we will continue to be able to support people in this way. Only a very small percentage of the documents in the world are digitised and there are still so many interesting documents out there waiting to be discovered. Exploring them with HTR will be a big boost to historical research.”
Thank you, Günter, for talking to us!