Using Transkribus for OCR with printed books

Transkribus might be known for its ability to transcribe and enrich handwritten documents, but did you know you can also use it as OCR software? OCR stands for optical character recognition, a technology used to transcribe the text in images, just like Transkribus does. The difference is that OCR systems are usually only capable of transcribing printed texts, and not handwritten texts such as historical documents.

The advantage of handwritten text recognition (HTR) systems like Transkribus is that they are capable of transcribing both handwritten texts and printed texts. In fact, there have been several Transkribus projects which have involved the large-scale digitisation and transcription of books and other printed texts. So if you are also looking to digitise a collection of printed books, here is everything you need to know about using Transkribus as an OCR service.

What is OCR?

As mentioned in the introduction, there is one core similarity between optical character recognition (OCR) and handwritten text recognition (HTR) platforms: they both convert an image of a document to text. You can upload a scanned page from a book to the platform and it will turn the printed words into a digital text file.

The difference between the two systems is the technology behind the text converter. As the name would suggest, OCR is based on character recognition. Traditional optical character recognition software is basically like a giant database of all the possible characters in all the possible fonts. The OCR engine detects the characters in the image and then, using a technology called pattern recognition, compares each extracted character against the stored characters to see how similar they are. Once it finds a match, that character is inserted into the transcription.
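To make the idea of checking a character against stored templates more concrete, here is a minimal sketch in Python. It is a toy illustration only: the 3x3 bitmaps, the TEMPLATES table and the function names are all invented for this example, and real OCR engines work on far larger images with far more sophisticated matching.

```python
# Toy illustration of template-based character recognition.
# The 3x3 bitmaps and the TEMPLATES table are made up for this example;
# they are not taken from any real OCR engine.

TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}


def similarity(glyph, template):
    """Return the fraction of pixels that agree between two equally sized bitmaps."""
    pixels = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    matches = sum(1 for g, t in pixels if g == t)
    return matches / len(pixels)


def recognise(glyph):
    """Return the stored character whose template is most similar to the glyph."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))


# A "T" with one stray pixel in the bottom-right corner.
noisy_t = ("111",
           "010",
           "011")

print(recognise(noisy_t))
```

The stray pixel lowers the similarity score slightly, but "T" is still the closest stored character, so it is the one inserted into the transcription. The same sketch also hints at the weakness of this approach: a glyph that resembles none of the stored templates still gets assigned to whichever one happens to score highest.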

New Zealand Alpine Journal No. 12 (1922), via New Zealand Alpine Club

Why would you use HTR for printed texts?

OCR technology is fine for printed text, where there is a finite number of possible fonts and characters. However, handwritten text has an infinite number of possibilities. No two people write in exactly the same way, and even the same person might write differently in different situations — for example, on a form as opposed to a shopping list. This makes it challenging for OCR algorithms, as they are constantly presented with information very different to anything they’ve seen before. Even OCR systems with intelligent character recognition or more advanced optical word recognition often struggle with handwritten texts.

This is where handwritten text recognition, or HTR, comes in. HTR technology, such as Transkribus, is a more advanced form of OCR that uses machine learning to learn how to read different types of handwriting and go on to make educated guesses about handwriting it has never seen before. But although you don’t need HTR to process printed texts, it has several benefits over OCR. For example, it is a more sophisticated and accurate technology, which uses several different strategies to decipher the text in images, as opposed to just comparing it to pre-set templates.

But the biggest advantage of Transkribus over regular OCR systems is that it can be tailored to your specific text through the use of AI models. These models have been specially trained to read certain types of printed texts — for example, German books printed in Fraktur — and you can choose to perform your text recognition with one of over a hundred models. Because the system doesn’t take a one-size-fits-all approach, you can customise the platform according to the kind of printed texts you are working with, resulting in more accurate transcriptions.

How to perform OCR with Transkribus

Step 1: Scanning the book

The first step in the OCR process is to scan in all the pages you want to transcribe or extract text from. There are several ways to do this, from using a high-end scanner to simply taking an image with your smartphone.

If you are using the latter method, you could consider using the ScanTent. This innovative product provides the optimal lighting environment for making high-quality images of books and documents. You simply place the material you want to scan in the tent, attach your smartphone to the mount on the top and take an image in the same way as normal.

In addition, you can use the DocScan app. While the app can be used with any sort of document, it is particularly useful for books as it automatically registers when you turn a page and takes a new image after every turn. This enables you to scan entire books quickly, without having to constantly press buttons on your phone’s touchscreen.

The Bibliothèque Nationale de France in Paris now offers ScanTents to all their visitors, so that they can easily make images of materials in the library. You can find out more about that in this blog post.

ScanTent © Transkribus

Step 2: Uploading scanned documents

Once you have your scans, you need to upload them to Transkribus. You first need to set up an account and log in. Then you need to create a collection to store your scans in. If you are conducting a larger project with many different books, then it makes sense to create a separate collection for each book.

You can upload scans in JPEG, PNG, or PDF format to Transkribus. If you have used the DocScan app (see above), then you can automatically upload your scans to Transkribus, without having to download them first.
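If you have a large folder of scans, it can help to check the file formats before uploading. The following is a small sketch of such a local check in Python. It is not part of Transkribus: the function name, the example file names and the mapping of JPEG, PNG and PDF to these particular file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed file extensions for the JPEG, PNG and PDF formats mentioned above.
ACCEPTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf"}


def split_by_format(filenames):
    """Sort file names into (accepted, rejected) based on their extension."""
    accepted, rejected = [], []
    for name in sorted(filenames):
        # Compare in lower case so that "SCAN.JPG" and "scan.jpg" are treated alike.
        if Path(name).suffix.lower() in ACCEPTED_SUFFIXES:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected


# Hypothetical file names, for illustration only.
scans = ["page_002.JPG", "page_001.png", "notes.docx", "volume.pdf", "page_003.tiff"]
accepted, rejected = split_by_format(scans)

print("Ready to upload:", accepted)
print("Convert first:", rejected)
```

Any files in the second list would need to be converted to one of the accepted formats before you upload them.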

If you are working with private or sensitive information, you will be pleased to hear that all documents uploaded to Transkribus are private by default. They are stored on the servers of READ-COOP SCE (i.e. the company that develops and maintains the software) in a GDPR-compliant manner. The servers are all located in Innsbruck, Austria, and the data may be processed according to the terms & conditions on the READ-COOP SCE website.

Step 3: Choosing a public model

Before you can start the text recognition process with Transkribus, you have to choose an AI model. This model is like a guidebook, telling the software how to transcribe the individual characters in each document. Therefore, the model you choose will affect how Transkribus transcribes the text in your books.

Luckily, because printed texts are relatively easy for HTR platforms to transcribe, there are many very efficient public models available in several different languages. You can see all the public models on our website and can filter according to language and text type (handwritten, typewritten or printed). This should bring up all the relevant models for performing OCR on your books.

Step 4: Running the text recognition

The final step is to perform the text recognition itself. Open the document or collection on Transkribus and select “Text recognition” from the left-hand toolbar. Then select the right model for your documents and click “Start” to begin the text recognition.

The text recognition may take some time, depending on the size and type of the job. However, you can view the status at any time by selecting “Jobs” from the left-hand toolbar. You can find full instructions for the text recognition here.

German Historical Novels (1789-1848), via Read & Search

How can I publish my transcribed books online?

One of the main motivations for digitising and transcribing printed books is to make them readily available online for everyone to use.

There are various ways you can publish your digitised books online. Often, large organisations such as universities and libraries have their own platforms for publishing digitised material. However, if you do not have access to such a system, it is possible to publish your transcribed books via read&search.

read&search is an easy-to-use platform which allows you to publish documents directly from Transkribus. You can simply choose which collections you wish to publish and our team will set up a fully searchable database of those collections. This allows users to quickly find the specific information they need, without having to read through the entire collection. Several printed book collections have been published on read&search, including the NOSCEMUS collection of scientific texts and this collection of historical novels in German.

If you are interested in setting up a read&search for your collection, then you can contact our team here.
