If you’re new to Transkribus, or machine learning in general, then you are also probably new to the term “Ground Truth”. In short, Ground Truth is the accurate and verified data which is used to train machine learning models, such as those used for automatic transcriptions in Transkribus. And this data is pretty important for the success of your model, as machine learning is just a tool that statistically replicates the data you supply. Therefore, the better your Ground Truth data, the better your model will be.
In this post, we’re going to take a closer look at what Ground Truth is, why it is so important for handwritten text recognition software, and how you can best prepare Ground Truth data on Transkribus.
A basic Ground Truth definition
Used in statistics and machine learning, Ground Truth is data that we assume to be true. For example, you have two images. One image depicts a dog, and the other a cat. We know this to be true because we, as humans, have the ability to recognise different animals. If you ask a thousand people which picture contains the dog, they would overwhelmingly point to the same picture.
But for a computer, this task is a lot harder. A computer does not automatically know which animal is which; it has to be taught how to do this. And this is where Ground Truth comes in. If you wanted to train a computer to recognise which photos contain dogs and which contain cats, you would first have to provide it with a large dataset of images, each labelled as either “photo with dog” or “photo with cat”. From these Ground Truth labels, the computer can learn what images with dogs look like and what images with cats look like, and create a model containing this information.
Once that model has been trained with enough data, it can be presented with a brand new image and should be able to say whether that image contains a cat or a dog, just like a human would. This is why it is called “artificial intelligence”: it is training computers to do intelligent tasks that humans do naturally. And Ground Truth is the basis for this whole process.
Male tabby cat. Alvesgaspar. Public domain, via Wikimedia Commons
Canis lupus familiaris (perro) en Monfero. Fernando Losada Rodríguez. Public domain, via Wikimedia Commons
Distinguishing between cats and dogs is easy for a human, but almost impossible for an untrained computer.
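To make the training idea concrete, here is a minimal Python sketch of a toy classifier that learns from labelled examples. Everything in it is an assumption made for illustration: the two numeric features (ear pointiness and snout length), their values, and the function names `train` and `predict`. A real image model learns from pixels and is far more complex, but the principle is the same: the model only knows what the Ground Truth labels tell it.

```python
from collections import defaultdict
from math import dist

# Hypothetical Ground Truth: (features, label) pairs. The two features
# (ear pointiness, snout length) and all values are invented.
ground_truth = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
]

def train(examples):
    """Build a 'model': the average feature vector (centroid) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the new example."""
    return min(model, key=lambda label: dist(model[label], features))

model = train(ground_truth)
prediction = predict(model, (0.85, 0.25))  # a new, unlabelled example
```

If one of the training examples were mislabelled, the centroids would shift and the predictions would shift with them, which is why the labels need to be verified.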
Is Ground Truth just used for training models?
No, it’s also used for testing models. Let’s say you have already created your model for distinguishing between photos of cats and dogs. Now, you want to test how well that model actually works. You can do this by presenting the model with Ground Truth datasets for which you have a “correct answer”, and see if the model can come up with that same answer.
With the cat/dog model, that means you would show the model a series of images which have already been labelled as either “photo with dog” or “photo with cat” and count how often the model assigns the correct label to a photo. This way, you can see how well your model performs.
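The counting step described above can be sketched in a few lines of Python. The function name `accuracy` and the example labels are invented for illustration.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the Ground Truth labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("Each prediction needs exactly one Ground Truth label")
    if not true_labels:
        raise ValueError("Cannot score an empty test set")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Invented test set: what the model said vs. the verified labels.
model_output = ["cat", "dog", "dog", "cat", "dog"]
verified = ["cat", "dog", "cat", "cat", "dog"]
score = accuracy(model_output, verified)  # 4 of 5 labels match
```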
How is Ground Truth data used in text recognition models?
What if you didn’t want a model that could distinguish between cats and dogs in photos, but one that could read and transcribe historical documents? These kinds of models are the core technology behind all handwritten text recognition platforms, and they are trained with Ground Truth in exactly the same way. In this case, the Ground Truth data isn’t images of cats and dogs but images of texts with accurate transcriptions. Using machine learning, the computer learns from the data which characters in the image represent which characters in the transcription. Then, just like before, it takes this information and uses it to create a model.
Some text recognition platforms only allow you to use models that have been trained by the creators of the platform. What sets Transkribus apart is that it allows users to create their own models and train them to read a particular type of document. Because these custom models are trained on very specific Ground Truth data, they tend to be more accurate at transcribing documents similar to those in the Ground Truth dataset. This is ideal if you have very specific documents, such as letters by a small group of people, a handwritten diary, or notarial documents from a certain time period.
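The testing idea from the cat/dog example carries over to text, except that a transcription is rarely simply right or wrong. One common way to score it is to count how many single-character edits separate the model's output from the Ground Truth transcription and divide by the length of the Ground Truth, often called a character error rate. The sketch below is an illustration under that assumption, with invented function names and sample text; it is not a description of how Transkribus computes its own figures.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions separating two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # skip a reference character
                current[j - 1] + 1,      # skip a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edits needed, relative to the length of the Ground Truth text."""
    if not reference:
        raise ValueError("The Ground Truth transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Invented example: the model misread the final character.
cer = character_error_rate("Dear Sir", "Dear Sin")
```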
Ground Truth is known as “Training data” within Transkribus. © Transkribus
How do I prepare Ground Truth for a Transkribus model?
As you might have guessed already, training your own custom model requires creating Ground Truth data to train it on. In most cases, you will need at least 10,000 words of transcribed handwritten text or 5,000 words of transcribed printed text to train your first model. However, this varies depending on the type of material and model.
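If you keep your transcriptions as plain text, a quick word count shows how far you are from these rough minimums. The Python sketch below assumes one string per transcribed page and counts whitespace-separated words; the function names are invented, and Transkribus may count words differently.

```python
# Rough minimums quoted above; real needs vary by material and model.
MIN_WORDS = {"handwritten": 10_000, "printed": 5_000}

def count_words(transcribed_pages):
    """Count whitespace-separated words across a list of page texts."""
    return sum(len(page.split()) for page in transcribed_pages)

def words_still_needed(transcribed_pages, material):
    """How many more words are needed to reach the minimum (0 if met)."""
    return max(0, MIN_WORDS[material] - count_words(transcribed_pages))

# Invented sample pages.
pages = ["In the year of our Lord 1750", "I hereby declare"]
remaining = words_still_needed(pages, "handwritten")
```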
There are three main ways of finding suitable Ground Truth data for models in Transkribus:
- You can manually transcribe documents. This is done by uploading images of the documents to Transkribus and then manually typing out the transcription in the text editor. You can find out more about manual transcription in our Help Center.
- You can find pre-transcribed examples of texts. For example, if a colleague has already been using Transkribus to transcribe documents similar to the ones you are working on, they can share these directly with your Transkribus collection and you can then save them as Ground Truth.
- You can take a public model as a base. Transcribe all your documents with the most suitable public model and then correct those transcriptions to make them more accurate and re-save them as Ground Truth. This will tailor the model to your specific documents, and save a lot of time in the process.
Using a public model as a base for your custom model can save a lot of time and effort. © Transkribus
What is important to remember when preparing Ground Truth?
The success of your model depends on the quality of your Ground Truth data. And when it comes to quality, the most important aspects are accuracy and consistency.
Firstly, your transcriptions should be as accurate as possible. In machine learning, the model automatically assumes that the Ground Truth data it has been given is true. That means that if there are inaccuracies in your Ground Truth, then the model will think that these inaccuracies are correct, and this will reduce the accuracy of any transcriptions the model goes on to produce.
Secondly, your transcriptions should be consistent. There are many different ways to transcribe linguistic features such as diacritics, ligatures, or S-characters. The way you choose to transcribe these features in your Ground Truth dictates the way that your model will go on to transcribe them. Therefore, it makes sense to consistently transcribe your Ground Truth documents in exactly the way you want the rest of your documents to be transcribed so that the model can learn your system and apply it to later documents.
You can find out more about linguistic features to pay particular attention to on this page.
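One consistency problem that is easy to miss by eye is a diacritic typed in two different ways: as a single precomposed character, or as a base letter followed by a combining accent. The two look identical on screen but are different data. The Python sketch below flags lines that are not in Unicode NFC form, using the standard `unicodedata` module. Choosing NFC is an assumption made for this example; what matters is picking one convention and keeping to it. The function names and sample lines are invented.

```python
import unicodedata

def non_nfc_lines(lines):
    """Return 1-based numbers of lines that are not in Unicode NFC form,
    e.g. 'e' + combining accent instead of one precomposed character."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if unicodedata.normalize("NFC", line) != line
    ]

def to_nfc(lines):
    """Rewrite every line in NFC so each accent is encoded one way only."""
    return [unicodedata.normalize("NFC", line) for line in lines]

# Invented sample: both lines look the same when displayed.
lines = [
    "caf\u00e9 und Stra\u00dfe",   # precomposed e-acute
    "cafe\u0301 und Stra\u00dfe",  # e followed by a combining acute accent
]
flagged = non_nfc_lines(lines)
cleaned = to_nfc(lines)
```

A check like this only catches encoding differences. Editorial decisions, such as whether to expand abbreviations or how to render ligatures, still need a written convention that every transcriber follows.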
Where can I find out more about preparing Ground Truth and training models in Transkribus?
Preparing Ground Truth and training models in Transkribus is an extensive topic, and it is worth doing some research before embarking on your first model. Here are some resources where you can find out more about training text recognition models with Ground Truth data:
- Our Help Center is a mine of information on all aspects of Transkribus. You can check out the section about training text recognition models here.
- The FAQs on our website provide answers to the most commonly asked questions about models and training data.
- We have also prepared the following video as a user-friendly guide to training models in Transkribus: