What is Ground Truth?

May 10, 2023
Transkribus

If you’re new to Transkribus, or machine learning in general, then you are probably also new to the term “Ground Truth”. In short, Ground Truth is the accurate and verified data used to train machine learning models, such as those used for automatic transcription in Transkribus. This data is crucial to the success of your model, because machine learning is a tool that statistically replicates the data you supply. Therefore, the better your Ground Truth data, the better your model will be.

In this post, we’re going to take a further look at what Ground Truth is, why it is so important for handwritten text recognition software, and how you can best prepare Ground Truth data on Transkribus.

A basic Ground Truth definition

Used in statistics and machine learning, Ground Truth is data that we assume to be true. For example, you have two images. One image depicts a dog, and the other a cat. We know this to be true because we, as humans, have the ability to recognise different animals. If you ask a thousand people which picture contains the dog, they would overwhelmingly point to the same picture.

But for a computer, this task is a lot harder. A computer does not automatically know which animal is which; it has to be taught. And this is where Ground Truth comes in. If you wanted to train a computer to recognise which photos contain dogs and which contain cats, you would first have to provide it with a large dataset of images, each labelled as either “photo with dog” or “photo with cat”. From these Ground Truth labels, the computer can learn what images with dogs look like and what images with cats look like, and create a model containing this information.

Once that model has been trained on enough data, it can be presented with a brand new image and should be able to say whether that image contains a cat or a dog, just like a human would. This is why it is called “artificial intelligence”: it is training computers to do intelligent tasks that humans do naturally. And Ground Truth is the basis for this whole process.

Male tabby cat. Alvesgaspar. Public domain, via Wikimedia Commons

Canis lupus familiaris (perro) en Monfero. Fernando Losada Rodríguez. Public domain, via Wikimedia Commons

Distinguishing between cats and dogs is easy for a human, but almost impossible for an untrained computer.

Is Ground Truth just used for training models?

No, it’s also used for testing models. Let’s say you have already created your model for distinguishing between photos of cats and dogs. Now, you want to test how well that model actually works. You can do this by presenting the model with Ground Truth datasets for which you have a “correct answer”, and see if the model can come up with that same answer.

With the cat/dog model, that means you would show the model a series of images which have already been labelled as either “photo with dog” or “photo with cat” and count how often the model assigns the correct label to a photo. This way, you can see how well your model performs.
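To make the counting step concrete, here is a minimal sketch in Python. The labels are invented for illustration, and the function name `accuracy` is our own choice rather than anything from Transkribus. It simply compares each label the model predicted with the Ground Truth label for the same photo.

```python
def accuracy(predicted, ground_truth):
    """Return the fraction of predictions that match the Ground Truth labels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one labelled example")
    # Count the positions where the model's label equals the verified label.
    correct = sum(p == t for p, t in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# Hypothetical labels for five test photos (not real data).
ground_truth = ["dog", "cat", "cat", "dog", "cat"]
predicted = ["dog", "cat", "dog", "dog", "cat"]

print(accuracy(predicted, ground_truth))  # 4 of 5 labels match -> 0.8
```

The photos used for testing should be ones the model did not see during training, otherwise the score says little about how the model handles new images.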

How is Ground Truth data used in text recognition models?

What if you didn’t want a model that could distinguish between cats and dogs in photos, but one that could read and transcribe historical documents? These kinds of models are the core technology behind all handwritten text recognition platforms, and they are trained with Ground Truth in exactly the same way. In this case, the Ground Truth data isn’t images of cats and dogs but images of texts with accurate transcriptions. Using machine learning, the computer learns from the data which characters in the image represent which characters in the transcription. Then, just like before, it takes this information and uses it to create a model.

Some text recognition platforms only allow you to use models that have been trained by the creators of the platform. What sets Transkribus apart is that it allows users to create their own models and train them to read a particular type of document. Because these custom models are trained on very specific Ground Truth data, they tend to be more accurate at transcribing documents similar to those in the Ground Truth dataset. This is ideal if you have very specific documents, such as letters by a small group of people, a handwritten diary, or notarial documents from a certain time period.

Ground Truth is known as “Training data” within Transkribus. © Transkribus

How do I prepare Ground Truth for a Transkribus model?

As you might have guessed already, training your own custom model requires creating Ground Truth data to train it on. In most cases, you will need at least 10,000 words of transcribed handwritten text or 5,000 words of transcribed printed text to train your first model. However, this varies depending on the type of material and model.
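If you keep your transcriptions as plain text outside Transkribus, a rough check against these guideline figures might look like the sketch below. The thresholds are the ones quoted above. Everything else is an assumption made for illustration: the function names, the sample text, and counting words by splitting on whitespace, which may not match how Transkribus itself counts words.

```python
# Guideline minimums quoted in this post; the real requirement varies by material and model.
MIN_WORDS = {"handwritten": 10_000, "printed": 5_000}


def count_words(transcriptions):
    """Count whitespace-separated words across all transcribed pages."""
    return sum(len(text.split()) for text in transcriptions)


def words_still_needed(transcriptions, material):
    """Return how many more words are needed to reach the guideline minimum."""
    return max(0, MIN_WORDS[material] - count_words(transcriptions))


# Hypothetical transcribed pages (not real data).
pages = [
    "Dear Sir, I received your letter",
    "of the 4th instant yesterday",
]

print(count_words(pages))                        # 11
print(words_still_needed(pages, "handwritten"))  # 9989
```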

There are three main ways of finding suitable Ground Truth data for models in Transkribus:

You can manually transcribe documents. This is done by uploading images of the documents to Transkribus and then manually typing out the transcription in the text editor. You can find out more about manual transcription in our Help Center.
You can find pre-transcribed examples of texts. For example, if a colleague has already been using Transkribus to transcribe documents similar to the ones you are working on, they can share these directly with your Transkribus collection and you can then save them as Ground Truth.
You can take a public model as a base. Transcribe all your documents with the most suitable public model and then correct those transcriptions to make them more accurate and re-save them as Ground Truth. This will tailor the model to your specific documents, and save a lot of time in the process.

Using a public model as a base for your custom model can save a lot of time and effort. © Transkribus

What is important to remember when preparing Ground Truth?

The success of your model depends on the quality of your Ground Truth data. And when it comes to quality, the most important aspects are accuracy and consistency.

Firstly, your transcriptions should be as accurate as possible. In machine learning, the model automatically assumes that the Ground Truth data it has been given is true. That means that if there are inaccuracies in your Ground Truth, the model will treat them as correct and reproduce them, which lowers the accuracy of the transcriptions it goes on to produce.

Secondly, your transcriptions should be consistent. There are many different ways to transcribe linguistic features such as diacritics, ligatures, or S-characters. The way you choose to transcribe these features in your Ground Truth dictates the way that your model will go on to transcribe them. Therefore, it makes sense to transcribe your Ground Truth documents in exactly the way you want the rest of your documents to be transcribed, so that the model can learn your system and apply it to later documents.
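One simple way to spot inconsistencies, assuming you have your transcriptions available as plain text, is to list which pages use each special character. The sketch below is not a Transkribus feature. The function name and the sample pages are hypothetical, and it only looks at non-ASCII characters, so it is a rough aid rather than a full consistency check.

```python
from collections import defaultdict


def special_character_pages(pages):
    """Map each non-ASCII character to the sorted list of page names that contain it."""
    usage = defaultdict(set)
    for name, text in pages.items():
        for ch in text:
            # Skip plain ASCII letters, digits, punctuation and all whitespace.
            if not ch.isascii() and not ch.isspace():
                usage[ch].add(name)
    return {ch: sorted(names) for ch, names in usage.items()}


# Hypothetical pages: two use the long s, one uses a plain s.
pages = {
    "page_1": "das Waſſer",
    "page_2": "das Wasser",
    "page_3": "ein Haſe",
}

print(special_character_pages(pages))  # {'ſ': ['page_1', 'page_3']}
```

Here the long s appears on pages 1 and 3 but not on page 2, which is a hint to check whether page 2 was transcribed under a different convention.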

You can find out more about linguistic features to pay particular attention to on this page.

Where can I find out more about preparing Ground Truth and training models in Transkribus?

Preparing Ground Truth and models in Transkribus is an extensive topic, and it is worth doing some research before embarking on your first model. Here are some resources where you can find out more about training text recognition models with Ground Truth data:

Our Help Center is a mine of information on all aspects of Transkribus. You can check out the section about training text recognition models here.
The FAQs on our website provide answers to the most commonly asked questions about models and training data.
We have also prepared the following video as a user-friendly guide to training models in Transkribus:


