How to improve the CER of your model

One of the biggest advantages of Transkribus is the ability to train custom handwritten text recognition models. This unique feature allows you to tailor the automatic transcriptions to the specific handwriting or printed text in your documents, resulting in more accurate transcriptions.

However, training accurate models is a skill that takes a bit of time to master. If you are new to model training, you may quickly become frustrated at the high Character Error Rate, or CER, of your model. This is a percentage between 0% and 100% that shows how accurate the model is. A model with a CER of 100% will produce a completely inaccurate transcription, whereas a model with a CER of 0% will give a perfect, error-free transcription.

In general, you should aim for a CER of 10% or less. This will produce transcriptions that are accurate enough for search purposes and further analysis. But if your model’s CER is higher than that, don’t despair — there are plenty of easy ways to bring down the CER and create a model that is a good fit for your documents. Let’s take a look at the five easiest ways to improve the CER of your model.

The CER is shown for each model, along with the language and script. Image via Transkribus.

What is the CER?

Before we start, let’s take a quick look at what the CER is. The CER is the percentage of characters that were transcribed incorrectly by the text recognition model during testing. If a model has a CER of 5%, this means that, compared to manual transcription, 5 out of 100 characters were incorrectly transcribed by the model — a relatively low number.

But how is the CER calculated? When you create a model, you have to provide two sets of accurate, manually transcribed pages: the training set, which is used to train the model, and the validation set, which is usually a selection of pages set aside from the training material and is used to test the model. This training data is also known as Ground Truth.

During training, the model analyses all the pages in the training set and tries to learn the handwriting. It then tests what it has learned by attempting an automatic transcription of the pages in the validation set. The model’s automatic transcription of the pages is compared against the accurate manual transcription, and the number of errors is calculated. This is then turned into a percentage and you have your CER.
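
If you are curious about the arithmetic behind that percentage, the sketch below shows how a character error rate is typically computed: count the minimum number of character insertions, deletions, and substitutions (the Levenshtein edit distance) needed to turn the model's transcription into the Ground Truth, then divide by the length of the Ground Truth. This is only an illustration of the general idea in Python, not Transkribus's internal code, and the example strings are invented.

```python
# A minimal sketch of how a character error rate is typically computed:
# the Levenshtein edit distance between the model's transcription and the
# Ground Truth, divided by the number of characters in the Ground Truth.

def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution / match
            ))
        previous_row = current_row
    return previous_row[-1]


def cer(ground_truth: str, model_output: str) -> float:
    """Character Error Rate, expressed as a percentage of the Ground Truth length."""
    return 100 * levenshtein(ground_truth, model_output) / len(ground_truth)


# Invented example: the output is perfectly readable, but case and
# punctuation differences still count as errors.
ground_truth = "St. John's Lane, 1852"
model_output = "st johns lane 1852"
print(f"CER: {cer(ground_truth, model_output):.1f}%")  # prints "CER: 28.6%"
```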

The more epochs (shown on the x-axis) that are performed, the lower the CER (shown on the y-axis) becomes. Image via Transkribus.

The first time your model goes through this process — known as an epoch — you can expect your CER to be quite high. However, the model will then perform many more epochs, learning more and more each time and making fewer and fewer errors when testing itself on the validation set. Over time, the model will have learned all it can and each epoch will result in the same CER. This figure is taken as the CER of your model.

One other thing…

Keep in mind that the CER counts every tiny discrepancy from the Ground Truth as an error, including spaces, punctuation, and lower case letters in place of upper case. It could be that your model has a high CER, but that most of the errors do not concern the actual letters and that the transcriptions are actually quite accurate. It is therefore always worth testing the model on a few pages after training, because even a model with a higher CER might still give you a searchable text suitable for your purposes.
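
To put a number on this, take the invented example from the sketch above: the transcription "st johns lane 1852" differs from the Ground Truth "St. John's Lane, 1852" in six of its 21 characters (three lower-case substitutions and three missing punctuation marks), giving a CER of almost 29%, even though every word is perfectly legible.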

Five ways to improve the CER of your model

If your model has completed many training epochs and you are still receiving quite a high CER and inaccurate transcriptions, here are five things you can do to improve the accuracy of your model.

1. Make sure your training data is accurate.

Your training data consists of the manually transcribed pages you provide for your training set and validation set. These pages should be 100% accurate and completely error-free.

This is important because the model is only as accurate as the training data it has been given. If there are mistakes in that training data, then those mistakes will be replicated in anything the model tries to transcribe. If you are receiving very high CERs, then it’s worth going through your training data and checking that it is as accurate as possible.

The more accurate your training data is, the more accurate your model will be. Image from NAF Court Records, via Transkribus.
2. Make sure your training data is consistent.

Likewise, your training data should be consistent. This is particularly relevant if your documents contain abbreviations, unusual punctuation, or other “non-standard” language elements. If these sorts of elements are inconsistently transcribed in the training data, then you risk confusing the model, resulting in a higher CER. 

Visit our Help Center for more information about consistency with your training data.

Being consistent with transcription conventions teaches your model to transcribe in the same way. Image from Marjory Fleming’s Diary, via Transkribus.
3. Don’t forget about baselines.

While it’s easy to focus on just the text part of the transcription, don’t forget about the layout. Before each text recognition, Transkribus performs a layout analysis. This enables the platform to pinpoint the location of the text on the page, so that it knows what to transcribe during the text recognition stage.

It’s therefore important that the baselines (the coloured lines under each line of text) are accurately shown in your training data. That way, the model will only try to find characters in places where they actually exist, creating more accurate transcriptions. You can find out how to adjust baselines in our Help Center.

Accurate baselines ensure that the model correctly learns where the text is on the page. Image from “Bulliot, Bibracte et moi” project, via Transkribus.
4. Keep adding more data.

If you’ve gone through your training data and you are sure the text and baselines are entirely accurate and consistent, then the next step would be to add more training data. 

In general, we recommend having at least 25 pages of training data for a model. But of course, the more training data you have, the more information your model has to learn from and the more accurate it will be. 

This is particularly true if your documents are very heterogeneous, for example, if they have many different types of handwriting. In these cases, more training data may be required to bring down the CER of the model.

Selecting a base model means your new model doesn’t have to be trained from scratch. Image via Transkribus.
5. Use a base model.

This last tip can not only improve the CER of your model but also save you time. When setting up your new model, you have the option to select a “base model”. This is a pre-existing model which will be used as the base for your new custom model. Your base model should be trained on documents with a similar language, handwriting, and time period to your own.

Using a base model means that your new model does not need to be trained entirely from scratch. Instead, it can use the information stored in the base model and expand on it with your training data. This usually results in a more accurate model with less training data required, saving you both time and effort.

Need more information about training text recognition models with Transkribus? Check out the Training Models section in our Help Center.
