The Transkribus platform allows users to train Handwritten Text Recognition (HTR) models to automatically transcribe their documents. Many public models, trained by the Transkribus community, are already available and can be used by every Transkribus user: you can find the list here. However, if no public model works well on your documents, you can train a customised Text Recognition model to recognise your documents’ specific script.
HTR models need to be trained to recognise a certain style of writing by being shown images of documents together with their accurate transcriptions. This page explains how to use Transkribus Lite to train an HTR model and apply it to automatically transcribe your documents.
Before starting the training of an HTR model, you need to prepare the training data, i.e. the images and the corresponding accurate transcriptions from which the model will learn.
Depending on the type of material and the hands, between 5,000 and 15,000 words (around 25-75 pages) of transcribed material are required. A smaller amount of training data is usually required if you are working with printed rather than handwritten text.
The neural networks of the Handwritten Text Recognition engine learn quickly, and the more training data they have, the better the results will be.
To create training data for HTR in Transkribus:
- Go to the Tools left-side menu and click on “Create a collection”
- After entering the name and creating the collection, upload your images (.jpeg or .png) or PDFs.
- After the upload, select the pages/documents you want to use for the training and run the Layout Recognition by clicking on the “Layout Recognition” button in the Tools left-side menu. The Layout Recognition creates the correspondence between the lines in the image and the lines in the text editor.
- Open a page with the “Edit” button and transcribe it. When your transcription is complete and accurate, save the page as “Ground Truth” (status used to indicate the pages to use for training). Continue the transcription for all the pages to include in the training data.
Once you have between 25 and 75 transcribed pages, it is time to train the Text Recognition model. Watch the video or read the instructions below to understand how to start the training.
Click on the “Training” tab at the top, to the right of “Workdesk”. This area is dedicated to the training of both Text Recognition and Baselines models. In this case, we are interested in training a Text Recognition model, which is selected by default when you open the tab.
Then, you need to select the collection containing your training data. Type the collection title or collection ID and select it.
Be aware that you cannot select documents from different collections for the same training. To work around this, before starting the training, link all the documents to a single collection by clicking the three dots at the bottom of each document thumbnail in the collection-overview page.
After selecting the collection, the actual training setup begins. It is divided into four sections:
1. Model Setup
Here, you are asked to add the metadata of your model, in detail:
- Model Name (chosen by you)
- Description of your model and the documents on which it is trained (material, periods, hands…)
- Language(s) of your documents
- Time span of your documents
You can then decide which transcript version to use for the training: the latest transcription or Ground Truth only. With the first option, all the latest transcripts, regardless of how they were saved, are displayed and can be selected for training. If you choose “Ground Truth only”, only the pages saved as Ground Truth are selectable.
2. Training Data
During the training, the pages are divided into two groups:
- Training Data or Training Set: set of examples used to fit the parameters of the model, i.e. the data on which the knowledge in the net is based. The model is trained on the pages selected as Training Data.
- Validation Data or Validation Set: set of examples that provides an unbiased evaluation of the model, used to monitor its accuracy during the training. In other words, the pages of the Validation Set are set aside during the training and are used to assess the model’s accuracy.
We recommend that your Validation Set is about 10% of the Training Set. The pages in your Validation Set should be representative of the documents in your collection and cover all the types of examples (hands, layouts, periods); otherwise, the measurement of the model’s performance could be biased.
Select here the pages to include in the Training Data. By ticking the box near the document’s title, you can select all the transcriptions available in the document. But you can also expand the document’s content and select only some pages. The selected pages will be listed on the right.
The pages which do not contain any transcription cannot be selected. To view a page in a new tab, click the eye icon.
3. Validation Data
In the next section, select the pages to assign to the Validation Data. Remember that the Validation Data needs to be varied and should ideally contain all types of elements of the documents included in the Training Data. We recommend not skimping on the Validation Set: assign around 10% of your transcriptions to it.
You can select the pages manually or assign them automatically. The manual selection works as described above for the Training Data. Only the pages that contain text and have not been included in the Training Data are selectable. With the automatic selection, 2%, 5% or 10% of the Training Set is automatically assigned to the Validation Set: in this case, simply click on the percentage you want to assign. The automatic selection is recommended because it produces a more varied Validation Set.
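Transkribus performs the automatic assignment for you, but the idea behind it can be sketched in a few lines of plain Python (this is an illustration, not Transkribus code; the page names are hypothetical):

```python
import random

def split_pages(pages, validation_fraction=0.10, seed=42):
    """Randomly split a list of page IDs into a Training Set and a
    Validation Set. A random split helps the Validation Set stay
    representative of the whole collection."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = pages[:]
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# e.g. 50 transcribed pages, 10% set aside for validation
pages = [f"page_{i:03d}" for i in range(1, 51)]
train, val = split_pages(pages)
print(len(train), len(val))  # 45 5
```

The key point is that the validation pages are drawn from across the whole collection rather than, say, the last few pages of one document, which could all be in the same hand.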
4. Overview
The last section contains an overview of the model configuration. Here, at the bottom of the page, it is also possible to modify two advanced parameters:
Number of Epochs
The number of epochs refers to the number of times that the model works through the Training and Validation Data. In this case, the number indicates the maximum number of epochs to train, because the training will stop automatically when the model no longer improves (i.e. it has reached the lowest CER it can). To begin with, it makes sense to stick to the default setting of 250.
Early Stopping
The Early Stopping value of 20 means that if the CER of the Validation Set does not go down for 20 consecutive epochs, the training will be stopped. If there is no or little variation in the Validation Data, the model may stop too early. For this reason, we recommend creating a varied Validation Set that contains all types of hands and document typologies of the Training Set.
If your Validation Set is rather small, increase the “Early Stopping” value to prevent the training from stopping before the model has seen enough of the training data.
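The stopping rule described above can be made concrete with a short sketch (again an illustration under the assumptions described here, not the actual Transkribus implementation):

```python
def train_with_early_stopping(epoch_cers, patience=20, max_epochs=250):
    """Simulate the stopping rule: halt when the validation CER has not
    improved for `patience` consecutive epochs, or when `max_epochs`
    is reached. Returns the stopping epoch and the best CER seen."""
    best_cer = float("inf")
    epochs_since_improvement = 0
    for epoch, cer in enumerate(epoch_cers[:max_epochs], start=1):
        if cer < best_cer:
            best_cer = cer
            epochs_since_improvement = 0  # improvement resets the counter
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:
            return epoch, best_cer
    return min(len(epoch_cers), max_epochs), best_cer

# Simulated per-epoch validation CERs: steady improvement for 30 epochs,
# then a plateau that never beats the best value.
cers = [0.50 - 0.01 * i for i in range(30)] + [0.25] * 40
stop_epoch, best = train_with_early_stopping(cers)
print(stop_epoch)  # 50: training stops 20 epochs after the last improvement
```

This also shows why a small or unvaried Validation Set is risky: if its CER fluctuates for reasons unrelated to real model quality, the counter can run out before the model has finished learning.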
After checking all the details and, if necessary, adjusting the advanced parameters, click “Start training” to launch the training.
You can follow the progress of the training by clicking the “Jobs” button in the “Transkribus Organizer” left-side menu. The completion of every epoch will be shown in the Job’s description, and you will receive an email when the training process is completed.
Depending on the traffic on the servers and the amount of material, your training might take a while. In the “Jobs” window, you can check your position in the queue (i.e. the number of trainings ahead of yours). You can perform other jobs in Transkribus or close the platform during the training process. If the Job status is “created” or “running”, please don’t start a new training, but just be patient and wait.
After the training
After your model’s training is finished, it will be available among your private models. In order to access it, go to the “Training” tab and click “Model Manager”: here, you can browse all the public models and your private ones. To find your frequently used models quickly, click the Star to the right of the model name, and the model will appear in your “Favorite Model” list.
When you select one model, the model metadata appears on the right:
- the model name;
- its creator;
- the number of words on which it was trained;
- when it was trained;
- the language(s) of the document used for the training;
- the type of material (handwritten/printed);
- the Character Error Rate on the Validation Data;
- the model ID.
The performance of a model is determined based on the “distance” between a perfect transcription and the recognised text, and it is measured by the Character Error Rate (CER), i.e. the percentage of characters that have been transcribed incorrectly by the Text Recognition model.
The CER indicated here is measured on the pages of the Validation Data and shows how the Text Model performs on pages that it has not been trained on. Results with a CER of 10% or below can be considered very good for automated transcription. Results with a CER of 20-30% are sufficient to work with powerful searching tools like Smart Search. For more details, see our How to Search Documents with Smart Search.
Clicking on “Description”, you can read the description added by the model’s creator and see the Learning Curve of the model.
The “Learning Curve” graph shows the accuracy of your model. The y-axis represents the Character Error Rate; the curve goes down as the training progresses and the model improves. The x-axis represents the epochs, i.e. the training progress. During the training process, Transkribus makes an evaluation after every epoch. In Figure 5, 109 epochs were trained: the maximum number of epochs was set to 250, but the training stopped automatically at 109 because the model no longer improved.
The graph shows two lines, one in blue and one in green. The blue line represents the CER on the Training Set as training progresses. The green line represents the CER measured on the Validation Set.
As soon as the training is finished, you can try out your model on any other historical document with similar writing. The results will depend on how similar and how clear the writing in the historical document is.
Now that you have your model, you can use it to automatically generate transcripts of your documents.
After uploading the document, select the whole document or the pages you want to transcribe. Then click on “Text Recognition” in the Tools left-side menu and choose the model you want to apply.
The top bar shows how many credits you will use for the job and features two additional options that you can enable only before starting the recognition:
- Smart Search: enables you to perform a more advanced and powerful type of search on the documents. The standard search goes through the transcription as it appears in the text editor; with Smart Search, Transkribus stores many possible alternatives for each word and makes these available for searching. Thus it is possible to find search terms that would not be found with a regular full-text search that stores only one candidate per word. For more details, see our How to Search Documents with Smart Search.
- Language Model: is created automatically during the training of the Text Recognition model, and it is based on the Training Data. It can be added to the recognition process, but the effect needs to be tested in each individual case: in many cases the language model improves the recognition, but we have also seen cases where it does not.
To launch the recognition, click “Start”. You can check its progress by clicking the “Jobs” button in the “Transkribus Organizer” left-side menu. When the recognition is finished, open or reload a page, and the text will appear to the right of the image.