This guide explains how to use Transkribus to train a Handwritten Text Recognition (HTR+) model to recognise your documents. After training, the model will help you to automatically transcribe and search your collection.
- The Transkribus platform allows users to train a Handwritten Text Recognition (HTR+) model to automatically process a collection of documents. The model needs to be trained to recognise a certain style of writing by being shown images of documents and their accurate transcriptions.
- Training a model requires between 5,000 and 15,000 words (around 25-75 pages) of transcribed material. If you are working with printed rather than handwritten text, a smaller amount of training data is usually required.
- Using a base model can reduce the amount of training data required. As a base model you can use either one of the publicly available models in Transkribus, if there is one suitable for your documents, or one of your own previously trained models. An overview of the currently available public models can be found here: https://transkribus.eu/wiki/images/d/d6/Public_Models_in_Transkribus.pdf
- The model training function is not automatically included in the standard Transkribus platform. When you are ready to train a model, contact the Transkribus team (firstname.lastname@example.org) and they will give you access to the feature.
- We recommend that you start the training process with between 5,000 and 15,000 words of transcribed material, depending on whether it is printed or handwritten text. As already indicated, base models can reduce the required amount of training data.
- The neural networks in HTR+ learn quickly and the more training data they have, the better the results will be.
- You can create training data for HTR+ in Transkribus by uploading images and transcribing text. For full instructions, see How To Transcribe Documents with Transkribus – Introduction.
- If you already have existing transcripts, you can also use these to train your model. For more information see How To Use Existing Transcriptions to train a HTR model.
- The main options for the training of a model can be found in the “Tools” tab in the “Text Recognition” section.
- As “Method”, “HTR (CITlab)” is the most effective option to choose.
- By clicking the “Models” button you can see which models are available and which documents they were trained on.
- The “Train” button takes you to the options for training models.
Figure 1 Where to find the tools for the training
- To get to the “HTR+ Training” window, click the “Train” button in the “Tools” tab.
Figure 2 How to open the “HTR Training” window.
- The following window will open up:
Figure 3 “HTR Training” window
- In the upper section you will need to add details about your model.
Figure 4 Adding details about the model
- Please add:
- Model Name (chosen by you)
- Language (of your documents)
- Description (of your documents and the pages selected as training and test data)
- Note: “Nr. of Epochs” refers to the number of times the network passes through the entire set of training data. If you increase the number of epochs, the training process will take longer.
- It is possible to add a base model to your training. If you choose this option, the information contained in the base model will be integrated into the new model. To be of benefit, the base model needs to be similar to the writing it is supposed to recognise afterwards. A base model can speed up the training process, but an improvement in quality is not guaranteed and has to be tested in each individual case.
- One big benefit of working with base models is that they can make it possible to start with a smaller number of training pages, which reduces the transcription workload.
- To use a base model, you simply need to choose the desired one with the “Choose…” button next to “Base Model:”.
- Next, you need to select the pages that you would like to be included in your set of training data.
- To add all the pages of your document to the Training Set, click on the folder and click “+Training”.
- To add a specific sequence of pages from your document to the Training set, double-click on the folder, click on the first page you wish to include, hold down the “Shift” key on your keyboard and then click the last page. Then click “+Training”.
- To add individual pages from your document to the Training Set, double-click on the folder, hold down the “CTRL” key on your keyboard and select the pages you would like to use as training data. Then click “+Training”.
- The pages you have selected will appear in the “Training Set” space.
Figure 5 Adding all the pages for training
- During the training process, a Validation Set of pages is set aside and is not used to train the HTR. These test pages can then be used to assess the accuracy of your model.
- We recommend that you select at least one test page for every 50-100 pages of your Training Set.
- The pages in your Validation Set should be representative of the documents in your collection.
- The more pages there are in your Validation Set, the longer the HTR training will take.
- To add pages to the Validation Set, follow the same process as above but click the “+Validation” button.
Figure 6 Adding pages to the Test set
- To remove pages from the “Training Set” or “Test Set”, click on the page and then click the red cross button.
Figure 7 Removing pages
- You can make a note of the pages used in your test set in the model description box.
- Start the training by clicking the “OK” button.
- You can follow the progress of the training by clicking the “Jobs” button in the “Server” tab.
Figure 8 Check the progress of the training with the “Jobs” button
- The completion of every epoch will be shown in the “Jobs on server” window, as well as the completion of the training process.
- Training an HTR+ model will take at least a couple of days. You can perform other jobs in Transkribus or close the platform while the training runs.
Figure 9 “Jobs on server” overview
- Once the training of your model is finished, it will be available in your collection.
- To access it, click the “Models” button in the “Tools” tab.
Figure 10 Opening the “Choose a model” window
- The following window will open up:
Figure 11 “Choose a model” window
- On the left side of the window you can see an overview of the available models.
- On the top right side of the window the details of the model are shown.
- On the bottom right you can see the learning curve of your model. More information about these statistics can be found below.
- The “Learning Curve” graph indicates the accuracy of your model.
Figure 12 “Learning Curve” of your model
- As you can see in Figure 12, the y-axis is labelled “Accuracy in CER”.
- “CER” stands for Character Error Rate, i.e. the percentage of characters that have been transcribed incorrectly by HTR+.
- “Accuracy in CER” is indicated as a percentage on the y-axis. The curve will always start at 100% and will go down as the training progresses and the model improves.
- The x-axis is defined as “Epochs”.
- During the training process, Transkribus will make an evaluation after every epoch, i.e. after every full pass through the Training Set. In Figure 12 the training ran for 20 epochs.
- When you train a model you can set how many epochs the training should run for. The more epochs there are, the longer the training will take.
- The graph shows two lines, one in blue and one in red.
- The blue line represents the progress of the training.
- The red line represents the progress of evaluations on the Test Set.
- First the program trains itself on the Training Set, then it tests itself on the pages in the Test Set.
- Underneath the graph, two percentage values are shown relating to the CER for the Training Set and the Test Set.
- In Figure 12, the model performs with a 14.19% CER on the Training Set and 9.57% on the Test Set.
- The value for the Test Set is the most significant as it shows how the HTR+ performs on pages that it has not been trained on.
- Results with a CER of 10% or below can be considered very good for automated transcription.
- Results with a CER of 20-30% are sufficient to work with powerful Keyword Spotting technology. For more details, see our How To Transcribe – Keyword Spotting guide.
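The CER values discussed above follow a standard definition that can be reproduced directly. The sketch below is not Transkribus's own implementation, just the textbook metric: the minimum number of character insertions, deletions and substitutions needed to turn the HTR output into the Ground Truth, divided by the length of the Ground Truth.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    # (counts insertions, deletions and substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth: str, hypothesis: str) -> float:
    # Character Error Rate as a percentage of the Ground Truth length.
    return 100.0 * levenshtein(ground_truth, hypothesis) / len(ground_truth)
```

For example, if the Ground Truth reads “the quick brown fox” and the HTR output reads “the quick brawn fox”, one wrong character among 19 gives a CER of about 5.3% — and, as noted above, the whole word “brawn” would be marked as wrong in the comparison view.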
You can measure the accuracy of your model on specific pages from your Training and Validation Sets with the “Compute Accuracy” functionality in the “Tools” tab. To do so, an HTR transcript first needs to be generated.
As “Reference”, choose a page version that was correctly transcribed (Ground Truth: a manual transcription as close to the original text as possible). To get the most meaningful value, it is best to use pages from a sample set that have not been used in the training and are therefore new to the model. Using pages from the Validation Set is also an option, even if it is not as ideal. Using pages from the Training Set is not a good idea, because this will output lower CER values than the actual ones.
As “Hypothesis”, choose the version that was automatically generated with an HTR model and whose quality you would like to assess.
You can change the versions to be compared by clicking on the grey buttons beside “Reference” and “Hypothesis”. Double-click to choose the desired version of the document in the window that appears. The versions you can choose for “Reference” and “Hypothesis” are different versions of your document, which were created after running a new job or saving transcriptions.
Figure 1 “Compute Accuracy” within the “Tools”-tab
Figure 2 Choosing the right version by double-clicking
Compare text versions
If you click on “Compare Text Versions” you will get a visual representation of what the HTR transcribed correctly and incorrectly.
Figure 3 Compare Text Versions
Please note that even if only one character is wrong, the whole word is marked in red. The word as it is written in the Ground Truth transcription is shown in green. In the passages without colour the recognised text is identical with the Ground Truth.
“Simple” accuracy check
This accuracy check is the quick version. To access it, click on “Compare…”.
First, please make sure the right versions have been chosen in the upper section of the window that appears.
Then hit the “Compare” button. The result will be shown in the lower section of the window after a few seconds.
Figure 4 Results
The values are calculated for the page you have currently loaded in the background. In the example we have a CER of 2.34% on that page, which means that 97.66% of the characters are correct in the automated transcript.
By double-clicking the date and time in the “Created” column of the simple comparison, you will automatically arrive at the “Advanced Statistics”.
Figure 5 Advanced Statistics
Here you will find more detailed values, and the results can be exported into an Excel file.
The overview contains two tables: one with the “Overall” values, which are the average values of the recognition across all pages in a document, and one below with the values for the individual pages. This way you can compare the results on different pages, and by double-clicking a line you will arrive at the text comparison, where you can check which words or text passages have been challenging.
Note: in the “Overall” value, pages are weighted according to the number of recognised words on each page.
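As an illustration of this weighting, here is a minimal sketch with made-up page values showing how a word-weighted average differs from a plain mean of the per-page CERs:

```python
# Hypothetical per-page results: (CER in percent, number of recognised words).
pages = [(2.34, 180), (5.10, 95), (1.20, 210)]

total_words = sum(words for _, words in pages)

# Each page contributes in proportion to its recognised word count,
# so long pages influence the "Overall" value more than short ones.
overall_cer = sum(page_cer * words for page_cer, words in pages) / total_words

print(round(overall_cer, 2))  # word-weighted average, not a plain mean
```

With these example numbers the short, worse page (5.10% CER, 95 words) pulls the overall value up less than it would in an unweighted average.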
When opening the “Compare” window you can also choose another tab, “Advanced Compare”.
Figure 6 Advanced Compare
With “Advanced Compare” you can check the accuracy of several pages at once by adding the pages you would like to evaluate (e.g. 1-6). By clicking on the button with the three dots on the far right you can choose individual pages.
After starting the accuracy check by clicking on “Compare”, the results will be shown in the table below, and by double-clicking the value in the “Created” column you will arrive at the “Advanced Statistics” again.
The sample compare functionality is useful if you are planning a bigger recognition project and would like to evaluate the model before you run it on the whole document. The sample compare chooses random lines from the sample document and tests the performance of the model on those.
It makes sense to put pages aside at the beginning to use as samples. The advantage is that the model has never seen the material it will be tested on, so the evaluation result will be more reliable.
You can find the “Compare Samples” functionality within the “Tools” tab in the “Compute Accuracy” section. To open it, click on “Compare Samples”.
Figure 7 Sample Compare
Under “Nr. of lines” you can define how many lines you would like to test. 500 is a recommended average. The more lines you enter here, the lower the variation in the result will be.
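The reason more lines mean less variation is basic sampling statistics: the uncertainty of an estimate based on randomly chosen lines shrinks with the square root of the sample size. A rough sketch (the per-line spread of 3 percentage points is an arbitrary assumption for illustration, not a Transkribus figure):

```python
def standard_error(per_line_std: float, n_lines: int) -> float:
    # Rough standard error of a sample-based CER estimate, assuming
    # per-line CERs vary independently with the given standard deviation.
    return per_line_std / (n_lines ** 0.5)

# Halving the uncertainty requires roughly four times as many lines:
print(standard_error(3.0, 125))  # ~0.27 percentage points
print(standard_error(3.0, 500))  # ~0.13 percentage points
```

This is why 500 lines is a reasonable default: beyond that, adding more lines buys only slowly diminishing precision for a growing transcription effort.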
From the list on the left side, choose the collection and document the sample should consist of. Then click on “Create sample”. Transkribus will now randomly choose the defined number of lines from the selected documents.
The next step is to load the sample document (you can find it in your collection) and manually transcribe the line snippets. There will be only one line per page, so the transcribing will in most cases be quick. When you have finished one line, jump to the next page of the sample document to proceed.
After you have finished transcribing, run the model you would like to test on the sample document to produce the transcription that you can then compare with the sample compare.
To do so, open the “Sample” tab in the “Compare samples” window and choose the document you would like to evaluate. Then click on “Compute” to start the job. As soon as “Completed” appears in the “Status” column, you can double-click the cell in the “Created” column to display the results.
Figure 8 Generating results with the sample compare
- Now that you have your model, you can use it to automatically generate transcripts of the documents in your collection.
- First, upload your documents to Transkribus.
- Second, segment your documents into text regions, lines and baselines.
- For more information on uploading and segmentation, please consult How To Transcribe Documents with Transkribus – Introduction.
- To access your model, click on the “Tools” tab and go to the “Text Recognition” section.
- Click “Run”, then click “Choose HTR-model”. Choose your HTR model from the list on the left-hand side of the screen and click OK.
- Select whether you wish to generate an HTR transcript of one page or several pages.
- Press “Run” to start the text recognition process.
- Once the recognition is finished, the automated transcription will appear in the text editor field.
- Language models are the new dictionaries in Transkribus: they have taken over most of the functionality of dictionaries.
- They are created automatically with the HTR-model and can be added to the recognition process:
- Click on “Run” within the “Text Recognition”-section of the “Tools”-tab.
- Click on “Select HTR model”.
- In the window that appears, you can find the language model option at the top right.
- Click on the drop-down menu and choose “Language model from training data”.
- The effect of language models needs to be tested in each individual case: in many cases they are able to improve the recognition, but so far we also see cases where they don't.
Figure 13 Language models
- Custom dictionaries are primarily used if you are interested in special phrases in the document. A custom dictionary needs to be created by the Transkribus team; if you need one, please contact us via email@example.com.
Figure 14 Run model
- You can share your HTR model with other collections in Transkribus, whether they are owned by you or by other users.
- If you want to share your model with another collection, you must have access to that collection.
- Right click on the name of your model (on the left side of the “Choose a model” window).
Figure 15 Share a model by right-clicking the name of your model
- Then select “Share model…”
- The “Choose a collection via double click” window will open up.
- In the next window, click the collection you would like to share the model with and press “OK”.
- In this window, you can also create a new collection for the model with the “Create” button.
- Click “OK” to confirm.
Figure 16 How to share your model
- Once you have chosen the collection, click “OK” once more and the model will now be shared.
Figure 17 Confirm the sharing of your model
Figure 18 Model has been shared
- As soon as the training is finished, you can try out your model on any other historical document with similar writing.
- You can share your model with other people who can benefit from it too.
- You can repeat the training process with more data in order to generate more efficient results.
- You can measure the accuracy of your model with the “Compute Accuracy” function.
- The results of the HTR will depend on how similar and how clear the writing in the historical document is.
- The Transkribus team is working on an algorithm which will make it possible to automatically transcribe any kind of document, without the need to prepare training data. The technology is learning from all training data processed in Transkribus.
- So the more data we work with, the more efficient the technology will become. Train your own model and be part of it!
We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.