In this guide you will learn how to train a recognition model in Transkribus. A trained model helps you to automatically transcribe and search your collection. To get started you will need between 25 and 75 pages of manual transcription. Printed text usually requires less training data than handwritten text, and using a base model can also reduce the amount of training data needed.
The model training functionality is not automatically included in the standard Transkribus interface. When you are ready to train a model, please contact the Transkribus team via firstname.lastname@example.org and they will give you access to the feature.
Training the model
The main options for training a model can be found in the “Tools” tab, in the “Text Recognition” section. To open them, click on the “Train” button.
In the upper section of the window that appears you will need to add details about your model. Please add the following information:
- Model Name
- A short description of the model and its background
Using a base model
A base model lets the new model build on the knowledge of an already existing model. You can add a base model to the training with this button. Suitable base models are public models for similar writing, or a model you have trained yourself on the same or similar documents.
Selecting the Ground Truth
Next, you need to select the pages that you would like to be included in your set of training data. In this list you can find the documents in your collection. By selecting the name of the document, you can add the whole document to the training set with this button.
By clicking on the arrow beside the document name you can choose individual pages. Pages without Ground Truth transcription are greyed out. The pages you have selected will appear in the “Training Set” space.
The Validation Set
During the training process, a validation set of pages is set aside. These pages are not used to train the HTR model; instead, they are used to evaluate the performance of the model.
To add pages to the Validation Set, use this button. Pages you add to the Validation Set are automatically excluded from the Training Set. If you like, you can use these checkboxes to automatically choose 2, 5 or 10% of the data as the Validation Set.
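To get a feel for how many pages a 2, 5 or 10% Validation Set takes away from the Training Set, the split can be sketched in a few lines of Python. This is only an illustration and not how Transkribus selects pages internally: the function name split_training_pages, the random selection, the fixed seed and the rule of always holding out at least one page are all assumptions made for this example.

```python
import random


def split_training_pages(pages, validation_percent, seed=0):
    """Split page identifiers into (training, validation) lists.

    Hypothetical helper for illustration only; it does not call Transkribus.
    """
    if len(pages) < 2:
        raise ValueError("need at least two pages to split")
    if not 0 < validation_percent < 100:
        raise ValueError("validation_percent must be between 0 and 100")

    # Assumption: always hold out at least one page, and always leave
    # at least one page for training.
    n_validation = round(len(pages) * validation_percent / 100)
    n_validation = min(max(1, n_validation), len(pages) - 1)

    # Assumption: pages are picked at random; a fixed seed keeps the
    # split reproducible between runs.
    chosen = set(random.Random(seed).sample(list(pages), n_validation))

    # A page is in exactly one of the two sets, mirroring the rule that
    # validation pages are excluded from the training set.
    training = [p for p in pages if p not in chosen]
    validation = [p for p in pages if p in chosen]
    return training, validation


if __name__ == "__main__":
    pages = [f"page_{i:03d}" for i in range(1, 51)]  # 50 transcribed pages
    training, validation = split_training_pages(pages, 10)
    print(len(training), len(validation))  # prints "45 5"
```

With 50 transcribed pages, a 10% Validation Set leaves 45 pages for training, so with a small collection it is worth checking that enough Ground Truth remains after the split.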
Starting the Training
Start the training by clicking the “OK” button and confirming the windows that appear.
The training process will take some time, depending on how many pages are part of the training. You can exit Transkribus during the training and return later. In the meantime, you can check the progress of the training with the jobs button.
Once the training of your model is finished, it will be available in your collection and you can use it to generate automatic transcripts.