This guideline explains how to use the PyLaia training feature to train a model to recognise printed or handwritten text in your documents. After the training, the model will help you to automatically transcribe and search your collection. The workflow for model training with PyLaia is essentially the same as with HTR+. This guideline therefore focuses on the parameters that can be set for PyLaia training. If you have more general questions on model training and how it is done in Transkribus, you can find more information here: How to Train and Apply Handwritten Text Recognition Models in Transkribus.
- The Transkribus platform allows users to train models to automatically process a collection of documents. PyLaia is a second recognition engine, supported alongside the CITlab HTR+ engine.
- The two engines work quite similarly, and their results in terms of Character Error Rate (CER) are usually comparable.
- One difference is that PyLaia lets users set several parameters themselves. The network structure of PyLaia can also be changed, which makes it a playground for people familiar with machine learning. Modifications to the neural net can be made via the GitHub repository.
- HTR+ usually produces better results for curved or rotated lines, but we are optimistic that PyLaia will soon catch up in this respect.
- In order to use the PyLaia training function, please contact the Transkribus team (email@example.com), who will give you access to the feature.
- If you would like to use the Text to Image tool, please use HTR+; it is not yet implemented for PyLaia. You can find more information on how to import existing transcriptions into Transkribus here: https://readcoop.eu/transkribus/howto/how-to-use-existing-transcriptions-to-train-a-handwritten-text-recognition-model/
- Documents that have been recognised with a PyLaia model can be searched with the full-text search (Solr) in Transkribus.
- We recommend that you start the training process with between 5,000 and 15,000 words of transcribed material, depending on whether the text is printed or handwritten. Base models can reduce the amount of training data required.
- As a base model you can use either one of the publicly available models in Transkribus, if a suitable one exists for your documents, or one of your own models. For PyLaia, only PyLaia models can be used as base models. You will find an overview of the currently available public models here: https://readcoop.eu/transkribus/howto/public-models-in-transkribus/
- The preparation of training data follows the same procedure as for HTR+ models. You can read how it is done here: How to Train and Apply Handwritten Text Recognition Models in Transkribus.
- The main options for the training of a model can be found in the “Tools” tab in the “Text Recognition” section.
- As “Method”, please choose “HTR (CITlab HTR+ & PyLaia)”.
- By clicking the “Models” button you can see which models are available and which documents they were trained on. If you choose “PyLaia” under “Technology”, only PyLaia models will be shown.
- The “Train” button takes you to the model training options.
Figure 1: “Text Recognition” section within the “Tools” tab, used to access PyLaia training
Figure 2: Train-interface
The parameters for PyLaia can be found by opening the “Train” window and then the “PyLaia” tab.
Figure 3: PyLaia parameters
Max-nr. of Epochs
The epochs follow the same logic as for HTR+. To start with, it makes sense to stick to the default setting of 250. Please be aware that too high a number of epochs will slow down the training.
Early Stopping
The default value of 20 means that if the CER on the validation set does not improve within 20 epochs, the training will be stopped.
NOTE: This is important here and for training in general: the validation set needs to be varied and should ideally contain all types of elements found in the documents of the training set. If there is little or no variation in the validation set, the model may stop too early. If your validation set is rather small, therefore, increase the “Early Stopping” value to prevent the training from stopping before it has seen all the training data. In short: do not skimp on the validation set.
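The early-stopping logic described above can be sketched as follows. `should_stop` is a hypothetical helper written for illustration, not PyLaia's actual implementation:

```python
def should_stop(val_cers, patience=20):
    """Return True once the best validation CER has not improved
    for `patience` consecutive epochs (illustrative sketch only)."""
    if len(val_cers) <= patience:
        return False
    best_before = min(val_cers[:-patience])   # best CER before the window
    recent_best = min(val_cers[-patience:])   # best CER within the window
    return recent_best >= best_before         # no improvement -> stop
```

This makes the note above concrete: if the validation set is too uniform, the CER plateaus early, `recent_best` stops improving, and training halts prematurely.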
It is possible to add a base model to your training. If you choose this option, the neural net will learn more quickly and you will save time. To be of benefit, the base model needs to be similar to the writing it is supposed to recognise. A base model is also likely to improve the quality of your recognition results; however, this is not guaranteed and has to be tested for the specific case.
One big benefit of working with base models is that they make it possible to start with a smaller number of training pages, which reduces the transcription workload.
To use a base model, you simply need to choose the desired one with the “Choose…” button next to “Base Model:”.
Figure 4: Adding a base model
The “Learning Rate” defines the size of the update steps during training, i.e. how fast the training proceeds. With a higher value, the CER will go down faster, but the higher the value, the higher the risk that details are overlooked.
This value is adaptive and will be adjusted automatically; the training is nevertheless influenced by the value it starts with. You can go with the default setting here.
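The trade-off can be seen in a toy example. Here plain gradient descent minimises f(x) = x², standing in for the real training: a moderate learning rate converges, a tiny one is slow, and a too-large one diverges. This is an illustration only, not PyLaia's actual optimiser:

```python
def gradient_descent(lr, steps=50, x0=10.0):
    """Minimise f(x) = x**2 with plain gradient descent.
    Toy illustration of the learning-rate trade-off."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x**2 is 2x
    return x
```

With `lr=0.1` the result is close to the minimum at 0; with `lr=0.01` progress is much slower; with `lr=1.1` the steps overshoot and `x` grows without bound.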
We have had some cases where the pre-processing took too much time. If this happens to you, you can switch the “Image Type” to “Compressed”.
You can proceed in the following way: start the training with “Original” and check the progress of the pre-processing every now and then with the “Jobs” button. If it gets stuck, cancel the job and restart it with the “Compressed” setting.
You can open the advanced parameters for PyLaia by clicking the “Advanced parameters” button at the bottom of the standard PyLaia parameters within the “PyLaia” tab.
Figures 5 and 6: Advanced parameters
Deslant: choose this option for cursive writing in order to straighten it. Leave it out for printed documents: if they contain cursive passages in addition to the normal print characters, the effect can be counterproductive.
Deslope: allows more variation in the baselines, i.e. more tolerance for baselines that are not exactly horizontal but sloping.
Stretch: this option stretches narrow writing in order to decompress it.
Enhance: applies a window that moves along the lines in order to optimise passages that are difficult to read. This is useful if there is “noise” in the document.
Enhance window size: this setting belongs to the option just explained and therefore only needs to be set if you want to use “Enhance”. It defines the size of the window.
Sauvola enhancement parameter: please stick to the default setting here.
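For background, Sauvola's method binarises an image with a local threshold T = m · (1 + k · (s/R − 1)), computed from the local mean m and standard deviation s inside a window; the k here corresponds to the enhancement parameter above. The following naive sketch illustrates what the parameter controls; PyLaia's actual pre-processing may differ in detail:

```python
import numpy as np

def sauvola_binarize(img, window=15, k=0.2, R=128.0):
    """Naive per-pixel Sauvola binarisation: threshold
    T = m * (1 + k * (s / R - 1)) from the local mean m and
    standard deviation s. O(h*w*window**2) -- illustration only."""
    h, w = img.shape
    r = window // 2
    binary = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            patch = img[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            m, s = patch.mean(), patch.std()
            t = m * (1 + k * (s / R - 1))
            binary[y, x] = img[y, x] > t   # True = light background pixel
    return binary
```

A larger k lowers the threshold in flat regions, which is why the default is usually safe to keep.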
Line height: value in pixels; if you need to increase the resolution of the line images, you can do it here. 100 is a good value to go for. Attention: if the value is too high, it may lead to an “out of memory” error. You can work around this error by lowering the “Batch size” value (top left in the advanced parameters window), e.g. by half. Please be aware that the lower this value, the slower the training will be. This slow-down should be mitigated in the new version of PyLaia, which will set the batch size automatically.
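The halving workaround amounts to a simple retry pattern. In Transkribus you lower the value manually in the advanced parameters; `run_with_smaller_batches` and `train_step` below are hypothetical names used purely to illustrate the idea:

```python
def run_with_smaller_batches(train_step, batch_size, min_batch=1):
    """Retry a training step with a halved batch size whenever it
    runs out of memory (illustrative sketch of the manual workaround)."""
    while batch_size >= min_batch:
        try:
            return train_step(batch_size)
        except MemoryError:
            batch_size //= 2   # e.g. 24 -> 12 -> 6 ...
    raise RuntimeError("even the smallest batch size ran out of memory")
```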
Line x-height: this setting takes the descenders and ascenders into account. If you set this value, the “Line height” parameter will be ignored.
Please don’t change the following parameters:
Features surrounding polygon
Features surrounding polygon dilate
Left/right padding: 10 (default) means that 10 pixels will be added to the left and right of the line. This is useful if you are worried that parts of the line could be cut off.
Max width: the maximum width a line can reach; anything beyond it will be cut off. 6000 (default) is already a high value. If you have very large pages, you can increase this value further.
These parameters are intended for those who are familiar with machine learning and the modification of neural nets; they are therefore not explained further here.
Batch size: the number of pages that are processed at once on the GPU. You can change this value by entering another number.
Use_distortions True: the training set is artificially extended in order to increase its variation and thereby make the model more robust. If you are working with regular writing and good scans, you do not need this option. To deactivate it, write “False” instead of “True”.
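To illustrate what such distortions do, here is a minimal example that shears a text-line image by shifting each row horizontally in proportion to its height. It is a crude stand-in for the augmentation described above; PyLaia's actual transforms are more varied:

```python
import numpy as np

def random_shear(line_img, max_shift=3, seed=None):
    """Apply a small random horizontal shear to a 2-D text-line
    image array (illustrative augmentation sketch only)."""
    rng = np.random.default_rng(seed)
    h, w = line_img.shape
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.empty_like(line_img)
    for y in range(h):
        s = round(shift * y / max(h - 1, 1))  # rows shift more towards the bottom
        out[y] = np.roll(line_img[y], s)
    return out
```

Each distorted copy shows the model the same text in a slightly different shape, which is what makes the trained model more robust.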
Measuring and understanding results
The validation set is saved to the collection from which the training was started; that is also where the recognition is processed. After the automated recognition, you can measure the accuracy of your model with the “Compute Accuracy” function, which you can find in the “Tools” tab.
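The CER reported by such accuracy measures is, in essence, the Levenshtein edit distance between the reference transcription and the recognised text, divided by the length of the reference. A minimal implementation for illustration:

```python
def cer(reference, hypothesis):
    """Character Error Rate: Levenshtein edit distance between two
    strings divided by the length of the reference string."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        return 0.0 if n == 0 else float("inf")
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m
```

For example, one wrong character in a four-character line gives a CER of 0.25, i.e. 25 %.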
- As soon as the training is finished, you can run your model on any other historical document with similar writing. A language model (comparable to a dictionary) can also be added to the recognition process.
- You can share your model with other people who can benefit from it too.
- You can repeat the training process with more data in order to achieve better results.
- The results will depend on how similar and how clear the writing in the historical document is.
- The Transkribus team is working on an algorithm which will make it possible to automatically transcribe any kind of document, without the need to prepare training data. The technology is learning from all training data processed in Transkribus.
- So the more data we work with, the more efficient the technology will become. Train your own model and be part of it!
We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.