How to Train Baseline Models in Transkribus

Last update 6 months ago
About Transkribus

Transkribus is a comprehensive solution for the digitisation, AI-powered text recognition, transcription and searching of historical documents. Find out more about Transkribus here

Introduction

Layout Analysis (LA) is a fundamental step before applying an HTR model to transcribe documents automatically. It segments the image into text regions and baselines, which is necessary to connect image and text for HTR to work.

Usually, Layout Analysis is performed automatically by clicking on the “Tools” tab and, under the section called “Layout Analysis”, selecting the pages on which to run the segmentation, as explained here.

The default Layout Analysis tool works well for most document typologies but may not be as accurate with documents with complex layouts, such as newspapers, postcards, registers, annotated documents, etc.

If the default automatic Layout Analysis tool works well on your documents, you can continue using it, and you do not need to train a Baseline model.

If, on the other hand, the default Layout Analysis is unsatisfactory for your documents, you can train a Baseline model specific to your document typology. After the training, you can apply your customised Baseline model to your documents, which will then be segmented following the examples you provided for training. Training and applying a Baseline model are possible only in Transkribus eXpert.

Before you start training a Baseline model, remember the difference between it and P2Pala. P2Pala recognises the structure of your documents automatically and enriches them with structural tags. A Baseline model, by contrast, detects only baselines but has the advantage of being trained specifically on the layout of your documents. For this reason, it should be more accurate than the default Layout Analysis tool.

Preparation

The first step is to prepare the pages on which to train the Baseline model. A good number to start with is 50 pages, but how well the model performs depends on the complexity of the layout. After the first training with 50 pages, you can decide whether the Baseline model is good enough or needs more training material.

To prepare the pages, you only need to segment the text regions and the baselines, either automatically or manually. To work on the layout more easily, you can activate the Segmentation view in the viewing profiles, as shown in the figure below. This hides the text editor and leaves more space for the image.

Figure 1. Segmentation view

Depending on the layout complexity, there are three options to segment the pages:

  1. Run the default automatic Layout Analysis that you find under the “Tools” tab, as explained here, and then correct it manually using the Canvas menu to the left of the image.

  2. Draw the Text Regions manually using the Text Region button in the Canvas menu. Then, under the “Tools” tab, run the automatic Layout Analysis to detect the baselines: before running it, remember to uncheck the “Find Text Regions” option. Finally, go through the pages and correct them manually using the Canvas menu.

  3. Draw both the Text Regions and the Baselines manually, using the Text Region button and the Baseline button, respectively, in the Canvas menu to the left of the image.

Which option to choose depends on the document type and on how poorly the default automatic Layout Analysis performs. We suggest trying the first option and moving to the others only if you find that correcting the generated segmentation is more time-consuming than drawing it manually.

You do not need to add any transcription to the pages before training the Baseline model: the training uses only the baselines, so the presence of transcribed text is irrelevant.
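If you want to double-check that a page really contains text regions and baselines before training, the following is a minimal sketch. It assumes, and this is not described in this guide, that you have exported the page as a PAGE XML file, a format in which the layout is stored in `TextRegion`, `TextLine` and `Baseline` elements. The function names and the sample page are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample page (not real project data).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <Baseline points="100,180 900,180"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_layout_elements(page_xml):
    """Count text regions, lines and baselines in one PAGE XML document."""
    root = ET.fromstring(page_xml)
    counts = {"TextRegion": 0, "TextLine": 0, "Baseline": 0}
    for element in root.iter():
        name = local_name(element.tag)
        if name in counts:
            counts[name] += 1
    return counts


def is_ready_for_training(counts):
    """A page is usable if it has regions and every line has a baseline."""
    return (
        counts["TextRegion"] > 0
        and counts["Baseline"] > 0
        and counts["Baseline"] == counts["TextLine"]
    )


counts = count_layout_elements(SAMPLE)
print(counts, is_ready_for_training(counts))
```

Note that the check deliberately ignores any transcribed text, in line with the point above that transcription plays no role in Baseline training.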

Training

Once the 50 or more pages are segmented, it is time to train the Baseline model. Click on the “Tools” tab. Under the “Model Training” section, click on “Train a new model”.

The Model Training window pops up, and on the right, you can choose which engine to train: for the Baseline model, please select “Baselines”, as shown in the figure below.

Figure 2. Model Training window

Before starting the training:

  • On the top left, enter the name and the description of your model.

  • On the top right, under the “Baselines” tab you just selected, there are the training parameters, i.e. the number of epochs and the learning rate. For the first training and if you are not familiar with machine learning, please do not change these parameters.

  • At the bottom, you need to select the pages you want to use to train the model, i.e. the pages you previously segmented into text regions and baselines.
    On the left, select the whole collection or the relevant pages. Click the Training button in the centre to add the selected pages to the Training Set. If you want to consider only the pages with Ground Truth status, select “Ground Truth only” in the drop-down menu on the right, under “Overview”.
    Do the same for the Validation Set. Remember that a good Validation Set should include all the different kinds of layout you would like the trained Baseline model to be able to segment. The Validation Set should be around 10% of the Training Set, so for the first training we suggest including 45 pages in the Training Set and 5 pages in the Validation Set. If you want to assign a percentage of the Training Set to the Validation Set automatically, tick a percentage in the “automatic selection of validation set” option before clicking the “Training” button.

  • On the right, under “Overview”, you can see all the pages assigned to the Validation Set and the Training Set.

After completing this phase, you can start training the Baseline model by clicking on the “Train” button in the bottom right-hand corner of the window.
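To make the suggested 45/5 split concrete, here is a minimal sketch of how roughly 10% of the segmented pages could be set aside for validation. It is only an illustration of the arithmetic, not how Transkribus selects pages internally; `split_pages`, its parameters and the fixed seed are hypothetical names chosen for this example.

```python
import random


def split_pages(page_numbers, validation_fraction=0.10, seed=0):
    """Return (training, validation) lists of page numbers."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    pages = list(page_numbers)
    # Keep at least one validation page, even for very small sets.
    n_val = max(1, round(len(pages) * validation_fraction))
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    validation = sorted(rng.sample(pages, n_val))
    held_out = set(validation)
    training = [p for p in pages if p not in held_out]
    return training, validation


training, validation = split_pages(range(1, 51))
print(len(training), len(validation))  # prints "45 5"
```

A random selection like this does not guarantee that every layout type ends up in the Validation Set, so it is worth reviewing the selected pages by hand when your documents are heterogeneous.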

Your Output

Training the Baseline model can take from several hours to a couple of days, depending on the number of pages and the training parameters. You can check the training progress by clicking on the “Jobs” button under the “Server” tab.

When the training is finished, the Baseline model will appear in the “Server” tab, under “Model Data”. To see it, please select “layout” instead of “text” as model output type in the second drop-down menu, as shown below.

Figure 3. Layout as model output type

Double-click the Baseline model name to see all its details and its learning curve. The “Learning Curve” graph shows the Baseline model’s accuracy. The x-axis indicates the number of Epochs, i.e. the number of times that the training data is evaluated. The y-axis measures the Loss, i.e. the percentage of pixels classified incorrectly.

The program first trains itself on the Training Set; then, it tests itself on the pages of the Validation Set. For this reason, there are two lines in the graph. The blue line indicates the progress of the training; the red line indicates the progress of the evaluation on the Validation Set. It is important that the two curves do not differ too much. If they diverge, the Training Set most likely differs too much from the Validation Set, and the resulting model is not effective.

Figure 4. Learning Curve

Underneath the graph, the two percentages indicate how the Baseline model performs on the Training Set and the Validation Set in terms of Loss. The Loss on the Validation Set is the most significant value because it indicates how the Baseline model performs on new pages that it has not been trained on. Results with a Loss of 10% or below mean that the Baseline model is effective.
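The reading of the two percentages can be summarised in a small sketch. The 10% threshold comes from the guideline above; the `max_gap` tolerance for "the curves differ too much" is an arbitrary value assumed here for illustration, since this guide does not give a number. The function name and the example figures are hypothetical.

```python
def assess_model(train_loss_pct, val_loss_pct, max_val_loss=10.0, max_gap=5.0):
    """Summarise the Loss percentages shown under the Learning Curve.

    max_val_loss: the 10% guideline for an effective model.
    max_gap: assumed tolerance (in percentage points) between the
             Training Set and Validation Set Loss; not an official value.
    """
    gap = abs(val_loss_pct - train_loss_pct)
    return {
        "effective": val_loss_pct <= max_val_loss,
        "curves_diverge": gap > max_gap,
        "gap": gap,
    }


# Illustrative numbers only, not results from a real training run.
print(assess_model(train_loss_pct=4.0, val_loss_pct=6.5))
```

If `curves_diverge` comes out true, the advice above applies: check whether the Validation Set pages are representative of the Training Set before retraining.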

Applying your Baseline model

To apply the trained Baseline model to your documents, go to the “Tools” tab. Under the “Layout Analysis” top section, leave the “CITlab Advanced” Method selected as it already is and click the “Configure” button. The “Layout Analysis Configuration” window pops up, and under “Neural Net” you can choose the trained Baseline model you want to apply. 

Figure 5. Layout Analysis Configuration

By default, the Neural Net is set to “Preset”. To choose another model, click on the drop-down menu and select the trained model that best suits the layout of your documents.

The settings below let you choose whether to use separators and how to group regions.
Separators are special regions that you can either draw manually using the “Separator” button in the Canvas menu (to find it, click on the “Add other item” button) or that are produced by the “Printed Block Detection” Method. The algorithm can then use the separator information to split baselines at those separators in the result. In detail, the options are:

  • Use separators:
    • Default: do not use separators within a given text region. If no regions are given, use them.
    • Always: use separators within given regions as well.
    • Never: never use separator information.

  • Region grouping:
    • Cluster lines: group lines into separate text regions.
    • Single bounding-box: draw one big text region around all resulting lines.


If you are not sure about these settings, leave them as they are. 
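The three “Use separators” options described above boil down to a simple rule. The sketch below only restates that rule in code to make the “Default” behaviour explicit; it is not the Transkribus implementation, and `use_separators` is a hypothetical name.

```python
def use_separators(mode, regions_given):
    """Return True if separator information would be used.

    mode: "Default", "Always" or "Never" (case-insensitive).
    regions_given: True if the page already contains text regions.
    """
    mode = mode.lower()
    if mode == "always":
        return True
    if mode == "never":
        return False
    if mode == "default":
        # Separators are used only when no text regions are given.
        return not regions_given
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("Default", "Always", "Never"):
    for regions_given in (True, False):
        print(mode, regions_given, use_separators(mode, regions_given))
```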

Finally, click the “OK” button at the bottom of the “Layout Analysis Configuration” window. Your trained model has now been selected. 

Under the “Tools” tab, choose the pages on which to apply the Layout Analysis and click the “Run” button: the Layout Analysis job will now start. You can check its progress by clicking on the “Jobs” button under the “Server” tab. Once the job is finished, reload the page or pages, and the text regions and baselines will appear in the images. Applying the Baseline model to your documents does not use any credits.