Last update of this guide 04/09/2020
This guide shows you how to enrich your documents with structural tags such as “paragraph”, “heading”, “caption” or “footer”. This mark-up defines the structure of your documents. It is now also possible to train Transkribus to recognise the structure of a document with the P2PaLA training.
If you are looking for information on word- and phrase-based tags such as persons and places, please have a look at the guides How to Enrich Transcribed Documents with Mark-up and Transkribus Transcription Conventions.
Download the Transkribus Expert Client, or make sure you are using the latest version:
Transkribus and the technology behind it are made available via the following projects and sites:
The Transkribus Platform is provided by the European Cooperative READ-COOP SCE.
Until June 2019 Transkribus was financed as part of the Horizon 2020 READ-project under grant agreement No. 674943.
With the structural tagging feature, you can mark up the structure of your documents.
Moreover, you can train models to recognise the structure of your documents automatically. Adding structural tags creates the training data for this process.
There is no need to tag every feature of your documents – focus on marking up the sections that are of interest to you.
The structural tagging interface in Transkribus enables you to
Figure 1 Where to find the structural tagging options
Figure 2 Customize button
Figure 3 Create a new tag category
Figure 4 Customize colours
Figure 5 Choose colour
Figure 6 “Item visibility” button
Figure 7 Assigning tags with the green white cross button
Figure 8 Assigning tags by right clicking
Figure 9 Linking shapes
Figure 10 Choosing a page type
Figure 11 “Layout” section
Figure 12 Changing the structure type via the “Layout” section
Figure 13 How to delete structural tags
Figure 14 Show designations of the structural tags in the image
Figure 15 Draw struct type option
Figure 16 Show default colors
Figure 17 Currently marked structural tag
The structural training feature produces a model that can recognise the structure of your documents. As with Handwritten Text Recognition, its performance depends on the quality of the training data. About 50 tagged examples of every structure type you want to train should be enough to start training, so 50-100 pages of training material should be suitable for creating a useful model. You can start training with less material, but the model will then perform less well.
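The rule of thumb above (about 50 examples per structure type) can be checked with a short script before training. The following is only a minimal sketch: it assumes you already have the structure type of every tagged region as a plain list of strings, and the function name `coverage_report` and the sample data are hypothetical, not part of Transkribus.

```python
from collections import Counter

# Rule of thumb from this guide: about 50 examples per structure type.
MIN_EXAMPLES = 50


def coverage_report(tagged_types, wanted_types, minimum=MIN_EXAMPLES):
    """Return {structure type: (number of examples, enough to start training?)}."""
    counts = Counter(tagged_types)
    return {name: (counts[name], counts[name] >= minimum) for name in wanted_types}


# Hypothetical sample: one list entry per tagged region.
tags = ["heading"] * 60 + ["paragraph"] * 120 + ["footer"] * 12
report = coverage_report(tags, ["heading", "paragraph", "footer"])

for name, (count, ready) in report.items():
    status = "ok" if ready else "below the rule of thumb"
    print(f"{name}: {count} examples, {status}")
```

A structure type that falls below the threshold can still be trained, but the guide suggests that the resulting model will perform less well for it.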
P2PaLA is not included in the standard Expert Client. If you would like to try it out, please send us a short email at firstname.lastname@example.org so that we can enable it for your account.
Once you have finished tagging, you can start training. Open the “Tools” tab and click the “P2PaLA” button in the “Other tools” section. The following window will open:
Figure 18 P2PaLA training settings
Relevant settings for the training here:
If you click on “Train” the training parameters will open up:
Figure 19 P2PaLA training parameters
In the upper sections, add some details about the model.
“Structures”: here you can add the structure types that should be trained. Structure names are case-sensitive and must not contain spaces. We recommend using only lower case, with hyphens (-) and underscores (_) as the only special characters.
“Merged Structures”: used to treat certain structure types the same as others during training (e.g. ‘footnote-continued’ or ‘footer’ like ‘footnote’). The expected input is a list of the structure types, separated by a colon from the structure types to merge.
“Training mode”: here you can decide whether to train regions only, lines only or both. Please be aware that baseline training does not mean that structure types are trained on a line basis; it is about recognising the baselines.
“Edit status”: if you would like to use the latest version, you do not need to choose anything; otherwise you can choose which edit status of the document should be used for training.
“Training set”: this is where you choose the training data.
“Analyze structure types”: gives an overview of the number and types of structural tags in the chosen document.
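The naming recommendations for the “Structures” field can be checked before you type the names into the dialog. This is a minimal sketch of such a check, not something Transkribus provides: the function `name_problems` is hypothetical, and it treats anything other than ASCII letters, digits, hyphens and underscores as a special character, which is an assumption.

```python
import re


def name_problems(name):
    """List the ways a structure name departs from the guide's recommendations."""
    problems = []
    if any(ch.isspace() for ch in name):
        problems.append("contains a space")
    if name != name.lower():
        problems.append("contains upper-case letters")
    # Assumption: only ASCII letters, digits, '-' and '_' count as safe characters.
    if re.search(r"[^A-Za-z0-9\s_-]", name):
        problems.append("contains special characters other than - and _")
    return problems


for candidate in ["footnote-continued", "Page Number", "marginalia!"]:
    issues = name_problems(candidate)
    print(candidate, "->", "ok" if not issues else "; ".join(issues))
```

Because structure names are case-sensitive, a check like this is mainly useful for catching a name that differs from the tag used in the documents only by capitalisation or a stray space.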
To start the training, click “Train”.
After the training process is finished, the model is available for your collection and can be shared with other collections too.
To apply a structure model to a document so that its structure types are recognised, open the “P2PaLA” feature in the “Tools” tab.
Figure 20 Applying a P2PaLA model
Choose which pages should be recognised.
After choosing one of the options, the available models will appear next to “Select a model for recognition”. Choose the model you would like to use. Clicking “Models” gives you an overview of all the models.
“Rectify regions”: all regions will be simplified to the bounding box of the actual recognised shape.
“Min area”: shapes with an area smaller than this fraction of the image width will be removed after the recognition. Use this parameter to remove small “garbage” regions. The default value is 0.01.
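To illustrate what these two options do geometrically, here is a minimal sketch. It is not the P2PaLA implementation. Regions are assumed to be lists of (x, y) points, all function names are hypothetical, and the threshold is interpreted as a fraction of the total image area, which is only one possible reading of the description above (the guide itself relates it to the image width).

```python
def rectify(points):
    """Replace a polygon by the four corners of its bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def polygon_area(points):
    """Area of a simple polygon (shoelace formula)."""
    total = 0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def drop_small_regions(regions, image_width, image_height, min_area=0.01):
    """Keep regions whose share of the image area is at least min_area.

    Assumption: the threshold is normalised by the image area.
    """
    image_area = image_width * image_height
    return [r for r in regions if polygon_area(r) / image_area >= min_area]


triangle = [(100, 100), (500, 120), (300, 400)]
speck = [(10, 10), (20, 10), (20, 20), (10, 20)]

print(rectify(triangle))
print(len(drop_small_regions([triangle, speck], 1000, 1000)))
```

In this example the small square covers far less than 1 % of a 1000 x 1000 pixel image and is dropped, while the triangle is kept and would be turned into a rectangle by the rectify step.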
To start the recognition, click “Run”.
We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.