HTR | API - READ-COOP

Training

To train a new HTR model with the new API (for RNN HTR), first create a configuration XML.
Besides the training parameters (the example below shows their default values), the following fields are mandatory:

a model name
a description
the language
the collection ID where the input documents can be found and where the resulting model will be linked

The training input is described in the trainList section of the XML. It consists of train elements, each of which includes:

the document ID
a list of pages where each page includes
- the page-ID
- the ID of the transcript version that should be used for training

Optionally, a test set can be specified analogously in the testList element.

The training descriptor should then look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<citLabHtrTrainConfig>
    <modelName>Test Model</modelName>
    <description>A description</description>
    <language>German</language>
    <colId>2</colId>
    <numEpochs>200</numEpochs>
    <learningRate>2e-3</learningRate>
    <noise>both</noise>
    <trainSizePerEpoch>1000</trainSizePerEpoch>
    <trainList>
        <train>
            <docId>1</docId>
            <pageList>
                <pages>
                    <pageId>1</pageId>
                    <tsId>1</tsId>
                </pages>
                <pages>
                    <pageId>2</pageId>
                    <tsId>2</tsId>
                </pages>
            </pageList>
        </train>
        <train>
            <docId>2</docId>
            <pageList>
                <pages>
                    <pageId>3</pageId>
                    <tsId>3</tsId>
                </pages>
                <pages>
                    <pageId>4</pageId>
                    <tsId>4</tsId>
                </pages>
            </pageList>
        </train>
    </trainList>
    <testList/>
</citLabHtrTrainConfig>
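
For larger training sets, writing the descriptor by hand becomes tedious. The following is a minimal sketch of generating it with Python's standard library. The function name build_train_config and its argument layout are illustrative assumptions, not part of the API. The output also omits the XML declaration shown above, and whether the server requires it is not confirmed here.

```python
import xml.etree.ElementTree as ET


def build_train_config(model_name, description, language, col_id, train_docs,
                       num_epochs=200, learning_rate="2e-3", noise="both",
                       train_size_per_epoch=1000):
    """Build a citLabHtrTrainConfig document as a string.

    train_docs maps a document ID to a list of (pageId, tsId) pairs.
    The keyword defaults mirror the default values in the example above.
    """
    root = ET.Element("citLabHtrTrainConfig")
    for tag, value in [
        ("modelName", model_name),
        ("description", description),
        ("language", language),
        ("colId", col_id),
        ("numEpochs", num_epochs),
        ("learningRate", learning_rate),
        ("noise", noise),
        ("trainSizePerEpoch", train_size_per_epoch),
    ]:
        ET.SubElement(root, tag).text = str(value)

    train_list = ET.SubElement(root, "trainList")
    for doc_id, pages in train_docs.items():
        train = ET.SubElement(train_list, "train")
        ET.SubElement(train, "docId").text = str(doc_id)
        page_list = ET.SubElement(train, "pageList")
        for page_id, ts_id in pages:
            page = ET.SubElement(page_list, "pages")
            ET.SubElement(page, "pageId").text = str(page_id)
            ET.SubElement(page, "tsId").text = str(ts_id)

    ET.SubElement(root, "testList")  # left empty: no test set
    return ET.tostring(root, encoding="unicode")


# Reproduces the structure of the example descriptor above.
config_xml = build_train_config(
    "Test Model", "A description", "German", 2,
    {1: [(1, 1), (2, 2)], 2: [(3, 3), (4, 4)]},
)
```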

This XML is then sent via POST to

https://transkribus.eu/TrpServer/rest/recognition/htrTrainingCITlab

and the call returns the job-ID of the training.
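
As a rough sketch, the POST request could be prepared with Python's urllib as shown below. The Content-Type header value is an assumption, and authentication is not covered in this section, so it is left out. The request is only constructed here, not sent.

```python
import urllib.request

TRAIN_URL = "https://transkribus.eu/TrpServer/rest/recognition/htrTrainingCITlab"


def make_training_request(config_xml):
    """Prepare (but do not send) the POST request carrying the training descriptor."""
    return urllib.request.Request(
        TRAIN_URL,
        data=config_xml.encode("utf-8"),
        # Assumption: the server expects the descriptor as application/xml.
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


request = make_training_request("<citLabHtrTrainConfig/>")

# Sending it would look roughly like this. That the job ID arrives as the
# plain response body is an assumption:
# with urllib.request.urlopen(request) as response:
#     job_id = response.read().decode("utf-8")
```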

Note that models are now linked to the collection in which the training was started (cf. the colId element in the training descriptor XML).

Manage HTR models

Listing available models

Listing models can be done with a GET request to:
https://transkribus.eu/TrpServer/rest/recognition/{collection-ID}/list?prov={techProvider}

The call includes:

Path parameter: collection-ID
Query parameter: prov, the tech provider. At the moment, only “CITlab” is allowed as a value.

In the result, each model is described by XML such as the following:

...
<trpHtr>
        <htrId>22</htrId>
        <name>Test Model</name>
        <description>A description</description>
        <provider>CITlab</provider>
        <created>
            <nanos>338000000</nanos>
        </created>
        <gtDocId>1614</gtDocId>
        <testGtDocId>1615</testGtDocId>
        <language>German</language>
        <trainJobId>3160</trainJobId>
        <cerString>1,000000</cerString>
        <charList> =1
,=2
.=3
...
</charList>
</trpHtr>
...

Note the htrId element: its value is needed when applying the model.

The cerString contains the character error rate (CER) values determined during training against the train set or the (optional) test set.

The charList contains the character-to-channel mapping and lists the characters known to the model.
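
To illustrate how such a result might be consumed, here is a small sketch that extracts the htrId and the character map from trpHtr elements. The helper names are hypothetical. The sample is a shortened trpHtr element used as the document root, because the enclosing elements of the real response are elided above. The cerString is not interpreted, since its exact format is not specified here.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<trpHtr>
    <htrId>22</htrId>
    <name>Test Model</name>
    <provider>CITlab</provider>
    <language>German</language>
    <charList> =1
,=2
.=3
</charList>
</trpHtr>"""


def parse_char_list(text):
    """Map each character to its channel number from 'char=channel' lines."""
    mapping = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the last '=' so that '=' itself can appear as a character.
        # Lines are not stripped because a space is a valid character.
        char, channel = line.rsplit("=", 1)
        mapping[char] = int(channel)
    return mapping


def parse_models(xml_text):
    """Collect ID, name and character map for every trpHtr element found."""
    root = ET.fromstring(xml_text)
    return [
        {
            "htrId": int(model.findtext("htrId")),
            "name": model.findtext("name"),
            "chars": parse_char_list(model.findtext("charList", default="")),
        }
        for model in root.iter("trpHtr")
    ]


models = parse_models(SAMPLE)
```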

Adding models to other collections

To make a trained model available in another collection, send a POST request to:

https://transkribus.eu/TrpServer/rest/recognition/{collection-ID}/{htr-ID}/add?collId={destination_collection-ID}

Removing models from collections

To remove a model from a collection, send a DELETE request to:

https://transkribus.eu/TrpServer/rest/recognition/{collection-ID}/{htr-ID}/remove

Note that a model cannot be removed from all collections; it must remain linked to at least one.
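
The two URL patterns for adding and removing models can be filled in with small helpers such as the following. This is only a sketch: the function names are illustrative, and the example IDs are taken from the samples on this page.

```python
from urllib.parse import urlencode

BASE_URL = "https://transkribus.eu/TrpServer/rest/recognition"


def add_model_url(col_id, htr_id, dest_col_id):
    """URL for the POST request that links a model to another collection."""
    query = urlencode({"collId": dest_col_id})
    return f"{BASE_URL}/{col_id}/{htr_id}/add?{query}"


def remove_model_url(col_id, htr_id):
    """URL for the DELETE request that unlinks a model from a collection."""
    return f"{BASE_URL}/{col_id}/{htr_id}/remove"


# Link model 22 from collection 2 to collection 5, then unlink it from collection 2.
add_url = add_model_url(2, 22, 5)
remove_url = remove_model_url(2, 22)
```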

Dictionaries

When applying an HTR model, you can provide a dictionary filename. Available dictionaries can be listed via a GET request to:

https://transkribus.eu/TrpServer/rest/recognition/dicts

Recognition

To apply an HTR model, send a POST request to:

https://transkribus.eu/TrpServer/rest/recognition/{collection-ID}/{htr-ID}/htrCITlab?id={doc-ID}&pages={page-string}&dict={dictionary_filename}

The dict parameter is optional.
Note that the id and pages parameters may be replaced by an object in the request body, which allows a more detailed selection of the input data:

{
   "docId" : 1543,
   "pageList" : {
      "pages" : [ {
         "pageId" : 1234,
         "regionIds" : [ "the_xml_id_of_a_text_region" ]
      }, {
         "pageId" : 12345,
         "tsId" : 1234567
      } ]
   }
}

Equivalent XML representation:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<documentSelectionDescriptor>
    <docId>1543</docId>
    <pageList>
        <pages>
            <pageId>1234</pageId>
            <regionIds>the_xml_id_of_a_text_region</regionIds>
        </pages>
        <pages>
            <pageId>12345</pageId>
            <tsId>1234567</tsId>
        </pages>
    </pageList>
</documentSelectionDescriptor>
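
Both ways of selecting the input can be sketched in Python as follows. The helper names are hypothetical, and the page-string value "1-3" is an assumed format, since the syntax of {page-string} is not described here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://transkribus.eu/TrpServer/rest/recognition"


def recognition_url(col_id, htr_id, doc_id=None, pages=None, dict_name=None):
    """Build the htrCITlab URL, leaving out every query parameter that is None."""
    params = {"id": doc_id, "pages": pages, "dict": dict_name}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/{col_id}/{htr_id}/htrCITlab"
    return f"{url}?{query}" if query else url


def selection_descriptor(doc_id, pages):
    """Build the JSON request body.

    Each entry in pages is a dict with 'pageId' and optionally 'tsId'
    or 'regionIds', as in the JSON example above.
    """
    return json.dumps({"docId": doc_id, "pageList": {"pages": pages}})


# Variant 1: select pages via query parameters ("1-3" is an assumed format).
url = recognition_url(2, 22, doc_id=1543, pages="1-3")

# Variant 2: select pages, regions and transcript versions via the request body.
body_url = recognition_url(2, 22)
body = selection_descriptor(1543, [
    {"pageId": 1234, "regionIds": ["the_xml_id_of_a_text_region"]},
    {"pageId": 12345, "tsId": 1234567},
])
```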
