Public Models in Transkribus

Last update of this guide: 16/11/2020

This document gives an overview of the models that are publicly available in Transkribus so far. For each model you will find a short description of the training material, the languages the model can be useful for, and who created and trained it. We are working on making more and more models available to Transkribus users, so that they can benefit from the network effect and save work and time.

The models in this document are listed in alphabetical order. The abbreviation “CER” stands for “Character Error Rate”: the percentage of characters that the neural network transcribed incorrectly.
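To make the CER figures in this overview easier to interpret, here is a minimal sketch of how such a rate is commonly computed: the edit distance (insertions, deletions and substitutions) between a reference transcription and the recognised text, divided by the length of the reference. This guide does not describe how Transkribus itself calculates the figure, so the function names and the exact normalisation below are assumptions made for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent, relative to the length of the reference (assumed convention)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


# One missing character in an 11-character word:
print(round(character_error_rate("Transkribus", "Trankribus"), 2))  # 9.09
```

Read this way, a CER of 5% means that roughly one character in twenty needs correction.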

Download the Transkribus Expert Client, or make sure you are using the latest version:

Consult the Transkribus Wiki for further information and other How to Guides:

Transkribus and the technology behind it are made available via the following projects and sites:

Contact

  • The Transkribus Team: info@readcoop.eu

The Transkribus Platform is provided by the European Cooperative READ-COOP SCE.

Until June 2019 Transkribus was financed as part of the Horizon 2020 READ-project under grant agreement No. 674943.

Public models

Danish Handwriting 19th-20th century

Model name: Danish 1870-1950

Creator: Aarhus City Archives

This is a general model for Danish handwriting from the late 19th and the 20th century.

It is based on the RoyalDanishLibrary_20thCentury+ model (described further below in this document) and on parish council minutes from Aarhus City Archives. The CER is 4.28%. The model was created by Jan Mattias Jonsson Agger at Aarhus City Archives, building on work by volunteers at the City Archives and by Jakob K. Meile and the staff at the Royal Danish Library.

Model name: Danish 1870-1950 v3.5

Creator: Aarhus City Archives

A newer iteration of Danish 1870-1950, with added material and further experimentation with base models. It uses material from the Royal Danish Library, Aarhus City Archive, Faxe Archive, Næstved Archive and Gentofte Archive. 1 603 600 words were trained, and the CER on the validation set is 5.91%.

Danish 20th century

Model name: RoyalDanishLibrary_20thCentury+

Creator: Royal Danish Library

This is a general model for Danish cursive handwriting of the 20th century, based on 16 different scribes. It was created by Jakob K. Meile and his colleagues at the Royal Danish Library. About 580 400 words were trained, and the CER goes down to 3.99% on the validation set. More information about the Royal Danish Library and its projects can be found here: https://www.kb.dk/en/

Danish Fraktur 19th century

Model name: Danish Fraktur SB 19th century v.2.35

Creator: Poul Steen

This model is based on more than 500 pages (about 30 900 words) of the Royal Danish Court & State Calendar and a few pages of the Danish High Court Proceedings from the 19th century. The “NZZ Gold Standard” model was used as the base model, so the total number of trained words in German and Danish is about 745 000. The CER goes down to 0.97%.

Danish Handwriting 1881-1913

Model name: Gjentofte 1881-1913 Denmark 1000 epochs

Creator: Gentofte Community Archive Transkribus Team

This model is based on minutes of meetings of the locally elected community council. The minutes were written in turn by the council members during the meetings and are of varying quality, with several corrections and additions inserted between the lines. Some writers use non-standard abbreviations. More than 154 000 words have been trained and the CER is 4.43%.

Devanagari Mixed 19th-20th century

Model name: Devanagari mixed M1

Creator: Heidelberg University Library

This model recognizes the South Asian Devanagari script. It is based on ca. 200 pages of late 19th and early 20th century books by the Indian Naval Kishore Press. The books were mainly printed in lead typesetting, but the training data also contains pages produced lithographically. The model is provided by Heidelberg University Library as part of the FID Asien project. Texts and data of the Naval Kishore Press can be accessed digitally here: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html

Devanagari Nagara 19th century

Model name: Devanagari_nagara_M1

Creator: Heidelberg University Library

This model recognizes the South Asian Devanagari script. It is based on 65 pages of late 19th century books by the Indian Naval Kishore Press, all printed with the same type. The model is provided by Heidelberg University Library as part of the FID Asien project. Texts and data of the Naval Kishore Press can be accessed digitally here: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html

Dutch 18th century

Model name: Dutch Mountains (18th Century)

Creator: Amsterdam City Archives and National Archives of the Netherlands

This model is a combination of the 18th century models from the Amsterdam City Archives (3500+ scans of 15 notarial handwritings) and the National Archives of the Netherlands (3500+ scans of VOC handwritings). The training set includes 1 384 893 words and the CER is 5.67%. With a model as large as this one, it makes sense either to add a suitable language model for your documents (found under “Dictionary” in the recognition configuration in Transkribus) or, if you have already trained a model for your documents, to use it as a base model.

Dutch Gothic Print 16th-18th century

Model name: Dutch_Gothic_Print

Creator: Entangled Histories (National Library Netherlands)

This model is based on printed texts in the Gothic font that was used in the Low Countries during the 16th, 17th and 18th centuries. The sources used for this model are books of ordinances, which contained the norms (‘laws’) of the time. The model is a result of one of the 2019 Researcher-in-Residence positions at the KB National Library of the Netherlands; the project was called ‘Entangled Histories’. About 51 100 words were trained for this model and the CER on the validation set is 1.71%. For more information regarding the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

Dutch Handwriting

Model name: IJsberg

Creator: National Archives Netherlands

This is the second model created by the National Archives of the Netherlands. It is based on the careful transcription of dozens of different handwritings from the 17th, 18th and 19th centuries. It comprises scans of the Incoming Documents of the Dutch East India Company (Overgekomen Brieven en Papieren van de VOC) held by the National Archives of the Netherlands, and of 19th century notarial deeds from the Noord-Hollands Archief and eight other State Archives in the provinces.

Epochs: 1000

Nr. of words: 1 544 683

Nr. of lines: 248 518

Dutch late 17th century

Model name: Dutch Margaretha Turnor 17th Century

Creator: The Utrecht Archives

This is the first model created by the Utrecht Archives. It is based on a thousand letters of Margaretha Turnor, who wrote to her husband during the late 17th century. She managed the castle of Amerongen, while her husband worked abroad as a diplomat for the Dutch Republic. Her letters provide an insight into family life in the Dutch Republic as well as the political situation in the country. About 36 000 words were trained for this model and the CER on the validation set is 1.83%.

Dutch Notarial 18th century

Model name: Dutch Notarial Model 18th Century

Creator: City Archives of Amsterdam

This is the first general 18th century model created by the City Archives of Amsterdam. It is based on thousands of scans from a total of 15 different notaries who worked in Amsterdam during the 18th century. For every notary except Van Hoorn and Van Esterwege, 10 scans were set aside for validation (2671 scans for training, 130 for validation). The number of trained words is about 623 000 and the CER is 5.27% on the validation set.

Dutch Poetry 1603-1636

Model name: Dutch poetry 1603-1636

Creator: Bram Caers

The model was trained on an extensive manuscript of early modern poetry, in separate hands (of which one is the most important) using different types of writing and special lay-outs (e.g. chronograms).

The author of the manuscript is a rhetorician (vernacular poet) from Mechelen, present-day Belgium, active in the first decades of the seventeenth century. The training was based on a word count of over 51 000 words (more than 200 folios of text) and the CER is 4.78%.

Dutch Romantype Print 16th-19th century

Model name: Dutch_Romantype_Print

Creator: Entangled Histories project (National Library of the Netherlands)

This model is based on printed texts in the Roman-type fonts that were used in the Low Countries during the late 16th, 17th, 18th and 19th centuries. Some pages may contain (properly transcribed) Gothic font, and French and Latin texts have been included so that words in those languages are transcribed (more or less) properly when they occur. The sources used for this model are books of ordinances, which contained the norms (‘laws’) of the time. About 88 000 words were trained and the CER on the validation set is 1.17%. The model is a result of one of the 2019 Researcher-in-Residence positions at the KB National Library of the Netherlands; the project was called ‘Entangled Histories’. For more information regarding the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

English Handwriting 18th-19th century

Model name: English Writing M1

Creator: University College London – Bentham project

This model was trained on over 50,000 words from papers written by the English philosopher Jeremy Bentham (1748–1832) and his secretaries. In the best cases, it generates an output where around 95 per cent of characters on similar pages from the Bentham collection are transcribed correctly by the programme. More info about the Bentham project can be found here: http://prhlt-kws.prhlt.upv.es/bentham/

Estonian Court Records 19th century

Model name: Estonian Court Records 19thC

Creator: Estonian National Archives

This model is based on Uue-Põltsamaa Municipal Court Records (est. Vallakohus, ger. Gemeindegericht) from the years 1852-1866. It has been trained with 50 000 words and the CER for the latest model is 3.55% on the validation set. Digitised court records and their transcriptions are easily accessible on the crowdsourcing platform http://www.ra.ee/vallakohtud/, made available by the Estonian National Archives.

Finnish 19th century

Model name: NAF Court Records M10

Creator: National Archives Finland

This model is based on Renovated District Court Records (Fi: Kihlakunnanoikeuksien renovoidut tuomiokirjat, Swe: Häradsrätternas renoverade domböcker) from the years 1809-1870. The model's training set consists of 2841 double pages and the validation set of 100 double pages. Since there were many scribes (dozens), the model combines many different handwritings. The Ground Truth material was picked from 58 different court districts across Finland. Most of the Ground Truth is in Swedish, but there is also some Finnish, since some of the court districts started to write their court records in Finnish from the 1850s onwards. Renovated District Court Records are split into two series: Main Records and Notification Records. This model includes mostly Notification Records; nevertheless, it also works fine with Main Records. The model was created as part of the READ project at the National Archives of Finland (NAF). It has been used to transcribe the Notification Records from the years 1809-1870 (all districts). As a result, a search interface has been implemented where you can perform full-text searches and browse automatically transcribed documents. The search interface and more information can be found at: www.transkribus.eu/r/kws

French 18th Century Print

Model name: French_18thC_Print

Creator: Entangled Histories project (National Library Netherlands)

This model is based on printed French texts (Roman-type font) from Flanders (Low Countries) during the 18th century. The sources used for this model are books of ordinances, which contained the norms (‘laws’) of the time. The model is a result of one of the 2019 Researcher-in-Residence positions at the KB National Library of the Netherlands; the project was called ‘Entangled Histories’. The training set counts about 38 500 words and the CER on the validation set is 0.65%. The books used for this specific model were provided by the Bodleian Library, Oxford (RECUEIL DES ÉDITS, DÉCLARATIONS, LETTRES-PATENTES, &c. ENREGISTRÉS AU PARLEMENT DE FLANDRES). For more information regarding the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

French 17th Century Print

Model name: Parallèle des Anciens et des Modernes M2

Creator: Project: Un choc de modernité : Anciens et Modernes au tournant des XVIIe et XVIIIe siècles

This model is based on a French text printed at the end of the 17th century: Parallèle des Anciens et des Modernes by Charles Perrault (1688-1697, publisher: Jean-Baptiste Coignard).

It was trained for a digital edition as part of the project “Un choc de modernité : Anciens et Modernes au tournant des XVIIe et XVIIIe siècles” (IHRIM UMR 5317). More than 65 000 words have been trained and the CER is 2.70%.

French and Latin Chancery documents

Model name: HIMANIS Chancery M1+

Creator: HIMANIS project

As part of the HIMANIS project (led by D. Stutzmann, C. Kermorvant & E. Vidal), the text edition provided by P. Guérin and encoded in TEI by the Ecole nationale des Chartes (http://corpus.enc.sorbonne.fr/actesroyauxdupoitou/) and the edition by J. Viard were aligned at line level and used to train this comprehensive model for French and Latin Chancery documents. The training set includes about 666 000 words and the CER goes down to 5.33% on the validation set.

More information on the project can be found at: http://himanis.huma-num.fr/himanis/

French Livre Rouge

Model name: LaMOP-Livre_Rouge_1

Creator: Paris University

This model is based on the book “Y//3 Livre Rouge, Châtelet de Paris (11..-1790)” (Archives Nationales de France) and the model was released by Hugo Regazzi (Université Paris 1/LaMOP), Pierre Brochard (CNRS/LaMOP) and Julie Claustre (Université Paris 1/LaMOP). All the data and pictures can be found at: https://gitlab.huma-num.fr/lamop/htr/blob/master/Livre_Rouge-Archives_Nationales/README.txt

20 000 words have been trained for this model and the error rate is 8%.

German Fraktur 19th-20th century

Model name: ONB_Newseye_GT_M1+

Creator: Austrian National Library and NewsEye project

Thanks to the Library Labs of the Austrian National Library and the NewsEye project, we are happy to announce the release of a free model that is capable of reading German Fraktur documents, especially from the 19th and 20th centuries, in convincing quality, outperforming most standard OCR engines. The model is based on training data from the ANNO collection of the Austrian National Library and comprises 442 141 words. It shows a CER of 1.55% on the training set and 1.65% on the test set without any dictionary support. Note: the model is trained on German-language documents. It will provide less convincing results for other languages, such as Swedish or Finnish Fraktur. However, models for these languages are also in preparation and may be released in the coming months. The Fraktur model is available to every registered user in Transkribus and is called ONB_Newseye_GT_M1+. Have fun!

German Fraktur 18th-20th century

Model name: NZZ Gold Standard M1+

Creator: University of Zurich

The model is based on 167 title pages from the Neue Zürcher Zeitung (NZZ) covering the years 1780 to 1940. About 273 400 words were trained for this model and the CER on the validation set is 0.45% (every 10th page was taken as the validation set). The model is provided by the Computational Linguistics Group (Simon Clematide, Philip Ströbel) of the University of Zurich within the framework of the Impresso project: https://impresso-project.ch/
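The “every 10th page” validation split mentioned for this model can be pictured with a short sketch. It assumes that the pages are kept in order and that a page goes to the validation set whenever its 1-based position is a multiple of ten; the guide does not say which page of each ten was actually chosen, and the function name is hypothetical.

```python
def split_every_nth(pages, n=10):
    """Send every n-th page (1-based position) to validation, the rest to training."""
    training, validation = [], []
    for position, page in enumerate(pages, start=1):
        if position % n == 0:
            validation.append(page)
        else:
            training.append(page)
    return training, validation


# Placeholder identifiers standing in for the 167 title pages.
pages = [f"page_{number:03d}" for number in range(1, 168)]
training, validation = split_every_nth(pages)
print(len(training), len(validation))
```

Holding out pages at regular intervals spreads the validation set across the whole period covered by the material instead of concentrating it at one end.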

German Kurrent and Sütterlin 17th-20th century

Model name: German Kurrent M1+

Creator: Transkribus Team, University of Innsbruck

This is a global model, which recognizes German Kurrent, Sütterlin and Fraktur scripts from the 17th to the 20th century. The training data set includes nearly 500 000 words, and the CER on the validation set is 5.29%.

Italian Administrative Hands 1550-1700

Model name: Italian Administrative Hands, 1550-1700

The Italian Administrative Hands model features a variety of Italian-language documents from state archives in Milan, Venice, Florence, Pisa, and Genoa. The training set represents a spectrum of humanistic, italic and cursive hands characteristic of administrative records, employed by secretaries and newswriters. The model has been trained to perform well with a mix of quantitative and qualitative information as well as many common proper nouns for the period, such as locations in Europe and contemporary rulers. Administrative documents often employ common superscript abbreviations, which the accompanying documentation treats in greater detail. The model can also be used with Latin, Spanish and French documents to some extent. The model represents a collaboration between Jake Dyble (Exeter/Pisa), Antonio Iodice (Exeter/Genoa), Sara Mansutti (Cork), and Rachel Midura (Virginia Tech). Documentation at https://emdigit.org/tool/2020/07/21/italian-administrative-hands.html.

The model will be used by the EURONews Project and the Medici Archive Project (https://www.medici.org/euronews-project/) and AveTransRisk (https://humanities.exeter.ac.uk/history/research/centres/maritime/research/avetransrisk/team/).

Latin (Greek, German, English, Italian) 16th-18th century

Model name: Noscemus GM v1

Creator: Noscemus project (University of Innsbruck)

The Noscemus general model is able to read printed Latin text, especially from the 16th, 17th and 18th centuries. The model was released by Stefan Zathammer and is based on training data from the Digital Sourcebook of the Noscemus project. It is tailored towards transcribing (Neo-)Latin texts set in Antiqua-based typefaces, but it is also able, to a certain degree, to handle Greek words and words set in (German) Fraktur. The model comprises 170 658 words and 27 296 lines. It shows a CER of 0.87% on the training set and 0.92% on the validation set.

Latin and Dutch

Model name: Medieval Protocolbook ‘s-Hertogenbosch by Townclerck Petrus de Os sr., 1497-1542

Creator: Geertrui van Synghel (Huygens ING)

The Huygens Institute for History of the Netherlands, an institute of the Royal Netherlands Academy of Arts and Sciences, is unlocking the Aldermen Records of ‘s-Hertogenbosch 1366-1811. Because the protocol books from the period 1500-1810 can be considered virtually inaccessible, we started to unlock these registers with Transkribus. This model is the first to recognise late medieval handwriting and is based on the registers of the aldermen's court of ‘s-Hertogenbosch (Brabant), written by the town clerk magister Petrus de Os senior. His handwriting can be found not only in several town series, such as the protocol books from 1497 until 1542, but also in charters, cartularies and the town chronicle of ‘s-Hertogenbosch. The model is based on a protocol book containing the minutes of the charters on voluntary jurisdiction of the city. Most of the minutes are in Latin; only a few are in Dutch. The training set is based on 105 pages with 46 638 words, written in late medieval Gothic script.

The Character Error Rate of this model is 4.11% on the validation set.

More information on the project can be found at:

https://www.huygens.knaw.nl/projecten/ontsluiting-van-de-schepenregisters-van-s-hertogenbosch-1366-1811

(Neo)-Latin

Model name: NeoLatin_Ravenstein_1643-1772

This model is based on the transcription of the “Litterae Annuae Parochiae Ravensteijn SJ ab Anno 1643 ad Annum 1772”.

The annual letters were kept at the Archivum Neerlandicum Societatis Iesu (Berchmannianum, Nijmegen). They are now at the Catholic Documentation Center (Katholiek Documentatie Centrum, KADOC) in Leuven, Belgium (inventory number 15.606).

Tom Gribnau photographed the manuscript; the transcriptions were made by Pim Boer and Leo Nellissen.

About 64 000 words had been trained and the CER is 3.58%.

The model accompanies the publication: Tom Gribnau, Pim Boer, Leo Nellissen, Paul Begheyn SJ & Charles Caspers, Martiaal en theatraal. De jezuïeten in Ravenstein (1643-1772). Inleiding en vertaling van de jaarbrieven. Nijmegen: Uitgeverij Valkhof Pers, 2019; ISBN 978 90 5625 514 5.

More information can be found at: https://jaarbrieven.blogspot.com/ and https://www.stilus.nl/litterae/

The text-to-image (T2I) matching that turned the transcripts into training data for the Transkribus model was done by Dr. C.A. Romein.

Russian Church Slavonic

Model name:

  • Combined_Full_VKS_2
  • VMC_Test_4+

Creator: Achim Rabus (University of Freiburg)

Prof. Achim Rabus from the University of Freiburg has released two specialized models which are able to read Russian Church Slavonic. The first model is called VMC_Test_4+. Its training data consist of parts of the Russian Church Slavonic Great Reading Menology (16th century). The model is tailored towards transcribing Cyrillic semi-uncial script from the 16th century. The Character Error Rate is 3.72% for the training data; for the validation set, both 3.92% and 3.82% are reported.

The second model is called Combined_Full_VKS_2. Its training data consist of parts of the Russian Church Slavonic Great Reading Menology (16th century), the Old Church Slavonic Codex Suprasliensis (11th century), and the 11th century manuscript of the Catecheses of Cyril of Jerusalem. This is a generic model suitable for transcribing a variety of Old Cyrillic script styles, including uncial and semi-uncial. The Character Error Rate is 4.42% for the training data and 3.92% for the validation set.

Achim has written a detailed report about his usage of Transkribus. Though it deals with Church Slavonic, it is definitely interesting for other users as well. Thanks a lot!

Swedish 17th century

The model “Jaemtlands_domsagasM1+” was trained on 5946 pages (ca. 491 300 words) of court books from Jämtland county in Sweden (Jämtlands läns domsaga) from the years 1647-1688. The books are the originals, written on location by different local writers (not the copies that were written later and sent to the royal court in Stockholm, the “renoverade domböcker”). The texts are written in Swedish. The transcripts used are not 100% true to the original spelling: some abbreviations are spelled out (for example r:dr = riksdaler), and the transcripts also contain a few remarks in brackets. The CER is 6.32%.

Credits

We would like to thank our users who have made it possible to publicise these models.