Working with (Early Modern) Dutch script? Join a new Transkribus working group!

November 20, 2018
HTR models, News, Transkribus, Working groups

by Annemieke Romein, University of Ghent

(Dutch language version below)

Throughout the Early Modern era much was written in Dutch, not just in the Low Countries but also in former colonies, among certain religious groups in North America, and in Hanseatic cities. An Early Modern Gothic script was widely used, though it varied with the context, purpose, and type of text. First experiments with documents from Belgium (Ghent, in Flanders) have demonstrated that Handwritten Text Recognition (HTR) models can recognise Dutch with a good level of accuracy.

The next step is to combine different examples of Early Modern Dutch texts in order to build and improve generic models for recognising various types of documents. Dr. Annemieke Romein (Erasmus University Rotterdam / Ghent University), Dr. Jetze Touber, and Koen Verstraeten have initiated the ‘Early Modern Dutch’ working group, in which all Transkribus users can work together to improve the recognition of the Dutch language. Scroll down to find out more about the working group’s aims and how to join.

The process of combining training data from different Early Modern Dutch documents has already started at Ghent University. Various researchers at the Institute for Early Modern History and the Ghent Center for Digital Humanities are bringing materials together in order to train an HTR model. However, within a multidisciplinary group such as this, we quickly realised that we have to deal with various types of texts as well as different periods within the early modern era. Sixteenth-century handwriting differs from that of a century later, even where the content changed little; likewise, texts with a political-institutional or legal background differ tremendously from diaries, letters and academic texts. Nonetheless, each of these types of text can help train recognition of the text as well as of the handwriting. How smart computers can be made in such a context is yet to be discovered.

To streamline this endeavour, three Ghent-based historians are working together and will coordinate and train different language models, hopefully leading to one final model for the Dutch language (depending on the amount of training material).

Dr. Annemieke Romein	16th, 17th, 18th century	Political-institutional / legal texts (incl. requests, letters of statesmen).
Dr. Jetze Touber	16th, 17th, 18th century	Cultural texts (diaries, letters); scholarly, academic and religious texts.
Koen Verstraeten	19th century	Cultural texts (diaries, letters); scholarly and academic texts.

The ‘Early Modern Dutch’ working group is looking for further examples of documents written in Dutch from the 16th, 17th and 18th centuries. You can help us add to the collection: all that is needed are images (preferably around 300 dpi) and transcriptions.

You can:

share existing training data that you have already prepared in Transkribus (duplicate it to the folder we will invite you to);
prepare new images and transcripts in Transkribus in the ‘Early Modern Dutch’ collection;
send over files containing images and transcripts, which can be matched automatically and converted into training data using the Text2image tool.

Please indicate what type of textual material you are sharing, so that we have an overview and can start training models as soon as possible.

To join the working group and get access to the ‘Early Modern Dutch’ collection in Transkribus, contact the group at: TranskribusEMDutch@gmail.com.

The ‘Early Modern Dutch’ working group aims to demonstrate that training-based algorithms like Handwritten Text Recognition need significant input from many stakeholders: they can only be improved by cooperation and sharing!

————————————————————————————————————

Do you work with Early Modern Dutch texts (± 1500-1900)? Join the Transkribus working group!

Many texts have been written in the Dutch language, not only in the Low Countries themselves but also in former colonies, among religious groups in North America, and in the Hanseatic cities. The early modern Gothic script was widely used, although variations can be found depending on the context, purpose and type of text. First experiments with documents show that the Dutch language can be recognised by automatic text recognition (OCR) models, and that good results can be achieved through training.

The next step is to combine different examples of Dutch texts, in an attempt to build general language models that can analyse and recognise different types of documents. Dr. Annemieke Romein (Erasmus University Rotterdam / Ghent University, IEMH), Dr. Jetze Touber (UGent, IEMH), and Koen Verstraeten (UGent archive) are taking the initiative to start an ‘Early Modern Dutch’ working group. The focus is on the period 1500-1900, but material from other periods is welcome as well. In this group, Transkribus users can work together to improve the recognition of the Dutch language in texts. Please read on to find out more about joining this group and its aims.

The process of combining training material from different early modern texts has been under way for some time. At UGent, various researchers from the Institute for Early Modern History and the Ghent Center for Digital Humanities are uploading their materials to Transkribus. Text2Image is used to link existing transcriptions to photos and to train the computers. This is currently in full swing. We quickly realised that there are different types of texts, as well as different periods over which gradual changes occur. All kinds of texts can be trained in Transkribus, but this requires a great deal of training material, more than a single researcher can collect. Hence this call for participation.

Transkribus is (for now) a free program that can be used to train servers in Innsbruck to recognise handwriting (and also print) by means of “Handwritten Text Recognition” (HTR). At least 75 pages of transcribed text are needed to recognise a hand well, but that figure applies to a single writer. The more material is uploaded, the more universal and the more broadly applicable the model becomes. Archives, libraries and heritage institutions, but certainly individual researchers too, are urgently requested to share their material covering the 16th through 19th centuries.

Three Ghent-based researchers are involved in coordinating the Dutch-language model and will run tests in order to train as accurate a model (or models) as possible. These researchers focus, respectively, on:

Dr. Annemieke Romein	16th, 17th, 18th, 19th century	Political-institutional / legal texts (incl. requests, letters of statesmen).
Dr. Jetze Touber	16th, 17th, 18th century	Cultural texts (diaries, letters); scholarly, academic and religious texts.
Koen Verstraeten	19th century	Cultural texts (diaries, letters); scholarly, academic and religious texts.

If you would like to make material available and take part in this working group, please contact us at TranskribusEMDutch@gmail.com. It is helpful if you indicate what type of texts are involved, so that we have an idea of which models we can use them in.

Frequently asked questions:

Images and transcriptions that you place on the Transkribus server (directly through the program, or via the Text2image tool) remain private: only you have access to them.
You can choose to share (duplicate) certain documents to the Early Modern Dutch group. The sole purpose of this group is to train language models so that Dutch can be recognised more quickly. Members of this group can see each other’s texts.
So it is your choice which documents you share with us! The more material reaches us, the easier it becomes to train language models.
Do you have material (photos and transcriptions) but do not yet use Transkribus? No problem. If you create an account and contact Transkribus (email@transkribus.eu), they can help you through the process. You can make the material available in various ways, and Transkribus will link the images to the transcriptions. (This service is free until May 2019.)
Are you an institution wondering what the benefit is for you? First of all, it trains a language model that can help a great many researchers and institutions (including yours) to recognise handwriting more quickly. You can also use the material you make available, in your own Transkribus account, to create searchable PDFs. You then have an image of the source material with the transcriptions in the background (or, if you prefer, also placed underneath it); you can use this to make the material available to your audience. So it is also a nice way of presenting it!
Costs? Transkribus is free until June 2019. It is currently funded by European research grants (the READ project). After June, “READ-COOP” will start, in which individual users will continue to use the service for free, but ‘large users’ such as institutions will be asked for a contribution. Exactly how high these costs will be is not yet known, but it has been stressed that they will not be too high, since there is an awareness that institutions usually cannot spend much money on this. BUT: for now the service is free, so you can get the “searchable PDFs” in return, and you can always stop using the program after June!
