Creating a linguistic corpus with Transkribus: Ewa Rodek

Languages evolve over time: even first-year linguistics students could tell you that. But understanding how they evolve is slighty more challenging.

At the Institute of Polish Language – Polish Academy of Sciences, a team of researchers is trying to gather insights into the evolution of the Polish language. They are creating a digital linguistic corpus of Polish texts from the 17th and 18th centuries, to make it easier to analayse the language used at this time. And as the corpus is designed to be entirely digital, they decided to transcribe texts using Transkribus.

We spoke to team member Dr Ewa Rodek to find out more about this exciting project in the field of Polish linguistics.

A Unique Linguistic Corpus

Ewa Rodek has always been a fan of languages and their evolution: “I really like the history of language, especially historical lexicography and literary culture.” Luckily for Ewa, she was able to turn her passion into a career at the Institute of Polish Language – Polish Academy of Sciences. Her team is currently working on a digital corpus of Polish texts from the 17th and 18th centuries, the so-called KorBa project. This unique corpus contains Polish texts of many different genres and styles from the Baroque and Enlightenment eras. Once completed, it will contain over 25 million tokens tagged according to their structure and morphology, making it the largest diachronic corpus of this type.

An impression of the variety of the material Ewa and her Team are dealing with

“KorBa is the first diachronic corpus of this size in Polish,” Ewa told us. “Its rich lexis makes it important for the scientific community. But its also important for our team because it is the main material base for the creation of the eSXVII.” The eSXVII is short for the Electronic Dictionary of the 17th- and 18th-century Polish, which the same team has been working on since 2004. “The corpus is integrated with eSXVII in such a way that the user can easily switch from an entry in the dictionary, to a specific search in the corpus, where the user can see how many times the word has been used and in what context.” In short, this project will give researchers much more information about the historical usage of individual words than they previously had.

Choosing Transkribus For Transcription

Historical texts like the ones in this project are not always easy to transcribe. “Our material is not exactly homogenous,” Ewa explained. “The manuscripts often have several different typefaces and even different languages on the same page — Polish coexists with Latin, German or French. Some of the documents also have stains or are damaged.” That’s why, at the start of the project, Ewa’s team chose to transcribe the documents manually. But that wasn’t as successful as they had hoped. “I have been involved in the workflow from the beginning, so I know the problems you encounter using manual transcribers, especially regarding delays. The transcription and proofreading of documents in Gothic script also requires specialist knowledge, which makes it quite expensive.”

The team had already ruled out using OCR technology — “I had a lot of experience with OCR software and I knew that it would not improve our work” — but then Ewa discovered the HTR possibilities of Transkribus. The software would not only make things quicker, but the team were able to use it even though they had no experience in coding or software. “It was particularly important that we could manage the work ourselves and didn’t have to ask our IT colleagues for help. Using Transkribus was also much cheaper than hiring transcribers.”

Transcribing The Documents

And thankfully, Ewa did not regret her decision to use Transkribus. The team started by creating their own AI models, and they achieved some pretty impressive character error rates. “The biggest advantage of Transkribus is that it learns very quickly. We developed two models – one for printed texts (with a CER of 0.29%), the other for manuscripts (with a CER of 1.8%),” Ewa explained. And this was despite the various fonts, languages and conditions of the documents: “Transkribus offset these difficulties and handled them very well.”

The fact Transkribus can be used easily by a whole team was also a plus for this large project. “The ability to work in teams, administer models and document collections is very helpful,” Ewa said. “Exporting a pdf file from an image with an underlying text layer is extremely powerful as well. I will definitely use this option in my future projects.”

The Next Steps

Ewa’s team might have completed the transcription of the texts, but the KorBa project isn’t quite finished yet. “We have now finished the longest part of the process – that is collecting and transcribing the texts. Now we are preparing the training material for the tagger software and then we have to combine the results of the first and second editions of the project.” It sounds like the team are going to be busy for a while yet!

Thank you to Ewa Rodek and her team from the KorBa project for talking to us!

Ewa’s Trankribus Tip

Before training the model it is important to establish which characters you want to retain and which are just the writer’s spelling mannerisms, particularly if the alphabet in your documents isn’t standardised. For example, the character ÿ appeared in our material. After a while, we realised that this wasn’t a ligature from the combination of the letters ij, but was used interchangeably with the letter y. Therefore, we stopped marking the character ÿ as a separate letter and transcribed it simply as y. By establishing a list of characters like this, you can avoid errors in the Ground Truth and so create a more accurate transcription.
Ewa Rodek, Institute of Polish Language – Polish Academy of Sciences

Cookie	Description	Duration
viewed_cookie_policy	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.	1 hour
PHPSESSID	This cookie is native to PHP applications. The cookie is used to store and identify a users' unique session ID for the purpose of managing user session on the website. The cookie is a session cookies and is deleted when all the browser windows are closed.	1 year

Cookie	Description	Duration
VISITOR_INFO1_LIVE	This cookie is set by Youtube. Used to track the information of the embedded YouTube videos on a website.	5 months
IDE	Used by Google DoubleClick and stores information about how the user uses the website and any other advertisement before visiting the website. This is used to present users with ads that are relevant to them according to the user profile.	2 years

Cookie	Description	Duration
GPS	This cookie is set by Youtube and registers a unique ID for tracking users based on their geographical location	30 minutes
tk_or	This cookie is set by JetPack plugin on sites using WooCommerce. This is a referral cookie used for analyzing referrer behavior for Jetpack	5 years
tk_r3d	The cookie is installed by JetPack. Used for the internal metrics fo user activities to improve user experience	3 days
tk_lr	This cookie is set by JetPack plugin on sites using WooCommerce. This is a referral cookie used for analyzing referrer behavior for Jetpack	1 year
_ga	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, camapign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assigns a randoly generated number to identify unique visitors.	2 years
_gid	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.	1 day
matomo	For statistical analysis, we use “Matomo” on this website. This is an open source tool for web analysis. Matomo does not transmit data to servers outside the control of the READ-COOP. Matomo is deactivated when you visit our website. Only if you actively consent will your usage behaviour be recorded anonymously.	1 year

Cookie	Description	Duration
YSC	This cookies is set by Youtube and is used to track the views of embedded videos.	1 year
_gat	This cookies is installed by Google Universal Analytics to throttle the request rate to limit the colllection of data on high traffic sites.	1 minute

Creating a linguistic corpus with Transkribus: Ewa Rodek

A Unique Linguistic Corpus

Choosing Transkribus For Transcription

Transcribing The Documents

The Next Steps

Ewa’s Trankribus Tip

The COOP

Products & Services

Useful information

Helpful resources

Community