5 AI Models For Transcribing Old Russian Handwriting And Printed Russian Texts

January 26, 2023
HTR models

As one of the world’s largest countries, Russia is also one of the most studied. Its turbulent history and influence on world politics make it the focus of many research projects, which often use historical documents — such as local registers, birth records or even personal diaries — as their primary sources.

Deciphering the old Cyrillic handwriting or print in these documents used to be a time-consuming challenge requiring years of training. But AI has changed this. Using AI text recognition technology such as Transkribus, researchers can now simply run a scan of the document through the software and get an instant, automatic transcription. And as we all know, the less time we have to spend transcribing, the more time we have for the more satisfying parts of historical research.

If you work with historical documents in Russian, here are five public AI models that you can use with Transkribus to get instant transcriptions of your texts.

Russian Generic Handwriting 2

If you have a mix of documents from different genres and time periods, then this is probably the best model to start with. Based on earlier models from the Estonian State Archives and the INEL project in Hamburg, as well as the Russian Civil Records model (see below) and the Prozhito database, it encompasses a wide range of Ground Truths mostly from the late 19th and early 20th centuries.

With a CER (character error rate) of 5.8%, it is capable of giving fairly accurate transcriptions for a wide variety of documents and is an excellent starting point for training your own model.
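To make the CER figures in this article more concrete, here is a minimal Python sketch of how a character error rate is commonly calculated: the edit distance between the model's output and the correct transcription, divided by the length of the correct transcription. This is the usual textbook definition and only an assumption about how the published figures were produced. The function names and the sample line are made up for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example line: a pre-reform spelling and a modernised reading of it.
ground_truth = "родился въ городѣ"
model_output = "родился в городе"

cer = character_error_rate(ground_truth, model_output)
print(f"CER: {cer:.1%}")
```

On this definition, a CER of 5.8% means that roughly 58 out of every 1,000 characters would need correcting, on texts similar to those the model was validated on.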

→ Go to model

Russian Civil Records

This interesting model was created by the L’Dor V’Dor Foundation, who preserve Jewish historical records from around the world. They took handwritten civil records from Congress Poland, Ukraine and Russia from 1914 to 1968 as their Ground Truths, creating a model with a CER of 7.3%.

The model works particularly well with handwritten records from Congress Poland.

→ Go to model

Russian Handwriting Early 20th Century

This model is ideal for use with pre-reform Cyrillic documents. It was trained on bilingual Evenki/Russian manuscripts by Russian ethnographer and linguist Konstantin M. Rychkov, who collected various pieces of cultural information from the Evenki and translated them into Russian.

The Ground Truth consisted of 581 pages from the Rychkov archive dating from 1911-1913, and it has a CER of 4.4%. The model was also created by the INEL project at the University of Hamburg.

→ Go to model

Russian Print 18th Century (V. Okorokov’s Printing House)

Created at the European University in St Petersburg, this model was based on a series of scientific papers published by V. Okorokov’s Printing House at Moscow State University. The papers were all printed in Russian, with some scientific terms given in Latin script.

The CER on the validation set is just 0.6% and the model shows good results on printed texts from other publishing houses of the era.

→ Go to model

Russian Print 18th Century

This more recent print model is based on Ground Truths from a wider variety of publishing houses operating in the 18th century, including those at the Academy of Sciences in St Petersburg and the Imperial Moscow University. It was developed as part of a student project at HSE University.

With a CER of 2.4%, it gives good results on Russian-language texts, but it does not recognise other languages that may appear in the text.

→ Go to model

How do I use a public AI model?

Transkribus’ transcriptions are based on AI models. Each model has been trained to read a specific type of handwritten or printed text in a certain language, and often a certain time period or genre too.
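As a rough aid for choosing between the five models above, the sketch below collects their names, script types and quoted CERs in a small Python table and estimates how many characters per 1,000 you might expect to correct. The class, field and function names are hypothetical and are not part of Transkribus. The estimate assumes your documents resemble each model's validation data, which may not hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicModel:
    name: str
    script: str         # "handwriting" or "print"
    cer_percent: float  # CER as quoted in this article


MODELS = [
    PublicModel("Russian Generic Handwriting 2", "handwriting", 5.8),
    PublicModel("Russian Civil Records", "handwriting", 7.3),
    PublicModel("Russian Handwriting Early 20th Century", "handwriting", 4.4),
    PublicModel("Russian Print 18th Century (V. Okorokov's Printing House)", "print", 0.6),
    PublicModel("Russian Print 18th Century", "print", 2.4),
]


def models_for(script: str) -> list[PublicModel]:
    """Return the models listed in this article for one script type."""
    return [model for model in MODELS if model.script == script]


def expected_errors(model: PublicModel, characters: int = 1000) -> int:
    """Rough number of characters to correct, assuming the quoted CER holds."""
    return round(characters * model.cer_percent / 100)


for model in models_for("handwriting"):
    print(f"{model.name}: about {expected_errors(model)} corrections per 1,000 characters")
```

The quoted CERs were measured on different validation sets, so they are not directly comparable: a model with a higher figure can still be the better fit if its training material is closer to your documents.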

If you want to transcribe a document with Transkribus, you first need to upload a scan of the document and then you choose a model. There are currently 94 public models available, which are all completely free to use. Transkribus will take the information stored in that model and apply it to your document, creating an instant transcription.

But what if there isn’t a model that is suitable for the text in your documents? Then you can train your own. To do this, you need a series of pre-transcribed documents known as “Ground Truths”. The more Ground Truths you use to train your model, the more information it will contain and the more accurate it will be when transcribing new documents. To save time, many people use a public model as the base for their custom model and then fine-tune it with further Ground Truths.

For more information about models and how to train them, check out our How-to Guides.



