Ethiopic, Hebrew, Devanagari, Balinese and Ottoman Turkish: 5 public AI models for transcribing non-Latin scripts

May 3, 2023
Transkribus

If you scroll through the list of Transkribus’ public AI models, you might be forgiven for thinking the platform can only be used for European languages in Latin scripts, such as German, English, or Dutch. But thankfully for researchers working with more “unusual” languages, this is not the case. Transkribus has a wide variety of public models for many different scripts and languages, all of which can be used to automatically transcribe printed or handwritten texts in those languages.

In this post, we are going to take a look at four public AI models for the non-Latin scripts of Ethiopic, Hebrew, Devanagari, Balinese and Ottoman Turkish, and see how they can make your work with documents in those scripts more efficient.

Ahmet Sudi Bosnawi (d. circa 1595), Shehr-i Hafiz, signed Taher ‘Umar, Ottoman Turkey, dated 1821-2. Public Domain, via Wikimedia Commons

Ottoman Turkish Print

As the official language of the Ottoman Empire (14th to 20th centuries CE), Ottoman Turkish was primarily a literary language, written in Arabic script and heavily influenced by Arabic and Persian. With the end of the Ottoman Empire in 1928 and the establishment of the Republic of Turkey, a language reform followed, where the public use of Ottoman Turkish was replaced by what is now called ‘modern Turkish’. As part of this reform, loanwords of Persian and Arabic origin were replaced with their Turkish equivalents, and the Ottoman Turkish script was changed to a Latin alphabet for the Turkish language. Since Ottoman Turkish was an official script in the Ottoman Empire for multiple centuries, even though it could be considered a ‘dead’ script today, it is valuable to have legible and available sources to help us understand the past.

This model for Ottoman Turkish was created by the Digital Ottoman Corpora team, led by Süphan Kırmızıaltın, who are working on transcribing Ottoman Turkish printed texts into modern Turkish to make them more accessible. The materials used to train the model were six Ottoman Turkish periodicals from the late 19th and early 20th centuries, covering a wide range of topics, and an Ottoman Turkish dictionary.

→ Go to model

King Theodores Bible, double folio: St. John. Wellcome Collection. Public domain, via Wikimedia Commons

Ethiopic – Classical Ethiopic scripts from Ethiopia and Eritrea

Often known as Ge-ez, Classical Ethiopic was one of the most important Semitic languages in the part of Africa which is now Ethiopia and Eritrea. This part of the world had considerable cultural, political, and religious influence in the Late Antique and Medieval periods and so understanding the Ethiopic language is important for understanding primary sources written at the time. While it is no longer a living language, Ethiopic is closely related to the modern-day languages of Tigrinya, Arabic and Hebrew and remains the liturgical language of both the Ethiopian and the Eritrean Orthodox Tewahedo Churches.

This Classical Ethiopic model was developed as part of the Beta maṣāḥǝft project, which is hosted by the Hiob Ludolf Centre for Ethiopian Studies at the University of Hamburg. The project aims to create a multimedia research environment for the study of Classical Ethiopic manuscripts and the training data for this model was also used for the project’s database. The model has a CER of just 3.8%.

→ Go to model

Bhagavata Purana manuscript, 18 century. Bhaktivedanta Research Centre, Kolkata. Public domain, via Wikimedia Commons

Devanagari Mixed M1A

The Devanagari script, sometimes incorrectly known as the “Indian alphabet”, is the writing system used for several Indic Aryan languages including Hindi, Sanskrit, and Marathi. Letters in the Devanagari alphabet have a long horizontal stroke at the top of each one, which are joined together with the strokes of all the other letters in the world. Not only is the Devanagari script used on a day-to-day basis by millions of people around the world, but it is also the script used for most sacred texts in Hinduism, as well as many in Buddhism and Jainism, and so has an important religious significance too.

This model for the Devanagari script has been trained on a range of materials in the languages of Hindi, Sanskrit, Braj Bhasha, and Awadhi. All the materials were published in print by the Naval Kishore Press in Lakhnau, North India, during the late 19th and early 20th century. The model was created by the University Library of Heidelberg and has a CER of just 2.2%.

→ Go to model

“All Souls Deuteronomy” from the Dead Sea Scrolls. Leon Levy Dead Sea Scrolls Digital Library. Public domain, via Wikimedia Commons

Hebrew Script Languages

A descendant of the Aramaic alphabet, the Hebrew script is the writing system not just for the Semitic language of Hebrew, but also for several other languages including Yiddish and Ladino. It is also the script used for most sacred Jewish texts, as well as many other cultural works, making it one of the most studied non-Latin scripts. Interestingly, the Hebrew alphabet originally contained only consonants and speakers had to fill in the vowels themselves when reading texts out loud. However, over the years, scholars and scribes started to mark vowels using a series of points known as niqqud.

This model for Hebrew script languages was created by the Digitizing Jewish Studies project, which was set up by Dr. Sinai Rusinek, at Haifa Universit and supported by the Rothschild Foundation Hanadiv Europe. The model was trained with texts in several languages, including Hebrew, Yiddish, and Judeo-Arabic, making it suitable for a wide variety of documents in the Hebrew script.

→ Go to model

Writing on a palm-leaf manuscript. Tropenmuseum Collection. Public domain, via Wikimedia Commons

Balinese Palm-Leaf Manuscripts 16th Century

The Balinese script is a traditional writing system from the island of Bali, Indonesia, which was used for texts in the Balinese language as well as in Old Javanese and Sanskrit. The alphabet consists of 47 letters — not all of which are used when writing Balinese — and uses diacritics to indicate the pronunciation of certain syllables. Nowadays, the Balinese language is usually written in the Latin script, and increasingly fewer people are familiar with the Balinese script. However, it still has a strong cultural significance and features in many of the island’s traditions.

One of those traditions is the creation of palm-leaf manuscripts, or lontar. These have been used for centuries as a way of preserving texts, from religious scriptures to works of literature. Developed by researchers at the NIT Trichy and the IIIT Hyderabad in India, this model was based on a range of palm-leaf manuscripts in the Balinese scripts. All training material dates from the 16th century, making it ideal for transcribing manuscripts from this period.

If you work with palm-leaf manuscripts, you may also be interested in our current collaboration with the Wikimedia Foundation, which aims to digitise and transcribe over 20,000 handwritten Indonesian palm-leaf manuscripts. You can find out more about that on their website.

→ Go to model

How do I use a public AI model with Transkribus?

Transkribus’ transcriptions are based on AI models. Each model has been trained to read a specific type of handwritten or printed text in a certain language, and often a certain time period or genre too.

If you want to transcribe a document with Transkribus, you first need to upload a scan of the document and then you choose a model. There are currently 94 public models available, which are all completely free to use. Transkribus will take the information stored in that model and apply it to your document, creating an instant transcription.

But what if there isn’t a model that is suitable for the text in your documents? Then you also have the chance to train your own. To do this, you need a series of pre-transcribed documents, collectively known as “Ground Truth”. The more Ground Truth you use to train your model, the more information it will contain and the more accurate it will be when transcribing new documents. To save time, many people use a public model as the base for their custom model and then fine-tune it with further Ground Truth.

For more information about models and how to train them, check out our Help Center.

Upload a document and give Transkribus a try:

SHARE THIS ARTICLE

Mapping the concerts of Beethoven and Haydn: the “Concert Life in Vienna” project

Some Transkribus projects finish with a complete digitised collection in Transkribus. Some take that digitised source and use it to ...

June 12, 2024

News, Transkribus

What is Carolingian Minuscule?

When you think of Carolingian (or Caroline) minuscule, Charlemagne and his vast Carolingian empire likely come to mind. While the ...

May 14, 2024

Uncategorized

AI models for reading Polish cursive and printed texts

Understanding historical documents is key to understanding history. But understanding historical documents in Polish can be a challenge. Not only ...

Cookie	Description	Duration
viewed_cookie_policy	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.	1 hour
PHPSESSID	This cookie is native to PHP applications. The cookie is used to store and identify a users' unique session ID for the purpose of managing user session on the website. The cookie is a session cookies and is deleted when all the browser windows are closed.	1 year

Cookie	Description	Duration
VISITOR_INFO1_LIVE	This cookie is set by Youtube. Used to track the information of the embedded YouTube videos on a website.	5 months
IDE	Used by Google DoubleClick and stores information about how the user uses the website and any other advertisement before visiting the website. This is used to present users with ads that are relevant to them according to the user profile.	2 years

Cookie	Description	Duration
GPS	This cookie is set by Youtube and registers a unique ID for tracking users based on their geographical location	30 minutes
tk_or	This cookie is set by JetPack plugin on sites using WooCommerce. This is a referral cookie used for analyzing referrer behavior for Jetpack	5 years
tk_r3d	The cookie is installed by JetPack. Used for the internal metrics fo user activities to improve user experience	3 days
tk_lr	This cookie is set by JetPack plugin on sites using WooCommerce. This is a referral cookie used for analyzing referrer behavior for Jetpack	1 year
_ga	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, camapign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assigns a randoly generated number to identify unique visitors.	2 years
_gid	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.	1 day
matomo	For statistical analysis, we use “Matomo” on this website. This is an open source tool for web analysis. Matomo does not transmit data to servers outside the control of the READ-COOP. Matomo is deactivated when you visit our website. Only if you actively consent will your usage behaviour be recorded anonymously.	1 year

Cookie	Description	Duration
YSC	This cookies is set by Youtube and is used to track the views of embedded videos.	1 year
_gat	This cookies is installed by Google Universal Analytics to throttle the request rate to limit the colllection of data on high traffic sites.	1 minute

Ethiopic, Hebrew, Devanagari, Balinese and Ottoman Turkish: 5 public AI models for transcribing non-Latin scripts

Ottoman Turkish Print

Ethiopic – Classical Ethiopic scripts from Ethiopia and Eritrea

Devanagari Mixed M1A

Hebrew Script Languages

Balinese Palm-Leaf Manuscripts 16th Century

How do I use a public AI model with Transkribus?

For more information about models and how to train them, check out our Help Center.

Upload a document and give Transkribus a try:

Recent Posts

Mapping the concerts of Beethoven and Haydn: the “Concert Life in Vienna” project

What is Carolingian Minuscule?

AI models for reading Polish cursive and printed texts

The COOP

Products & Services

Useful information

Helpful resources

Community