Preserving Cultural Heritage: Transkribus Integration with Wikimedia Projects

October 5, 2023
News, Transkribus

As of July 2023, Transkribus is proud to be a text recognition engine on Wikisource, which is an online digital library of public domain and freely licensed source texts and historical documents, and a sister project of Wikipedia.

Preserving and sharing historical knowledge is more important than ever, but the task of transcribing and making historical manuscripts accessible is not without its challenges, which is why innovative organisations join forces towards a common goal.

The Wikimedia Foundation — the nonprofit that operates Wikipedia, Wikisource, and other free knowledge Wikimedia projects — and Transkribus have recently started an exciting collaboration that began with the Wikisources Loves Manuscripts project, which is inspired by the digitisation and transcription of historical Balinese manuscripts. In this article, we will explain how this partnership came about and look at how Transkribus can benefit the Wikisource community. Additionally, we will show you how to use Transkribus within the Wikisource platform for a seamless transcription process.

*Wikisource loves manuscripts vertical logo. Via* *Wikimedia Commons* / *CC BY-SA 4.0*

‘Wikisource Loves Manuscripts’ and Transkribus

The Wikisource platform has a vast collection of historical documents, including printed and handwritten sources. The way people can contribute to making all those sources accessible is by either transcribing them manually or using the Wikimedia Optical Character Recognition (OCR) tool to transcribe the pages.

The ‘Wikisource Loves Manuscripts’ project was launched on the 24th International Mother Language Day with the initial goal of digitising and transcribing 20.000 pages of Indonesian manuscripts and making them available on Wikimedia projects. However, when it came to transcribing the Balinese manuscripts, the OCR integrations of Wikisource did not specifically support handwritten texts. Transkribus proved to be a great fit especially due to its Handwritten Text Recognition (HTR) capabilities that now allow the global Wikimedia volunteer community to create and improve text recognition models based on the handwritten texts of their choice.

The Wikimedia Foundation contacted Transkribus about the possibility of working together. Since both the Wikimedia Foundation and Transkribus share the mission of preserving and making cultural heritage accessible to future generations, we, at Transkribus, were happy to collaborate with and fully support the ‘Wikisource Loves Manuscripts’ project.

*Writing on a palm-leaf manuscript. Tropenmuseum Collection. Public domain, via* *Wikimedia Commons* / *CC BY-SA 3.0*

Transcribing Balinese Palm-Leaf Manuscripts

As a traditional writing system from Bali, Indonesia, the Balinese script consists of 47 letters and was used for Balinese, Old Javanese and Sanskrit texts. While today Balinese is mostly written in Latin script and fewer people are familiar with the Balinese script, it remains culturally significant as it is used in traditions such as the creation of palm-leaf manuscripts known as lontar, which preserve religious and literary texts for centuries.

Transkribus differs from standard OCR, in that it uses HTR technology to scan entire lines of text at once, making it particularly well-suited for deciphering complex Balinese manuscripts. While standard OCR works well with printed materials and widely used languages, it struggles with smaller languages and unique handwriting. In contrast, the main advantage of Transkribus is its ability to train models for any script and language, even less common ones. This makes Transkribus an ideal solution for preserving and transcribing historical documents from diverse linguistic backgrounds. This integration of Transkribus’ advanced AI technology into Wikisource makes it available to volunteers as they work to ensure efficient and accurate transcription of historical manuscripts, supporting multiple languages and scripts. Thanks to this specialised approach, Transkribus was able to successfully support the transcription of Balinese palm-leaf manuscripts.

Training a Transkribus Model together with IIIT Hyderabad

The journey to make the Balinese manuscripts accessible started with a “Wikisource Loves Manuscripts” crowdsourcing project for their transcription. Transkribus then offered to first train a custom text recognition model specifically for the Balinese manuscripts and provided over 60.000 free Transkribus credits to support the project. Later, the Wikisource community further improved the model and made it public. With a Transkribus text recognition model that is trained to recognise the handwriting styles and language scripts, the result would be an integrated Transkribus engine that can automatically transform scanned handwritten pages into digital text, which Wikimedia volunteers review and improve.

To start the training process a P2PaLA (Page to Page Layout Analysis) model served as the starting point to detect the text regions of the palm-leaf manuscripts. From there, a baseline model was trained with 50 pages of Ground Truth, meaning 50 pages of accurate and verified transcribed text. Based on this, a text recognition model was developed that performed well enough to transcribe the Balinese manuscripts. Together, the Balinese Wikimedia community, the Wikisource team, the team from the International Institute of Information Technology Hyderabad led by Dr Ravi Kiran and Transkribus made improvements to the Balinese model. As there is always room for improvement, work on this model is still in progress. In addition to this, a new Javanese model is now also being developed.

As the last step, Transkribus provided instructions to the Wikimedia Foundation Culture & Heritage team and the team at the IIIT Hyderabad working on the project on how to train their own models, allowing them to handle future transcriptions independently.

*Screenshot of Letter from Aubrey Hall to Helen, 1935-12-24, p6.png. Via* *Wikisource* / *CC BY-SA 4*.

How to use Transkribus in Wikisource

The Wikimedia Foundation successfully integrated Transkribus’ text recognition technology into the Wikisource platform. Now, users have the option to select which HTR/OCR system they wish to use for transcribing historical documents. This integration enables direct transmission of images to Transkribus servers, returning accurate transcription, and further streamlining the process.

Transkribus is currently available on 27 different language-versions of Wikisource and before transcribing you need to upload the scanned documents on WIkimedia Commons. Then you can start transcribing documents using Transkribus as the text recognition engine by clicking the ‘Transcribe Text’ drop-down menu at the top left of the text editor. For further clarification, have a look at this Wikisource information page.

A Collaborative Effort

Looking ahead, the collaboration between the Wikimedia Foundation and Transkribus opens up new possibilities as this integration helps optimise the digitisation process, making historical content more accessible to the global Wikimedia volunteer community. With the transcription of Balinese manuscripts, this collaboration can be seen as a successful example of preserving and sharing cultural heritage in the digital age.

The success with these manuscripts has sparked the idea of expanding this initiative to include other manuscripts within and beyond Southeast Asia, preserving the rich cultural heritage of the region and making it available to a global audience.

Thumbnail Image: Wikisource loves manuscripts vertical logo. Via Wikimedia Commons / CC BY-SA 4.0

SHARE THIS ARTICLE

Mapping the concerts of Beethoven and Haydn: the “Concert Life in Vienna” project

Some Transkribus projects finish with a complete digitised collection in Transkribus. Some take that digitised source and use it to ...

June 12, 2024

News, Transkribus

What is Carolingian Minuscule?

When you think of Carolingian (or Caroline) minuscule, Charlemagne and his vast Carolingian empire likely come to mind. While the ...

May 14, 2024

Uncategorized

AI models for reading Polish cursive and printed texts

Understanding historical documents is key to understanding history. But understanding historical documents in Polish can be a challenge. Not only ...

Cookie	Description	Duration
viewed_cookie_policy	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.	1 hour
PHPSESSID	This cookie is native to PHP applications. The cookie is used to store and identify a users' unique session ID for the purpose of managing user session on the website. The cookie is a session cookies and is deleted when all the browser windows are closed.	1 year

Cookie	Description	Duration
VISITOR_INFO1_LIVE	This cookie is set by Youtube. Used to track the information of the embedded YouTube videos on a website.	5 months
IDE	Used by Google DoubleClick and stores information about how the user uses the website and any other advertisement before visiting the website. This is used to present users with ads that are relevant to them according to the user profile.	2 years

Cookie	Description	Duration
GPS	This cookie is set by Youtube and registers a unique ID for tracking users based on their geographical location	30 minutes
tk_or	This cookie is set by JetPack plugin on sites using WooCommerce. This is a referral cookie used for analyzing referrer behavior for Jetpack	5 years
tk_r3d	The cookie is installed by JetPack. Used for the internal metrics fo user activities to improve user experience	3 days
tk_lr	This cookie is set by JetPack plugin on sites using WooCommerce. This is a referral cookie used for analyzing referrer behavior for Jetpack	1 year
_ga	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, camapign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assigns a randoly generated number to identify unique visitors.	2 years
_gid	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.	1 day
matomo	For statistical analysis, we use “Matomo” on this website. This is an open source tool for web analysis. Matomo does not transmit data to servers outside the control of the READ-COOP. Matomo is deactivated when you visit our website. Only if you actively consent will your usage behaviour be recorded anonymously.	1 year

Cookie	Description	Duration
YSC	This cookies is set by Youtube and is used to track the views of embedded videos.	1 year
_gat	This cookies is installed by Google Universal Analytics to throttle the request rate to limit the colllection of data on high traffic sites.	1 minute

Preserving Cultural Heritage: Transkribus Integration with Wikimedia Projects

‘Wikisource Loves Manuscripts’ and Transkribus

Transcribing Balinese Palm-Leaf Manuscripts

Training a Transkribus Model together with IIIT Hyderabad

How to use Transkribus in Wikisource

A Collaborative Effort

Recent Posts

Mapping the concerts of Beethoven and Haydn: the “Concert Life in Vienna” project

What is Carolingian Minuscule?

AI models for reading Polish cursive and printed texts

The COOP

Products & Services

Useful information

Helpful resources

Community