Preserving Cultural Heritage: Transkribus Integration with Wikimedia Projects

As of July 2023, Transkribus is proud to be available as a text recognition engine on Wikisource, an online digital library of public domain and freely licensed source texts and historical documents, and a sister project of Wikipedia.

Preserving and sharing historical knowledge is more important than ever, but transcribing historical manuscripts and making them accessible is not without its challenges, which is why innovative organisations join forces towards a common goal.

The Wikimedia Foundation (the nonprofit that operates Wikipedia, Wikisource, and other free knowledge Wikimedia projects) and Transkribus have recently started an exciting collaboration. It began with the ‘Wikisource Loves Manuscripts’ project, which was inspired by the digitisation and transcription of historical Balinese manuscripts. In this article, we will explain how this partnership came about and look at how Transkribus can benefit the Wikisource community. We will also show you how to use Transkribus within the Wikisource platform for a seamless transcription process.

Wikisource loves manuscripts vertical logo. Via Wikimedia Commons / CC BY-SA 4.0

‘Wikisource Loves Manuscripts’ and Transkribus

The Wikisource platform has a vast collection of historical documents, including printed and handwritten sources. People can contribute to making these sources accessible either by transcribing them manually or by using the Wikimedia Optical Character Recognition (OCR) tool to transcribe the pages.

The ‘Wikisource Loves Manuscripts’ project was launched on the 24th International Mother Language Day with the initial goal of digitising and transcribing 20,000 pages of Indonesian manuscripts and making them available on Wikimedia projects. However, when it came to transcribing the Balinese manuscripts, the existing OCR integrations of Wikisource did not specifically support handwritten texts. Transkribus proved to be a great fit, especially because of its Handwritten Text Recognition (HTR) capabilities, which now allow the global Wikimedia volunteer community to create and improve text recognition models based on the handwritten texts of their choice.

The Wikimedia Foundation contacted Transkribus about the possibility of working together. Since the Wikimedia Foundation and Transkribus share the mission of preserving cultural heritage and making it accessible to future generations, we at Transkribus were happy to collaborate with and fully support the ‘Wikisource Loves Manuscripts’ project.

Writing on a palm-leaf manuscript. Tropenmuseum Collection. Public domain, via Wikimedia Commons / CC BY-SA 3.0

Transcribing Balinese Palm-Leaf Manuscripts 

The Balinese script is a traditional writing system from Bali, Indonesia. It consists of 47 letters and was used for Balinese, Old Javanese and Sanskrit texts. Today, Balinese is mostly written in Latin script and fewer people are familiar with the Balinese script, but it remains culturally significant: it is still used in traditions such as the creation of palm-leaf manuscripts known as lontar, which have preserved religious and literary texts for centuries.

Transkribus differs from standard OCR in that it uses HTR technology to read entire lines of text at once, which makes it particularly well suited to deciphering complex Balinese manuscripts. While standard OCR works well with printed materials and widely used languages, it struggles with smaller languages and unique handwriting. In contrast, the main advantage of Transkribus is that models can be trained for any script and language, even less common ones, which makes it an ideal solution for transcribing and preserving historical documents from diverse linguistic backgrounds. Integrating this technology into Wikisource makes it available to volunteers as they work on efficient and accurate transcriptions of historical manuscripts in multiple languages and scripts. Thanks to this specialised approach, Transkribus was able to successfully support the transcription of Balinese palm-leaf manuscripts.

Training a Transkribus Model together with IIIT Hyderabad

The journey to make the Balinese manuscripts accessible started with a ‘Wikisource Loves Manuscripts’ crowdsourcing project for their transcription. Transkribus then offered to train a first custom text recognition model specifically for the Balinese manuscripts and provided over 60,000 free Transkribus credits to support the project. Later, the Wikisource community further improved the model and made it public. The goal was a text recognition model trained on the relevant handwriting styles and scripts, so that the integrated Transkribus engine could automatically transform scanned handwritten pages into digital text, which Wikimedia volunteers then review and improve.

To start the training process, a P2PaLA (Page to Page Layout Analysis) model served as the starting point for detecting the text regions of the palm-leaf manuscripts. From there, a baseline model was trained with 50 pages of Ground Truth, meaning 50 pages of accurate and verified transcribed text. Based on this, a text recognition model was developed that performed well enough to transcribe the Balinese manuscripts. Together, the Balinese Wikimedia community, the Wikisource team, the team from the International Institute of Information Technology Hyderabad (IIIT Hyderabad) led by Dr Ravi Kiran, and Transkribus made improvements to the Balinese model. As there is always room for improvement, work on this model is still in progress. In addition, a new Javanese model is now being developed.
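To make the role of Ground Truth more concrete, here is a minimal sketch of how a model's output can be compared against a verified transcription. It assumes quality is judged by character error rate (CER), a measure commonly used for text recognition models; the article does not state which metric was used for the Balinese model. The function names and sample strings are hypothetical and for illustration only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ref_char
                    current[j - 1] + 1,                   # insert hyp_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, prediction: str) -> float:
    """Edit distance divided by the length of the verified transcription."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return edit_distance(ground_truth, prediction) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical example line, not taken from the project's data.
    verified_line = "lontar manuscript"
    recognised_line = "lontor manuscript"
    print(f"CER: {character_error_rate(verified_line, recognised_line):.3f}")
```

A lower value means the recognised text is closer to the Ground Truth. Note that this sketch counts Unicode code points, so for scripts that rely on combining marks, such as Balinese, the text may need to be normalised or segmented differently before comparison.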

As the last step, Transkribus showed the Wikimedia Foundation Culture & Heritage team and the IIIT Hyderabad team working on the project how to train their own models, allowing them to handle future transcriptions independently.

Screenshot of Letter from Aubrey Hall to Helen, 1935-12-24, p6.png. Via Wikisource / CC BY-SA 4.

How to use Transkribus in Wikisource

The Wikimedia Foundation has successfully integrated Transkribus’ text recognition technology into the Wikisource platform. Users can now select which HTR/OCR system they wish to use for transcribing historical documents. With this integration, page images are sent directly to the Transkribus servers, which return the recognised text, further streamlining the process.

Transkribus is currently available on 27 different language versions of Wikisource. Before transcribing, you need to upload the scanned documents to Wikimedia Commons. You can then start transcribing with Transkribus as the text recognition engine by clicking the ‘Transcribe Text’ drop-down menu at the top left of the text editor. For further clarification, have a look at this Wikisource information page.

A Collaborative Effort 

Looking ahead, the collaboration between the Wikimedia Foundation and Transkribus opens up new possibilities: the integration helps optimise the digitisation process and makes historical content more accessible to the global Wikimedia volunteer community. With the transcription of Balinese manuscripts, this collaboration can be seen as a successful example of preserving and sharing cultural heritage in the digital age.

The success with these manuscripts has sparked the idea of expanding this initiative to include other manuscripts within and beyond Southeast Asia, preserving the rich cultural heritage of the region and making it available to a global audience.



