Yes, you read that correctly – our Transkribus platform can indeed recognise printed Indian texts.
Conventional OCR software usually struggles to decipher the complexities of South Asian scripts. Two projects have recently been working with nineteenth-century printed texts in Transkribus in the hope of getting better results. Using images and transcripts from a collection, Transkribus users can train a model to recognise printed text of any type.
First, the British Library’s Two Centuries of Indian Print project is creating a digitised collection of works published in South Asia in the eighteenth and nineteenth centuries. The project team trained a text recognition model in Transkribus with 50 pages (containing 5,700 words) of digitised images and transcripts from Bengali books. The resulting model can produce transcripts of pages from the collection with an average Character Error Rate of 21%. Although this is a relatively high error rate, the team are planning to retrain the model with more pages of training data, focusing on the elements of Bengali characters that the software sometimes missed.
The Naval Kishore Press was a nineteenth-century publishing house which brought works on various subjects to market in Hindi, Urdu, Arabic, Persian and Sanskrit. Part of its output is held by the library of the South Asia Institute (SAI) at Heidelberg University. The South Asia Institute library and Heidelberg University Library are collaborating on the Naval Kishore Press – digital project, working to produce digitised and machine-readable text for a selection of works published by this press. The project team used 200 pages of images and transcripts to train a model in Transkribus to recognise Hindi and Sanskrit text. This model can produce transcripts of the collection with a Character Error Rate of around 5%. Fully searchable images and transcripts from the collection are now available to consult, download and annotate on Heidelberg University Library’s online catalogue.
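For readers wondering what a Character Error Rate of 21% or 5% means in practice, the figure is commonly defined as the number of character insertions, deletions and substitutions needed to turn the automatic transcript into the correct one, divided by the length of the correct transcript. The Python sketch below illustrates that general definition only; it is not Transkribus code, the function names are hypothetical, and it counts Unicode code points, which may differ from how Transkribus counts characters in Indic scripts, where one visible character can consist of several code points.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct (reference) transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Three edits against a six-character reference give a CER of 50%.
print(character_error_rate("kitten", "sitting"))  # 0.5
```

On this definition, a rate of around 5% corresponds to roughly one wrong character in every twenty, and 21% to roughly one in five.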
Read more: