There are now thousands of Transkribus users working with documents of all dates, languages and formats. Today we would like to highlight some great work on the first Automated Text Recognition models for Danish handwriting.
Vagn Mørkeberg Christiansen is a retired volunteer at the Faxe Municipality Archives in Denmark. The archives were interested in using Transkribus to open up a collection of early twentieth-century minutes for transcription and searching. Vagn was invited to undertake this experiment.
Vagn used Transkribus to create training data for Automated Text Recognition by transcribing a few hundred pages from a collection of minutes from the parish of Braaby. These minutes were written between 1912 and 1931 by J. P. Jensen and O. Christov, who were both chairmen of the local council. Both wrote relatively clearly, although the documents contain a few complications, such as abbreviations and characters that are easy to confuse with one another.
At the latest count, Vagn has transcribed around 325 pages in Transkribus. These pages were used to create three text recognition models for the two different hands in the collection.
The first model was trained on 17,500 words of Jensen’s writing and the results were promising. Automated transcripts generated with this model reached an average Character Error Rate of 7.7%.
The next two models were trained on Christov’s writing, the first with around 16,000 words and the second with some 23,000 words. Happily, adding more training data brought a significant improvement in the automated transcription: the average Character Error Rate fell from 9.9% to 4.7%, less than half its earlier value.
These figures represent very good results for Automated Text Recognition. Transcripts with these kinds of Character Error Rates can be easily read, searched and corrected.
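For readers unfamiliar with the metric, Character Error Rate is commonly defined as the number of single-character insertions, deletions and substitutions needed to turn the automated transcript into the correct one, divided by the length of the correct transcript. The Python sketch below assumes that common definition. It is not the code Transkribus itself uses, and the function names and the sample strings are made up purely for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct (reference) transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example: one wrong character in a 14-character reference.
    correct = "parish council"
    automated = "parish councll"
    print(f"CER: {character_error_rate(correct, automated):.1%}")
```

On this definition, a Character Error Rate of 4.7% corresponds to roughly one character in every twenty-one needing correction.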
The improvement in the model trained to recognise Christov’s handwriting is also an excellent demonstration of the big data approach behind Transkribus. The more images and transcripts submitted to our platform as training data, the more accurate the recognition can become.
Vagn is enthusiastic about these results and plans to keep transcribing and training models. His next target is to retrain the Christov model once again – this time with 40,000 transcribed words!
If you would like to train your own Automated Text Recognition model in Transkribus, take a look at the How to Guides on the Transkribus wiki.
We are also working on a beta version of Transkribus Web, a streamlined web version of Transkribus where volunteers like Vagn will be able to transcribe training material for text recognition more easily.
We would like to thank Vagn Mørkeberg Christiansen for providing the information for this news post.