The Gender History research group at the University of Jena (Thuringia, Germany) have been experimenting with Transkribus as part of a digital edition project on the correspondence of the eighteenth-century regent, Erdmuthe Benigna von Reuß-Ebersdorf (1670-1732).
Early modern scripts are very challenging for Automated Text Recognition technology because letters tend to be closely intertwined, abbreviations are frequent and spelling is not standardised. As the example below suggests, Erdmuthe’s writing is not easy to follow! She had a unique writing style and often broke words into separate parts.
In order to train a model to recognise Erdmuthe’s writing, the Gender History research team used about 250 pages of existing transcripts that had been produced in the course of their work on the digital edition. They also used these same transcripts to create a dictionary of Erdmuthe’s vocabulary that can be integrated into the recognition process.
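To give a rough idea of how a vocabulary list can be derived from existing transcripts, here is a minimal Python sketch. It is not the team's actual workflow: the function name `build_word_list` and the sample lines are invented for illustration, and the file format that Transkribus expects for a dictionary is not covered here. The sketch only counts how often each word form occurs, keeping spelling variants separate because the spelling in the letters is not standardised.

```python
import re
from collections import Counter


def build_word_list(transcripts):
    """Count every word form found in an iterable of transcript strings.

    Hypothetical helper for illustration; spelling variants and
    capitalisation are kept as they appear in the transcripts.
    """
    counts = Counter()
    for text in transcripts:
        # [^\W\d_]+ matches runs of letters, including ä, ö, ü and ß
        counts.update(re.findall(r"[^\W\d_]+", text))
    return counts


if __name__ == "__main__":
    # Invented sample lines, not taken from Erdmuthe's letters
    sample = ["meine liebe Frau", "liebe Frau Mutter"]
    for word, count in build_word_list(sample).most_common():
        print(word, count)
```

A list like this could then be converted into whatever dictionary format the recognition software requires.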
The resulting model is capable of producing automated transcripts of Erdmuthe’s writing with a Character Error Rate (CER) below 9%. When a dictionary is included in the recognition process, the errors are reduced still further.
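For readers unfamiliar with the measure, the Character Error Rate is commonly defined as the number of character insertions, deletions and substitutions needed to turn the automated transcript into the correct one, divided by the length of the correct transcript. The Python sketch below illustrates that general definition; it is not Transkribus's own implementation, and the function names and sample strings are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example: one wrong character in a 16-character line
    reference = "meine liebe Frau"
    hypothesis = "meine liehe Frau"
    print(f"CER: {character_error_rate(reference, hypothesis):.2%}")
```

On this definition, a CER below 9% means that fewer than 9 in every 100 characters of a transcript need to be corrected.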
Martin Prell from the project team has elaborated on this experiment in a report (in German). He covers the experience of preparing training data for text recognition and working directly with Transkribus. If you are thinking about using Transkribus for your own project, this very instructive paper could help!