Two partners in the READ project network have now successfully trained a new model to recognise Gothic handwriting! The State Archives of Zurich (READ project partner) and the University of Zurich (READ project Memorandum of Understanding partner) have collaborated on the automatic recognition of a collection of medieval charters.
In 1336 a cartulary was written in Königsfelden, close to the city of Brugg (which is now part of Switzerland). Königsfelden abbey was a well-endowed institution with close ties to the dukes of Habsburg. The charters of the institution were copied in neat and regular handwriting on roughly 260 parchment pages. The cartulary is available online via e-codices.
At the University of Zurich, there is an ongoing project to create a digital scholarly edition of the charters of Königsfelden abbey. The cartulary is an important source for early writing practices and has already been partially transcribed. The project team have been using our Transkribus platform to produce their transcriptions and they used these transcripts to train and test a Handwritten Text Recognition (HTR) model.
The model was trained on transcripts of around 26,000 words from the charters. These documents are written in a regular script with evenly ruled lines, which helps the technology to process the pages more easily. The HTR model is able to automatically produce transcripts of documents in the collection with an astonishing Character Error Rate (CER) of 10%.
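For readers unfamiliar with the measure, a Character Error Rate is commonly computed as the edit distance between the automatic transcript and a reference transcript, divided by the length of the reference. The sketch below follows that common definition; it is an illustrative assumption and not necessarily the exact calculation used inside Transkribus. The function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a CER of 10% means that roughly one character in ten of the automatic transcript would have to be corrected to match the reference.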
Transkribus has been able to deal with some of the intricacies common to medieval documents. Thanks to the integration of Unicode, superscripts on letters, such as uͤ, can also be recognized by the HTR. Don’t expect this recognition to work perfectly, though: the signs are sometimes so small that even expert paleographers debate their meaning!
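To see how such a letter is stored, it can help to list the code points behind it. The short sketch below uses Python's standard `unicodedata` module and assumes that the superscript e is encoded as U+0364 (combining Latin small letter e) following a plain u; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for every character in `text`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Assumption: "u" followed by U+0364 gives the u with a superscript e.
u_with_superscript_e = "u\u0364"

for code_point, name in describe(u_with_superscript_e):
    print(code_point, name)
```

What looks like a single letter on the page is therefore two code points in the transcript: a base letter and a combining sign placed above it.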
Furthermore, one of the main problems regarding pre-modern handwriting could partially be dealt with: abbreviations were indicated in the process of transcription by using combining diacritics such as ‘ ̄ ’ (U+0305 combining overline) or by entering the correct signs from Unicode.
Since the transcripts provided as training data were consistent, the automatic recognition of abbreviations (or rather the correct transcription using abbreviation signs) could in some cases be achieved. In order to produce easily legible transcriptions or even scholarly editions, these signs can be searched and replaced in Transkribus or in another editor at a later stage.
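Outside Transkribus, such a search and replace could look like the following sketch. It assumes that abbreviations are marked with U+0305 (combining overline) as described above; the expansion table, the sample word and the function names are hypothetical and only illustrate the idea, not the project's actual workflow.

```python
import unicodedata

COMBINING_OVERLINE = "\u0305"

# Hypothetical expansion table: abbreviated form -> expanded form.
EXPANSIONS = {
    "dn\u0305s": "dominus",
}


def words_with_abbreviation_marks(text: str, mark: str = COMBINING_OVERLINE) -> list[str]:
    """List the whitespace-separated words that carry the abbreviation mark."""
    return [word for word in text.split() if mark in word]


def expand_abbreviations(text: str, expansions: dict[str, str]) -> str:
    """Replace every abbreviated form with its expansion.

    Both the text and the table keys are normalized to NFC first, so that
    equivalent sequences of combining characters compare equal.
    """
    text = unicodedata.normalize("NFC", text)
    for abbreviated, expanded in expansions.items():
        text = text.replace(unicodedata.normalize("NFC", abbreviated), expanded)
    return text


sample = "dn\u0305s abbas"
print(words_with_abbreviation_marks(sample))
print(expand_abbreviations(sample, EXPANSIONS))
```

Listing the marked words first gives an editor the chance to check which abbreviations occur before deciding on expansions, since the same sign can stand for different letters depending on the word.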
For two reasons, it was decided not to integrate dictionaries to try to enhance the accuracy of the model. First, medieval texts tend to be full of variants: the same word can occur several times in the same text with different spellings. Second, in the cartulary, as in other medieval documents, Latin and the vernacular (in this case Middle High German) are mixed. Despite the lack of a dictionary, the HTR model was still able to recognise these documents at a high level of accuracy.
In the future, we hope to be able to create general models that can be applied to regular handwriting as found in medieval books and charters. All that is needed is a large amount of training data from different medieval documents. So, come join us and start to train your own HTR model!
By Tobias Hodel, University of Zurich.