A collaboration between the Bentham Project at University College London (UCL) and the DEEDS (Documents of Early England Data Set) Project at the University of Toronto uses Transkribus to transcribe an immense corpus of medieval charters dating from the 12th to the 15th century. The handwritten Latin of this period is highly distinctive and confronted the team with two interesting questions:
- Could Transkribus be trained to process abbreviated Latin words consistently? Such words can make up as much as half the vocabulary of medieval legal texts and therefore feature in a substantial proportion of the DEEDS corpus.
- Could Transkribus consistently recognise hyphenated words that span multiple lines of text (insofar as they are both in Latin and abbreviated)?
To find answers, the team first created its own dictionary of more than a hundred abbreviated Latin words, each recorded in both its abbreviated and its expanded form. This was done with the help of Ismail Prada, an independent programmer from Switzerland, who wrote abbrevSolver-master, a Python script. Each contracted form was represented by compatible special characters that best reflect how it appears in type. The abbreviations were also categorized as prefixes, suffixes, or standalone abbreviations, which determines how the algorithm processes them.
However, the method proved problematic: several versions of the tab-separated Excel file containing the abbreviated words, with several varieties of special characters, had to be created in an attempt to get the script to function as intended. For the time being, the only way forward was to find and replace the abbreviated words manually, without the script. This was very time-intensive and not viable in the long run.
With Prada’s help, however, the script was fixed, and a superior API script was developed that connects directly to Transkribus once it is given the collection editor’s username and password and the collection ID. The new script is quicker and simpler to use. After a basic command is run, the script communicates with Transkribus and applies its find-and-replace algorithm to each subcollection, replacing each term it finds from the abbreviation dictionary with its shorter equivalent and tagging it as abbreviated.
At this stage of the project, five new HTR models were created. Over the course of the project, both the word error rate (WER) and the character error rate (CER) declined in a very promising way, and the models generated after the new script was created perform extremely well.
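The dictionary-driven replacement described above can be illustrated with a minimal Python sketch. It is not the project's abbrevSolver script and does not talk to Transkribus: the column names, the three sample entries, the special characters chosen, and the functions `load_rules` and `abbreviate` are all assumptions made for illustration. It only shows how a tab-separated dictionary with prefix, suffix, and standalone categories might drive a find-and-replace pass over a transcription.

```python
import csv
import io
import re

# Hypothetical dictionary entries (not taken from the project's file):
# expanded form, abbreviated form, and category, separated by tabs.
SAMPLE_TSV = (
    "expanded\tabbreviated\tcategory\n"
    "per\tꝑ\tprefix\n"
    "us\tꝰ\tsuffix\n"
    "et\t⁊\tstandalone\n"
)


def load_rules(tsv_text):
    """Turn each dictionary row into a (compiled pattern, abbreviation) pair."""
    rules = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        word = re.escape(row["expanded"])
        if row["category"] == "prefix":
            # Match only at the start of a longer word.
            pattern = rf"\b{word}(?=\w)"
        elif row["category"] == "suffix":
            # Match only at the end of a longer word.
            pattern = rf"(?<=\w){word}\b"
        else:
            # Standalone: match the whole word only.
            pattern = rf"\b{word}\b"
        rules.append((re.compile(pattern), row["abbreviated"]))
    return rules


def abbreviate(text, rules):
    """Replace expanded forms with abbreviations; return text and hit count."""
    total = 0
    for pattern, short in rules:
        # A function replacement avoids any backslash processing of `short`.
        text, hits = pattern.subn(lambda match, s=short: s, text)
        total += hits
    return text, total


if __name__ == "__main__":
    rules = load_rules(SAMPLE_TSV)
    print(abbreviate("et dominus permittit", rules))
```

Treating the three categories as different regular-expression anchors keeps a prefix rule such as "per" from firing on the standalone word "per". Tagging each replacement as an abbreviation, which the project's script does inside Transkribus, is left out of this sketch.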
Additionally, the research team used material from Oxford University and Christ Church to expand the ground truth further and was able to create two more models, which improved the test results on the DEEDS corpus. Along the way, obstacles such as poor image quality and the brevity of the images made development more difficult. Nevertheless, model #7 is now freely available to everyone. More than 140,000 words were used for training, and the CER on the validation set is 0.8%. For more details about the project and the models, visit the project website: https://blogs.ucl.ac.uk/transcribe-bentham/2021/04/20/ucl-university-of-toronto-transkribus-htr-and-medieval-latin-abbreviations/
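For readers unfamiliar with the error rates quoted here, the following Python sketch shows the textbook definitions of CER and WER as an edit distance divided by the length of the reference transcription. It is an illustration only: the function names are hypothetical, and Transkribus may normalise or tokenise text differently when it reports these figures.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    reference_words = reference.split()
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


if __name__ == "__main__":
    print(cer("carta", "corta"))
    print(wer("sciant presentes et futuri", "sciant presentes et futuro"))
```

Under this definition, a CER of 0.8% corresponds to roughly one character edit per 125 characters of reference text.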