We are happy to present one of our public models: the “Noscemus GM v1” model, released by Stefan Zathammer as part of the Innsbruck-based project NOSCEMUS (Nova Scientia: Early Modern Scientific Literature and Latin). This model can read texts set in Antiqua-based typefaces from the 16th, 17th and 18th centuries, outperforming most standard OCR engines. Although it is tailored to transcribing (Neo-)Latin texts, it also provides convincing results for other languages such as French, Italian or English. The Noscemus model can therefore help not only Neo-Latinists but anyone whose research deals with large text corpora from the Early Modern Period.
The model is based on about 1,000 pages of training data taken from the project’s Digital Sourcebook. To keep the model as flexible as possible, standardizations in the transcription process were kept to a minimum. Normalizations were made only in the following cases: ligatures (e.g. ae, oe, ct, ff) and abbreviations (e.g. -que, -us, -tur, …mm…) were expanded, long s (ſ) was transcribed as a normal s, and small caps were transcribed as majuscules.
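For anyone who wants to compare their own transcriptions with the model's output, the conventions above can be approximated with a small script. This is only a minimal sketch: the mapping table and the function name `normalize_transcription` are hypothetical and are not part of the project's tooling. It covers only those ligatures that have their own Unicode code points, plus the long s. Abbreviations, the ct ligature and small caps are left out because they depend on how the source text is encoded.

```python
# Hypothetical normalization table, loosely following the conventions
# described above. It is an illustration, not the NOSCEMUS project's code.
NORMALIZATION_TABLE = str.maketrans({
    "\u00e6": "ae",  # ae ligature (æ)
    "\u0153": "oe",  # oe ligature (œ)
    "\ufb00": "ff",  # ff ligature (ﬀ)
    "\u017f": "s",   # long s (ſ)
})


def normalize_transcription(text: str) -> str:
    """Expand the listed ligatures and replace long s with a normal s."""
    return text.translate(NORMALIZATION_TABLE)


# "cæleſtis eﬀectus" written with escapes so the code points are explicit.
print(normalize_transcription("c\u00e6le\u017ftis e\ufb00ectus"))
# prints: caelestis effectus
```

Applying the same normalization to a reference transcription before comparing it with the model's output avoids counting these purely typographic differences as recognition errors.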
Even though the model already provides good results, the project is still dealing with a few issues: there are some remaining inconsistencies in the transcription of quotation marks, and the error rate for the transcription of Greek words or passages is still high. To a lesser degree, the same applies to (German) Fraktur.
We hope that the Noscemus model will make transcription life easier for many of you. For all those working on different kinds of documents, don’t forget to have a look at the other models we have recently been able to publish thanks to our hard-working users. You can find an overview of all our public models in this document: https://transkribus.eu/wiki/images/d/d6/Public_Models_in_Transkribus.pdf