It’s hard to believe it but we’ve entered the third year of the READ project! Like most of the world, we’re using January as an appropriate time to look back on some of our achievements from the past year and think about where we will be headed in the upcoming months.
Research is integral to the READ project and our research teams make it possible for Transkribus users to apply machine learning to achieve the automated recognition, transcription and searching of historical documents. Techniques of Handwritten Text Recognition, Image Enhancement, Layout Analysis, Document Understanding and Writer Identification are all being refined in READ and the results of this research are being integrated into the Transkribus workflow. To find out more about the research underpinning READ project technology, have a look at our Deliverables (reports submitted to the European Commission) under Work Packages 6 and 7. Competition is also an important part of life for our computer scientists; official contests provide one of the most effective means of testing and improving the latest technological innovations. READ researchers are at the forefront of the field and have enjoyed successes in notable competitions. The READ project has also launched ScriptNet, a new platform where computer scientists can participate in existing competitions or organise their own.
The Transkribus platform is at the heart of the READ project and it provides a complete and reliable workflow for the training of Handwritten Text Recognition models and the automated transcription and searching of historical documents. The Transkribus developers at the University of Innsbruck are continually fine-tuning the tool, adding new functionalities, fixing bugs and dealing with user requests. Bravo!
Archives, libraries, individual scholars and research teams from across the globe have been working with Transkribus to recognise and transcribe diverse collections, including challenging material such as medieval texts or Arabic scripts. In the best cases, Transkribus can produce an automated transcript with a Character Error Rate of 5% (meaning that roughly 95% of the characters in a given transcript are generated correctly by the computer). Check out our latest success stories to find out more about our most significant results! Transkribus users can now access publicly available models capable of transcribing eighteenth- and nineteenth-century documents written in German or English with respectable levels of accuracy. In 2018, we hope to make more models publicly available so users can easily try out the technology on different scripts and languages.
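For readers curious about what a Character Error Rate means in practice, here is a minimal sketch in Python. It assumes the common definition of CER: the number of single-character insertions, deletions and substitutions needed to turn the automated transcript into the correct one, divided by the length of the correct transcript. The function names and the sample text are our own illustrations and do not show how Transkribus itself calculates the figure.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character edits."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative example (invented text): a 20-character line in which the
# recognition output gets exactly one character wrong.
correct_line = "the quick brown fox!"
automated_line = "the quick brovn fox!"
cer = character_error_rate(correct_line, automated_line)
print(f"CER: {cer:.0%}")
```

In this invented example, one wrong character out of twenty gives a CER of 5%. Because extra characters inserted by the computer also count as errors, "95% correct" is a handy rule of thumb rather than an exact equivalence.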
Two exciting new features have recently been made available in Transkribus, both based on technology developed by the CITlab team at the University of Rostock. Automated Layout Analysis is now more accurate, meaning that even complicated pages can be divided into lines automatically with considerable precision. Keyword Spotting is a completely new searching tool in Transkribus. This is a sophisticated form of keyword searching where the technology analyses images of the writing itself, rather than searching through transcriptions generated by humans or computers. This tool has the potential to facilitate the searching of vast collections of documents that have not yet been transcribed.
In addition to the Transkribus expert client, READ is building an assortment of new research tools that make use of the same technology. READ developers have created a beta version of Transkribus web, a lite version of Transkribus which allows users to transcribe documents online. The Computer Vision Lab at the Technical University Vienna have built functional tools for digitising documents with a mobile phone. Available now, DocScan is a free Android mobile app for taking high-quality photos of documents. The Computer Vision Lab have also created prototypes of the ScanTent, a piece of equipment designed to hold a mobile phone in place above a document. With DocScan and the ScanTent, users will be able to digitise documents on demand and use these images in Transkribus for automated processing or further research. Another test-run of ScanTents will be manufactured in 2018 – you can register your interest at the ScanTent website.
We also have new websites to showcase our technology in the form of Transkribus Learn and FamousHands. The former is an e-learning app where users can practice transcribing documents and hone their paleography skills. The latter is a public collection of images of the handwriting of famous individuals (including Hans Christian Andersen, Nikola Tesla and Diana Princess of Wales). These images can be used as a starting point for Writer Identification technology.
Collaboration is one of the most rewarding aspects of the READ project and we are building a global user network. More than 8000 people have registered for a Transkribus account, 64 institutions and projects have signed a Memorandum of Understanding with us and 80 highly-engaged users came together at the first Transkribus User Conference, which was held in Vienna in November 2017. Our broad user group provides us with a huge variety of documents including Swedish folklore, Italian music, medieval charters and university records. Our machine learning algorithms are strengthened with every piece of training data that is submitted to our platform. Put simply – the more users, the better the technology!
We continue to spread the word about the READ project online and offline: at regular workshops and big conferences like the International Medieval Congress in Leeds and Digital Humanities 2017 (last year in Montreal), as well as through our blog, wiki page, Twitter account and YouTube channel. Give us a like and a follow if you can 🙂 Traditional media is starting to take notice of the READ project too – we have recently been on national TV in Finland and national radio in Serbia.
In terms of our research outputs, we are working to ensure that, where possible, our project publications are Open Access, our research tools are Open Source via GitHub and our published research data is available in Zenodo.
Our research teams will continue to work at the cutting edge of the field of Automated Text Recognition. The Transkribus platform will be maintained and updated, as will new tools like the ScanTent and Transkribus Learn. We look forward to seeing how READ project technologies cope with the challenging documents that interest both new and existing users of Transkribus and we are confident that our innovations can facilitate discoverability and research on a grand scale. And we look forward to seeing you at the second Transkribus User Conference, which will take place later in 2018 – announcement coming soon!
Want to find out more?
You can find more detailed summaries of the work that READ has completed in these different areas by taking a look at the latest reports (deliverables) that we have submitted to the European Commission.