2017 – a good year for the READ project!

January 24, 2018
News, Transkribus

It’s hard to believe, but we’ve entered the third year of the READ project! Like most of the world, we’re using January as an appropriate time to look back on some of our achievements from the past year and think about where we will be headed in the upcoming months.

Research

Research is integral to the READ project and our research teams make it possible for Transkribus users to apply machine learning to achieve the automated recognition, transcription and searching of historical documents. Techniques of Handwritten Text Recognition, Image Enhancement, Layout Analysis, Document Understanding and Writer Identification are all being refined in READ and the results of this research are being integrated into the Transkribus workflow. To find out more about the research underpinning READ project technology, have a look at our Deliverables (reports submitted to the European Commission) under Work Packages 6 and 7. Competition is also an important part of life for our computer scientists; official contests provide one of the most effective means of testing and improving the latest technological innovations. READ researchers are at the forefront of the field and have enjoyed successes in notable competitions. The READ project has also launched ScriptNet, a new platform where computer scientists can participate in existing competitions or organise their own.

Services

The Transkribus platform is at the heart of the READ project and it provides a complete and reliable workflow for the training of Handwritten Text Recognition models and the automated transcription and searching of historical documents. The Transkribus developers at the University of Innsbruck are continually fine-tuning the tool, adding new functionalities, fixing bugs and dealing with user requests. Bravo!

Archives, libraries, individual scholars and research teams from across the globe have been working with Transkribus to recognise and transcribe diverse collections, including challenging material such as medieval texts or Arabic scripts. In the best cases, Transkribus can produce an automated transcript with a Character Error Rate of 5% (meaning that 95% of the characters in a given transcript would be generated correctly by the computer). Check out our latest success stories to find out more about our most significant results! Transkribus users can now access publicly available models capable of transcribing eighteenth-to-nineteenth-century documents written in German or English with respectable levels of accuracy. In 2018, we hope to make more models publicly open so users can easily try out the technology on different scripts and languages.
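For readers curious about how a Character Error Rate might be calculated, here is a minimal sketch in Python. It assumes the commonly used definition (the number of character insertions, deletions and substitutions needed to turn the automated transcript into the correct one, divided by the length of the correct transcript). The function names are our own illustrations and this is not necessarily how Transkribus computes the figure internally.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning hypothesis into reference."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcript (assumed definition)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative strings only, not real project data.
correct = "historical documents"
automated = "historicol documents"  # one wrong character out of 20
print(f"CER: {character_error_rate(correct, automated):.0%}")
```

In this made-up example, one wrong character in a 20-character line gives a Character Error Rate of 5%.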

Two exciting new features have recently been made available in Transkribus – on the basis of technology developed by the CITlab team at the University of Rostock. Automated Layout Analysis is now more accurate, meaning that even complicated pages like the one below can be divided into lines automatically with considerable precision. Keyword Spotting is a completely new searching tool in Transkribus. This is a sophisticated form of keyword searching where the technology analyses images of writing, rather than searching through transcriptions of these words generated either by humans or computers. This tool has the potential to facilitate the searching of vast collections of documents that have not yet been transcribed.

Document segmented into lines using prize-winning CITlab technology. Cologny, Fondation Martin Bodmer, Cod. Bodmer 28, f. 1r – Latin Bible (available via e-codices: http://www.e-codices.unifr.ch/en/list/one/fmb/cb-0028) [Image released under CC-BY-NC licence]

In addition to the Transkribus expert client, READ is building an assortment of new research tools that make use of the same technology. READ developers have created a beta version of Transkribus web, a lite version of Transkribus which allows users to transcribe documents online. The Computer Vision Lab at the Technical University Vienna have built functional tools for digitising documents with a mobile phone. Available now, DocScan is a free Android mobile app for taking high-quality photos of documents using a mobile phone. The Computer Vision Lab have also created prototypes of a ScanTent, a piece of equipment designed to hold a mobile phone in place above a document. With DocScan and the ScanTent, users will be able to digitise documents on demand and use these images in Transkribus for automated processing or further research. Another test-run of ScanTents will be manufactured in 2018 – you can register your interest at the ScanTent website.

Markus Diem (Computer Vision Lab, Technical University Vienna) demonstrates the ScanTent to the undersecretary of state of Austria (Muna Duzdar) and the executive committee member of the Public Service Union Austria (GÖD) (Dr. Norbert Schnedl).

We also have new websites to showcase our technology in the form of Transkribus Learn and FamousHands. The former is an e-learning app where users can practice transcribing documents and hone their paleography skills. The latter is a public collection of images of the handwriting of famous individuals (including Hans Christian Andersen, Nikola Tesla and Diana Princess of Wales). These images can be used as a starting point for Writer Identification technology.

Network

Collaboration is one of the most rewarding aspects of the READ project and we are building a global user network. More than 8000 people have registered for a Transkribus account, 64 institutions and projects have signed a Memorandum of Understanding with us and 80 highly-engaged users came together at the first Transkribus User Conference, which was held in Vienna in November 2017. Our broad user group provides us with a huge variety of documents including Swedish folklore, Italian music, medieval charters and university records. Our machine learning algorithms are strengthened with every piece of training data that is submitted to our platform. Put simply – the more users, the better the technology!

We continue to spread the word about the READ project online and offline: at regular workshops and big conferences like the International Medieval Congress in Leeds and Digital Humanities 2017 (last year in Montreal), as well as through our blog, wiki page, Twitter account and YouTube channel. Give us a like and a follow if you can 🙂 Traditional media is starting to take notice of the READ project too – we have recently been on national TV in Finland and national radio in Serbia.

Maria Kallio speaks at the International Medieval Congress on using Transkribus to make a digital edition of records from a Brigittine monastery

In terms of our research outputs, we are working to ensure that, where possible, our project publications are Open Access, our research tools are Open Source via GitHub and our published research data is made available on Zenodo.

What’s next?

Our research teams will continue to work at the cutting edge of the field of Automated Text Recognition. The Transkribus platform will be maintained and updated, as will new tools like the ScanTent and Transkribus Learn. We look forward to seeing how READ project technologies cope with the challenging documents that interest both new and existing users of Transkribus and we are confident that our innovations can facilitate discoverability and research on a grand scale. And we look forward to seeing you at the second Transkribus User Conference, which will take place later in 2018 – announcement coming soon!

Want to find out more?

You can find more detailed summaries of the work that READ has completed in these different areas by taking a look at the latest reports (deliverables) that we have submitted to the European Commission.
