Transkribus at the Bibliotheca Hertziana – Max Planck Institute for Art History

Digital Publications are the latest addition to the DH Lab of the Bibliotheca Hertziana – Max Planck Institute for Art History in Rome (https://www.biblhertz.it); their goal is to publish the Institute’s Open Access content online.

Since many art-historical sources are antique books, and our institute’s library has devoted considerable resources to scanning its “Rara” book collection (http://dlib.biblhertz.it), it was natural to look for a way to access them not only as digitized images but also as transcribed text. This will allow authors to quote them directly, and will also improve cross-referencing, content checking and accessibility for people relying on text-to-speech (TTS) tools.

Older books present several challenges to standard OCR, the technology normally used to recognize printed text. Not only are some characters and ligatures difficult to train (consider the slight difference between the letter “f” and the long s “ſ”, or the use of “u” for “v” in lowercase and “V” for “U” in capitals), but there are also abbreviations and symbols with a special meaning. In fact, especially in the fifteenth and sixteenth centuries, most printed books used the same scribal abbreviations that were common in manuscripts.

This means that approaching the transcription one character at a time, as OCR does, would produce a very high error rate and leave no way to search for abbreviated words. By contrast, although HTR is intended for handwriting, it can be trained to take context into account, so that it expands abbreviations and distinguishes between letters that look, or are, identical.
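
To make the search problem concrete, here is a minimal Python sketch of a search key that folds the long s, treats “u” and “v” (and “i” and “j”) as the same letter, and expands a few abbreviation signs. It is only a toy illustration, not how Transkribus or an HTR model works: the function name `search_key` and the tiny `ABBREVIATIONS` table are assumptions made for this example, and real expansions depend on language and context, which is exactly what a trained model has to learn.

```python
# Illustrative table only: a real edition needs a far larger,
# context-dependent set of expansions.
ABBREVIATIONS = {
    "\u204a": "et",   # Tironian et sign
    "\ua751": "per",  # p with a stroke through the descender
    "\ua753": "pro",  # p with a flourish
}


def search_key(word: str) -> str:
    """Return a normalised form of `word` for matching in a search index."""
    text = word.lower()
    # Expand the abbreviation signs listed in the table above.
    for sign, expansion in ABBREVIATIONS.items():
        text = text.replace(sign, expansion)
    # Fold the long s into a round s.
    text = text.replace("\u017f", "s")
    # Early printed books use u/v and i/j interchangeably, so the key
    # collapses each pair into a single letter.
    return text.replace("v", "u").replace("j", "i")


# A period spelling and its modern form end up with the same key.
print(search_key("Vniuer\u017fo") == search_key("universo"))
```

A fixed table like this breaks down as soon as one sign has several possible readings, which is why a context-aware model is the better fit for these books.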

For this reason, we teamed up with READ-COOP and planned a complete neural text recognition of our existing digitisations. The goal is to create new models that can not only transcribe all the content but also recognize a book’s main structure: extracting the list of images, distinguishing between main text and commentaries, and more. The transcriptions will be available in the IIIF viewer and also in an online Read&Search platform. There, together with the digitized books from the Kunsthistorisches Institut in Florence and the Max-Planck-Institut für Wissenschaftsgeschichte in Berlin, they can be searched and analysed through machine learning for data mining.

Another project relying on Transkribus is a digital edition of manuscripts, which requires the tagging of information alongside the manual transcription of the content. Thanks to the easy tag management available in the Expert client, the team can work together to edit the text, insert semantic information and identify relevant named entities such as people, places, dates or artworks mentioned in the text. Thanks to direct TEI export or XSLT conversion, the digital edition can be created with almost no further post-processing.
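
As a rough idea of what can be done with tagged text once it is in TEI, the following Python sketch lists the named entities in a small TEI fragment using only the standard library. The fragment is an invented placeholder, and the assumption that entities appear as the TEI elements `persName`, `placeName` and `date` is ours; the markup actually produced by a given export or XSLT conversion may differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented placeholder fragment, not taken from any real edition.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>Letter from <persName>Example Person</persName> in
<placeName>Roma</placeName>, dated
<date when="1650-01-01">1 January 1650</date>.</p>
</body></text></TEI>"""

# Assumed element names for the entities of interest.
ENTITY_TAGS = ("persName", "placeName", "date")


def list_entities(tei_xml: str) -> list:
    """Return (element name, text) pairs for tagged entities, in document order."""
    root = ET.fromstring(tei_xml)
    prefix = "{" + TEI_NS + "}"
    found = []
    for element in root.iter():
        # ElementTree writes namespaced tags as "{namespace}localname".
        if not element.tag.startswith(prefix):
            continue
        local_name = element.tag[len(prefix):]
        if local_name in ENTITY_TAGS:
            found.append((local_name, "".join(element.itertext()).strip()))
    return found


for kind, value in list_entities(SAMPLE):
    print(kind, value)
```

The same loop could feed an index of people, places and dates for the edition, provided the element names are adjusted to the markup the project really uses.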

Before starting this project, I was already using Transkribus for my own research, and now I encourage Hertziana researchers to use it as much as possible when accessing content is important, or when working on a digital edition.
