Understanding the application of Transkribus in research

By Joe Nockels, University of Edinburgh

As part of his PhD research at the University of Edinburgh and National Library of Scotland (NLS), Joe Nockels undertook a systematic review of how Transkribus was mentioned in research published between 2015 and 2020. A total of 381 papers were gathered from Google Scholar, Scopus, and Web of Science. Here are a few of the findings:

Published research mentioning Transkribus is international and growing.

The number of works mentioning Transkribus more than doubled between 2017 and 2018, rising from 39 to 92 (an increase of roughly 136%, or about 2.4 times as many). 2019 saw another increase, with the number of works rising to 112. With the analysis ending in October 2020, understandably fewer works were returned for that year, although the amount of research was still sizeable at 99 works. Transkribus’s rise in mentions alludes to a shift in the collection and curatorial landscapes, with memory institutions opting for the digitisation and recognition of their materials en masse (Chassanoff 2013; Duff et al. 2004). It is reasonable to assume that this has only increased during and after the Covid-19 pandemic, which inadvertently provided a window for archives and libraries to prioritise digital projects, delivering services through multiple channels for those without access to buildings.
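The year-on-year figures above can be checked with simple arithmetic. What follows is a minimal sketch in Python that uses only the yearly counts quoted in this post; the function and variable names are illustrative and do not come from the original study. It reports both the relative increase and the ratio between consecutive years, two quantities that are easy to confuse.

```python
# Yearly counts of works mentioning Transkribus, as quoted in this post.
# The 2020 figure only covers the period up to October.
works_per_year = {2017: 39, 2018: 92, 2019: 112, 2020: 99}


def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent (positive means growth)."""
    return (new - old) / old * 100


def percent_of_previous(old: int, new: int) -> float:
    """New count expressed as a percentage of the old count."""
    return new / old * 100


years = sorted(works_per_year)
for prev, curr in zip(years, years[1:]):
    old, new = works_per_year[prev], works_per_year[curr]
    print(
        f"{prev} -> {curr}: {old} -> {new} works, "
        f"change {percent_change(old, new):+.1f}%, "
        f"{percent_of_previous(old, new):.1f}% of the previous year"
    )
```

Keeping the two functions separate makes it explicit whether a reported percentage describes growth over the previous year or the size of the new total relative to the old one.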

English was the most common language of Transkribus research (67.98%, n = 259). Transkribus research appears more multilingual than the general content in Scopus, where 88.4% of all results, and 77% of arts and humanities materials, appeared in English in 2013 (Van Weijen 2013). More recent studies have shown that the prevalence of Standard English is on the rise in scholarly research, irrespective of field, estimating that 98% of publications in science are written in English, causing researchers from English as a Foreign Language (EFL) countries to sound the alarm that their contributions are being inhibited (Flowerdew 2013; Ramirez-Castaneda 2020). We should remain mindful of this trend, although English materials mentioning Transkribus sit alongside works in Dutch, Spanish, Swedish, Bosnian, Russian, Norwegian, Polish, Italian, Croatian, Hungarian, Czech and even Maori, showcasing the diversity of those researching Transkribus, as well as the broad array of their research.

Transkribus research (2015-2020) plotted using Digimap’s roam feature. Location markers identify the lead researchers’ affiliation. The Innsbruck server is marked with a yellow pin.

Plotting the returned works geographically, we can see that materials mentioning Transkribus cluster around the servers held at Innsbruck, with German institutions the most represented. That said, new research avenues have also opened up further afield, particularly in West Asia.

Members of the READ-COOP play a large role in publishing research using and mentioning Transkribus.

118 works (30.97%) came from READ-COOP institutions. This highlights that the COOP is becoming a strong space for like-minded institutions to feature Transkribus, presenting results and increasing its capabilities as a tool.

Chart highlighting the research output of institutional READ-COOP members by number of publications.

Transkribus is mainly mentioned in journal articles (42.78%, n = 163), conference papers (21.00%, n = 80) and policy documents (8.14%, n = 31). Research using and mentioning the software also appeared in greyer materials such as book sections, undergraduate and postgraduate theses, reports, presentations, blog posts, magazines and video recordings. Out of the entire dataset, 71.13% of works were accessible online under an open copyright license (n = 271), replicating to an extent the atmosphere of collegiality and cooperation seen with the establishment of the shareholding model of the READ-COOP. This is a positive sign, enabling more research to be done through the sharing of publications, models and methodologies.
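The percentages in this section are all shares of the 381 works in the dataset. As a rough illustration, the short Python sketch below recomputes them from the counts quoted in this post; the names used are illustrative only and are not taken from the original study.

```python
# Total number of works in the dataset, as stated at the top of this post.
TOTAL_WORKS = 381

# Counts by publication type, as quoted in this post (the three largest types only).
publication_types = {
    "journal articles": 163,
    "conference papers": 80,
    "policy documents": 31,
}

# Works available online under an open license. This count cuts across
# publication types, so it is kept separate from the breakdown above.
open_license_works = 271


def share(count: int, total: int = TOTAL_WORKS) -> float:
    """Share of the dataset in percent, rounded to two decimal places."""
    return round(count / total * 100, 2)


for label, count in publication_types.items():
    print(f"{label}: n = {count}, {share(count):.2f}%")

print(f"open license: n = {open_license_works}, {share(open_license_works):.2f}%")
```

The three publication types listed do not sum to the full dataset, because the remaining works fall into the greyer categories mentioned above.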

Transkribus features primarily in archival and library science publications, while a broad range of other disciplines, such as history, computer science, citizen science, law and education, use Handwritten Text Recognition (HTR) to a lesser extent. This shows the wide applicability of the HTR tool, which puts a host of useful functions at researchers’ disposal: improving collection descriptions, information retrieval and the recognition of historical documents. Transkribus is useful regardless of users’ concept of data-driven methods. A librarian may want to use HTR not to produce full transcriptions but to keyword spot metadata across collections, improving access to historical materials. By contrast, individual researchers tend to use Transkribus to produce rich data that is ‘replete with enough specifics that they may operationalize that data in pursuit of their research goals’. In computer science, research into Transkribus is more dependent on gaining predictable and regularised results (such as a certain level of character/word error rate) (Lincoln 2017, p 30). Transkribus is flexible enough to meet all these specific needs.

Since 2017 especially, research into Transkribus has been undertaken in a variety of fields. While archival science, information science, and computer science remain the dominant disciplines, we found that work has been published across the arts, humanities and social sciences (AHSS). Fields that were represented in our corpus included religious studies, publishing, history, theatre studies, philosophy, management science, and medieval studies.

Taking all these findings into account, Transkribus appears to be what Thylstrup (2019) describes as a ‘bottom-up’ mass digitization movement, made up of hundreds of simultaneous projects driven by motivated researchers. This provides a firm base for the READ-COOP, increasing the accuracy of Transkribus by supplying ground truth data and trained models. As Transkribus grows, more research will inevitably be produced and new rhythms will emerge in the approaches of researchers. Carrying out subsequent analyses of the literature will allow this to be tracked and understood. Through such a structure, the evidential value of user experiences with HTR can be harnessed and collected, helping to develop Transkribus in a sustainable and useful manner.

For the full paper see: https://link.springer.com/article/10.1007/s10502-022-09397-0

Chassanoff A (2013) Historians and the use of primary source materials in the digital age. Am Arch 76:458–480. https://doi.org/10.17723/aarc.76.2.lh76217m2m376n28

Duff W, Craig B, Cherry J (2004) Historians’ use of archival sources: promises and pitfalls of the digital age. Public Hist 26:1–10

Flowerdew J (2013) Some thoughts on English for Research Publication Purposes (ERPP) and related issues. Cambridge University Press, Cambridge

Lincoln M (2017) Ways of forgetting: the librarian, the historian, and the machine. In: Padilla T, Allen L, Frost H, Potvin S, Russey RE, Varner S (eds) Always already computational: library collections as data. Institute of Memory and Library Services, National Forum Positional Statements, pp 20–30. https://collectionsasdata.github.io. Accessed 20 Nov 2020

Ramirez-Castaneda V (2020) Disadvantages in preparing and publishing scientific papers caused by the dominance of the English language in Science: the case of Colombian researchers in biological sciences. Paper presented at PLoS One, Kyoto, 16 Sept 2020. https://doi.org/10.1371/journal.pone.0238372

Thylstrup NB (2019) The politics of mass digitization. MIT Press, Cambridge

van Weijen D (2013) Publication languages in the arts & humanities. Res Trends 32:1–10
