Last update of this guide 04/09/2020
This guide provides detailed instructions for transcribing in Transkribus, with guidance on features such as abbreviations, diacritics and tags.
Users should move on to this guide after first consulting our basic transcription guide: How to Transcribe Documents with Transkribus – Introduction.
Over the past few years, our transcription guidelines have been revised and simplified in line with our findings about what Handwritten Text Recognition (HTR) technology can learn to process.
This guide provides standardised instructions for transcribing historical documents in Transkribus.
Our aim is to help users produce transcripts quickly and efficiently, as the basis for strong HTR models that can recognise text with a high level of accuracy.
Users may want to produce a simple transcription that will only be used to train HTR technology to recognise their documents.
In this case, the most important consideration is to produce a consistent transcript that accurately represents the words in your document.
Alternatively, users may want to produce a rich transcription with additional tags and metadata that would be suitable for a scholarly edition.
This guide sets out common historical conventions for scholarly transcription that users can adapt according to their needs and the specific features of their documents. You do not necessarily have to change existing transcriptions that you have already completed. You may also find other effective ways of dealing with transcription issues that are not covered in this guide.
Transkribus users will also soon be able to transcribe documents in the Transkribus web interface, where it will be easier to transcribe documents in teams.
Before you begin, please check that you are working with the latest version of the Transkribus expert client:
Figure 1 Checking for the latest version of Transkribus
The text needs to be transcribed character by character, according to what is shown in the image. Since uniform spelling rules did not exist in the past, orthographic and grammatical correctness is of secondary importance.
Words should be separated or combined according to the original text, even if it is not in accordance with the current practice.
Again, the original text should be the basis for your decision. If an initial letter cannot be clearly identified as upper or lower case (majuscule or minuscule), the decision is up to you, but should be based on the current spelling rules.
When hyphenated words appear at the end of the line, they should be transcribed and broken up according to the original text. They no longer need to be marked with a “-” or a “tick”.
When hyphenated words appear in the middle of a line, they should be transcribed according to the original text.
Passages of text that have been struck through should be marked up using the “Tag as strikethrough” button in the Text Editor field.
Superscript text passages (including punctuation marks) should be marked up as superscript with the “Tag as superscript” button in the Text Editor field.
Punctuation should be transcribed using the keys on your keyboard, keeping as close as possible to the original.
For documents of the 16th century and later: Transcription should follow the original text, even if a punctuation mark has been used in a way that does not correspond to modern usage.
For transcriptions of medieval texts: Do not try to use modern punctuation. It is better either to omit all punctuation or to use specific symbols (e.g. Middle Dot, U+00B7).
Full stops often appear after numbers and headings – and they should be transcribed.
Sometimes historical documents use “/:” instead of brackets. In these cases, the “/:” should be
Underlined text passages are marked up with the “Tag as underlined” button in the Text Editor field.
Different fonts like Kurrent or Antiqua are not specially marked.
Additions between the lines are segmented as separate lines and transcribed normally, but do not have to be specially marked. What is important is that the addition is placed in the correct place in the text according to the reading order. In such cases the reading order might need to be checked and revised.
The reading order should be as follows, according to the natural reading order of a human reader:
To show the reading order options click on the “Shape visibility” button in the Main menu.
Figure 2 Checking the reading order
Select “Show baselines reading order” and numbers corresponding to each baseline will appear on your document image. Click on the number associated with a baseline to update its place in the reading order.
When interlineal additions appear, the correct reading order is: Text – Addition – Text.
Figure 3 Changing the reading order
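The Text – Addition – Text rule can be pictured with a small sketch. It models a page as a plain Python list in which the position of each line is its place in the reading order. The function name `move_line`, the 0-based indices and the sample lines are illustrative assumptions; this is not the Transkribus data model or API.

```python
def move_line(lines, old_index, new_index):
    """Return a copy of `lines` with one line moved to a new place.

    The list position stands in for the reading order (0-based here,
    which is an assumption of this sketch).
    """
    reordered = list(lines)           # leave the caller's list untouched
    line = reordered.pop(old_index)   # take the line out of its old place
    reordered.insert(new_index, line) # put it back where it should be read
    return reordered


# Hypothetical page: segmentation placed the interlinear addition last.
segmented = ["text before the addition", "text after the addition", "addition"]

# Move the addition between the two text lines: Text - Addition - Text.
corrected = move_line(segmented, old_index=2, new_index=1)

for position, text in enumerate(corrected):
    print(position, text)
```

In the expert client the same correction is made by clicking the number next to a baseline, as described above.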
Normally abbreviations are transcribed according to the original text, i.e. not expanded. This includes historical abbreviations and abbreviations which are still used today (e.g. contemporary currency indications, titles and salutations). An abbreviation should be expanded only if the expansion involves no more than one or two additional letters.
Figure 4 Example of an abbreviation: word with a nasal abbreviation on the m or n: Zim̄er
Simple transcriptions: diacritical characters (e.g. accents, circumflexes, cedillas, hyphens, tildes) can be ignored, except for the modern German umlauts.
More elaborate transcriptions: diacritical characters are transcribed according to the written characters on the page.
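For simple transcriptions, the rule above could be applied to text that already contains diacritics, for example when cleaning older transcripts before training. The following is a minimal sketch using Python's standard `unicodedata` module. The function name `strip_diacritics` and the `KEEP` set are assumptions made for illustration and are not part of Transkribus.

```python
import unicodedata

# Modern German umlauts, which the simple convention keeps.
KEEP = set("äöüÄÖÜ")


def strip_diacritics(text: str) -> str:
    """Remove combining marks from `text`, except on German umlauts."""
    result = []
    # NFC first, so that "u" + combining diaeresis becomes a single "ü".
    for char in unicodedata.normalize("NFC", text):
        if char in KEEP:
            result.append(char)
            continue
        # NFD splits a letter from its marks; drop only the marks.
        decomposed = unicodedata.normalize("NFD", char)
        result.append(
            "".join(c for c in decomposed if not unicodedata.combining(c))
        )
    return "".join(result)


print(strip_diacritics("Müller café"))
```

Note that this only drops the mark: a nasal stroke such as the one in Figure 4 is removed, not expanded into a second letter.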
The letters “i” and “j” can be used interchangeably. Again, the original text should help you make your decision. The two letters are transcribed as such, even if it does not correspond to the rules of modern orthography. Since they are often difficult to distinguish from each other (especially with capital letters), it is your own discretion or the spelling in use today that is decisive here.
Alternative practice: It is viable to only use “i” except for consonantal use of the letter.
The historical equivalents of “u” and “v” no longer exist because the letters are now used separately. Therefore, please adapt the transcription according to current usage.
Alternative practice: It is possible to use “u” and “v” as they would be read.
Ligatures are common combinations of letters to form a new character.
“St” and “Sch” ligatures and the ligatures at the end of words or abbreviations should be transcribed in full. They do not need to be marked up as abbreviations.
For example, the ligature “præs” should be transcribed as “praes”.
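Where existing transcripts contain ligature characters, the expansion described above could be automated. This is a minimal sketch: the mapping table is a small illustrative assumption, not a complete list, and the choice of “Ae”/“Oe” for capitals is likewise an assumption that should be adapted to your documents.

```python
# Illustrative mapping of single ligature characters to their full spelling.
LIGATURES = {
    "\u00e6": "ae",  # æ
    "\u00c6": "Ae",  # Æ (capitalisation is an assumption)
    "\u0153": "oe",  # œ
    "\u0152": "Oe",  # Œ (capitalisation is an assumption)
    "\ufb06": "st",  # ﬆ, the "st" ligature
}

# str.maketrans accepts a dict from single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character listed in LIGATURES with its letters."""
    return text.translate(_TABLE)


print(expand_ligatures("præs"))
```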
The letter “s” can appear in different forms. Normal and long “s” (with descender) can both be transcribed as a normal “s”, or according to their shape as “s” or “ſ” (U+017F). Double “s” or “ß” (sharp “s” or “Eszett”) are transcribed according to the original text.
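Whichever convention is chosen for the long “s”, it should be applied consistently. The sketch below shows one way to check a transcript for the long “s” character (U+017F) and to normalise it to a normal “s”. The function names are hypothetical; “ß” is deliberately left untouched, since it follows the original text.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def uses_long_s(text: str) -> bool:
    """Report whether the transcript contains any long s characters."""
    return LONG_S in text


def normalize_long_s(text: str) -> str:
    """Replace every long s with a normal s; all other characters stay."""
    return text.replace(LONG_S, "s")


sample = "Wa\u017f\u017fer und Straße"
print(uses_long_s(sample))
print(normalize_long_s(sample))
```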
There are options for marking up the structure of your documents in the “Metadata/Structural” tab. The structure of the text is assigned at baseline level during the process of segmentation. You can mark up elements including page numbers, headings and marginalia and also train this layout with the P2PaLA tool.
You can find more information about this in the How to Use the Structural Tagging Feature guide.
Personal names, places or locations, dates of various kinds as well as organizations, institutions or abstract identities can be marked with the corresponding tags.
You can find all of the tags in the “Metadata/Textual” tab.
Note: When it comes to HTR training, tags are not relevant yet. Developments in Named Entity Recognition technology should make the automated recognition of tags possible in the future.
tagged as such.
impersonal label (e.g. “the baker”), then this indefinite name is not marked.
Please mark the edited or completed pages with the corresponding attribute in the status bar in the Main menu.
Figure 5 Defining the status of the document
The following statuses can be assigned:
In Progress: pages still to be transcribed
Done: pages that have been transcribed but which still need review.
Final: transcribed pages that have been reviewed as “Final”
Ground Truth: transcribed pages that are completely finalised by the project administrator as “Ground Truth” data suitable for HTR training. Once this status has been assigned to a page, it should no longer be changed.
We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.
Transkribus is made available to the public as part of H2020 e-Infrastructure Project READ (Recognition and Enrichment of Archival Documents) which received funding from the European Commission under grant agreement No 674943.