This is a short introduction to marking up and exporting tables and to semi-automatic table processing using Transkribus and nomacs. Segmenting printed or hand-drawn tables with the Table Editor in Transkribus adds graphical lines to your image and assigns a tabular structure to the layout of your documents. It also allows you to export your transcriptions as a Microsoft Excel spreadsheet. This guide applies to images in one Transkribus document that follow the same table print or template.
Printed and hand-drawn tables are common in historical documents of all types. Such tables can be marked up in Transkribus, either as the first stage in creating training data for Automated Text Recognition or simply to ready the documents for manual transcription.
Currently, tables must be manually drawn using the Table Editor in Transkribus. Technology which will allow the automated recognition of tables is under development and will be made available to users soon.
Often multiple pages follow the same table print or table template. In that case, the table mark-up only has to be done for the first occurrence of the print and can then be distributed to the remaining pages using the nomacs toolkit.
The first section of this guide describes the manual creation of a table structure in Transkribus and the transcription of the text it contains. The second section gives instructions on working with table templates created in Transkribus and on applying them to several pages using batch processing in the nomacs tool.
Finally, this document also explains how tables can be exported for further data processing in standard spreadsheet tools.
Create text regions
- First, create text regions for any information not belonging to the table.
This refers to information at the top, bottom or the sides of the page which is clearly not part of the table such as:
- Page numbers
- Line numbers
- Any other markings or annotations
- For more information on creating text regions, see the “Segmentation” section of How To Transcribe Documents with Transkribus – Introduction.
Create the table
- Select the “Add other item” button in the Canvas menu and then click “Add a table”
- Click on the top left corner of the table in the image and then click on the bottom right corner
Segment the table
You can now segment your table into rows and columns
- To begin, make sure you are in “Selection mode”. Press the “ESC” key on your keyboard or click the “Selection mode” button in the Main menu.
- Click on the table region that you have created.
- To create rows, click the “Splits a shape with a horizontal line” button in the Canvas menu.
- Move your cursor across the page and click wherever you want to create a horizontal line.
- To create columns, click the “Splits a shape with a vertical line” button in the Canvas menu.
- Move your cursor across the page and click wherever you want to create a vertical line.
- Continue until all table cells are marked.
Note: Depending on the layout of your table, you might want to treat the spine of the book as an extra column (as in Figure 1). You can also mark up this column at table-cell level using the “book-binding” tag in the “Metadata/structural” tab.
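The segmentation steps above can be pictured as cutting one rectangle into a grid. The following is a minimal sketch of that idea only, not the way Transkribus stores or computes tables. The names `Cell` and `split_table` and all coordinates are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell: grid position plus pixel rectangle (hypothetical model)."""
    row: int
    col: int
    left: int
    top: int
    right: int
    bottom: int


def split_table(left, top, right, bottom, row_splits, col_splits):
    """Cut the table rectangle at the given y (rows) and x (columns) positions."""
    # Split positions outside the table rectangle are ignored.
    ys = [top] + sorted(y for y in row_splits if top < y < bottom) + [bottom]
    xs = [left] + sorted(x for x in col_splits if left < x < right) + [right]
    cells = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells.append(Cell(r, c, xs[c], ys[r], xs[c + 1], ys[r + 1]))
    return cells


# Two horizontal lines and one vertical line give 3 rows x 2 columns.
cells = split_table(100, 200, 900, 800, row_splits=[400, 600], col_splits=[300])
print(len(cells))
```

Each horizontal line adds one row and each vertical line adds one column, which is why the number of cells grows quickly with only a few clicks.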
Copy the table format from one page to another
If the table layout of several pages is similar, it is possible to transfer the table format from one page to other pages. To do this, proceed as follows:
- Prepare the table layout as mentioned above
- Open “other segmentation tools” via the “Canvas”-menu
- Choose “Copy regions (texts or tables) to other pages”
- Define the pages the layout should be copied to in the appearing window.
- To actually run the tool, unselect “Dry run”.
- Confirm with “OK” and the table layout will be copied to the indicated pages.
- The position of the table on a page may need to be corrected. To do so, select the whole table and move it while holding down “Ctrl” + “Shift” on your keyboard.
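Conceptually, copying a table format means reusing the same cell outlines on other pages and, where needed, shifting them by a fixed offset. The sketch below illustrates this under stated assumptions: cells are plain lists of (x, y) corner points, the function names are hypothetical, and the `dry_run` flag only mimics the idea of a trial run that changes nothing. It is not the Transkribus implementation.

```python
def shift_polygon(polygon, dx, dy):
    """Move every (x, y) corner of a cell polygon by the same offset."""
    return [(x + dx, y + dy) for x, y in polygon]


def copy_layout(template_cells, target_pages, offsets=None, dry_run=True):
    """Copy the template cells to each target page.

    offsets maps a page number to a (dx, dy) correction; pages without an
    entry receive an unshifted copy. With dry_run=True nothing is copied and
    only the sorted list of pages that would be changed is returned.
    """
    offsets = offsets or {}
    if dry_run:
        return sorted(target_pages)
    layouts = {}
    for page in target_pages:
        dx, dy = offsets.get(page, (0, 0))
        layouts[page] = [shift_polygon(cell, dx, dy) for cell in template_cells]
    return layouts


# A template with a single rectangular cell, copied to pages 2 and 3.
template = [[(100, 200), (300, 200), (300, 400), (100, 400)]]
preview = copy_layout(template, [2, 3])  # dry run: lists pages only
layouts = copy_layout(template, [2, 3], offsets={3: (10, -5)}, dry_run=False)
print(preview, layouts[3][0][0])
```

Here page 3 receives the same cell moved 10 pixels right and 5 pixels up, which corresponds to correcting the table position by hand after copying.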
Correct the cells in the table
In some cases, it may be necessary to merge cells together in order to reflect cells spanning multiple rows or columns.
- Make sure you are in “Selection mode” by pressing the “ESC” key on your keyboard or clicking the “Selection mode” button in the Main menu.
- To select cells to merge, hold down the “CTRL/CMD” key on your keyboard and then click on the relevant cells in your table.
- Click the “Merges the selected shapes” button in the Canvas menu.
- Continue with all cells until the expected structure is achieved. In the example below, merging must be completed for each of the highlighted sets of cells.
If you want the table segmentation to match the image as closely as possible, it may also be necessary to correct the shapes of some of the cells in your table. The green segmentation lines should then follow the lines of your table as closely as possible. To do so:
- Select the table cell you wish to edit
- Click and drag the big green dots to move the position of the lines
Note: For export and automatic processing, straight lines and rectangular cells close to the original table borders are perfectly sufficient.
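The cell merging described in this section can be thought of as replacing several grid cells with one cell that spans multiple rows or columns. The following sketch assumes a hypothetical representation of a cell as a `(row, col, row_span, col_span)` tuple and assumes that a merged cell must be rectangular. Both are illustrative choices, not documented Transkribus behaviour.

```python
def merge_cells(selected):
    """Merge cells given as (row, col, row_span, col_span) into one spanning cell."""
    if not selected:
        raise ValueError("no cells selected")
    # Collect every grid position covered by the selected cells.
    covered = set()
    for row, col, row_span, col_span in selected:
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                covered.add((r, c))
    top = min(r for r, _ in covered)
    left = min(c for _, c in covered)
    height = max(r for r, _ in covered) - top + 1
    width = max(c for _, c in covered) - left + 1
    # The selection must fill its bounding rectangle completely.
    if len(covered) != height * width:
        raise ValueError("selected cells do not form a rectangle")
    return (top, left, height, width)


# Two neighbouring cells in the header row become one cell spanning two columns.
merged = merge_cells([(0, 0, 1, 1), (0, 1, 1, 1)])
print(merged)  # (0, 0, 1, 2)
```

A selection such as two diagonal neighbours is rejected in this sketch because it would leave gaps inside the merged cell.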
Add graphical information
Cell borders (graphical lines) need to be marked when they are visible.
- Right-click on the cell that you wish to mark up
- Click “Mark-up borders” in the pop-up menu or use the button to open the border markup menu
- Choose the correct options to describe the border of the cell
Note: You can choose multiple cells at once with “Select all cells” or “Select row cells”. To add a cell to the selection or remove it again, hold down the “CTRL/CMD” key and click on that cell.
The next step is to add baselines to your table. The baselines should reflect the logical flow of text and can therefore run over the cell borders if necessary.
- You can either draw the baselines by hand or use the automatic baseline detection tools in Transkribus. When using the Layout Analysis to automatically detect the baselines, please make sure to unselect “Find text regions”.
Note: The line finding tool created by the Computational Intelligence Technology Lab at Universität Rostock is currently the most effective for the automatic recognition of baselines in tables. In the “Layout Analysis” section of the “Tools” tab, select “Method: CITlab Advanced”.
- If you detect baselines automatically, you may need to correct the generated lines or move them to the correct cell
- You may also want to check the reading order and correct your baselines. For more information on adding and correcting baselines, see the “Segmentation” section of How To Transcribe Documents with Transkribus – Introduction.
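When checking whether a detected baseline sits in the right cell, one simple rule of thumb is to look at the cell that contains the midpoint of the baseline. The sketch below illustrates that rule only. The cell names, rectangles and the function `find_cell` are hypothetical, and Transkribus may assign lines to cells differently.

```python
def baseline_midpoint(points):
    """Return the centre of the bounding box of a baseline's (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


def find_cell(points, cells):
    """Return the name of the cell containing the baseline midpoint, or None.

    cells maps a cell name to its (left, top, right, bottom) rectangle.
    """
    mx, my = baseline_midpoint(points)
    for name, (left, top, right, bottom) in cells.items():
        if left <= mx < right and top <= my < bottom:
            return name
    return None


cells = {"r0c0": (0, 0, 100, 50), "r0c1": (100, 0, 200, 50)}
print(find_cell([(110, 40), (190, 42)], cells))  # r0c1
```

A baseline whose midpoint lies outside every cell returns `None`, which in practice would be a line to review by hand.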
Correct baselines stretching more than one cell
You may find that the automatic layout tool strictly obeys the cell borders, so that baselines stretching across multiple cells are divided into partial baselines. You can use the merging tool to combine them: first move the partial baselines to the same cell, then select them and use the merge tool.
- Open the “Layout” tab.
- Click on the first cell in the image where your baselines should be placed. This will highlight the respective position in the structure tree.
- Expand the arrows to display the Line elements.
- Select the lines you want to move from the tree holding down the Ctrl key
- Drag the Lines to the correct cell
- Use the merge tool to fix the layout of the lines
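Merging partial baselines amounts to joining their points into one line that runs in reading direction. The following is a minimal sketch assuming left-to-right text and baselines given as lists of (x, y) points. The function name `merge_baselines` is hypothetical and the code does not reproduce the Transkribus merge tool.

```python
def merge_baselines(parts):
    """Join partial baselines into one, ordered from left to right."""
    # Order the parts by their leftmost x coordinate.
    ordered = sorted(parts, key=lambda pts: min(x for x, _ in pts))
    merged = []
    for pts in ordered:
        # Within a part, points are also ordered from left to right.
        merged.extend(sorted(pts))
    return merged


# Two fragments of the same text line that were split at a cell border.
right_part = [(300, 50), (400, 52)]
left_part = [(100, 50), (280, 51)]
print(merge_baselines([right_part, left_part]))
```

The order in which the fragments are selected does not matter in this sketch, because they are sorted by position before being joined.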
Transcribe the table headers
Especially with pre-printed forms or tables, the headers remain the same over several pages. It is therefore worth transcribing them once in the table template: any transcribed information contained in the template will automatically be carried over by the table-matching tool.
Example results of table markup
Train column structure with P2PaLA
The P2PaLA-training feature can be used to train columns of the tables in your documents.
Before starting the training the tables need to be prepared:
- Draw a text region for each column.
- Assign structural tags to these regions via the “Metadata” and “Structural” tabs. The structural tagging guideline explains how this is done.
- It is important that every column has its own structure type.
- To speed up the creation of training data, you can copy the layout to the following pages as described above.
This approach is especially useful if you are interested in specific columns of the document rather than in all columns of the tables.
Transcribe the table
- Transcribe the text of your table exactly as it appears in the image
- Click on a cell in your table to start transcribing and then move through the other cells in your table
- If you are transcribing text as training data for Automated Text Recognition, the reading order of your transcription is not important
- If you are transcribing text for research purposes, you may want to adjust the reading order of the baselines
- You can also run Automated Text Recognition (an HTR model) on your segmented document. For more information, see How To Train A Handwritten Text Recognition Model in Transkribus.
- In the next section of the guide, you will find out how to create a table template that can recur across several images in your document.
Once you have segmented and transcribed a page, you can export the results of your transcribed tables into XLS format.
- Click the “Export Document” button in the Main menu
- At the top of the box, select the location where you would like your exported files to be saved
- In the “Choose export format” section, select “Table Export into Excel”
- At the bottom right of the box, make sure that you select the pages that you wish to export, for example a single page.
Note: Only tables and their contents are exported; text regions are ignored. If your selection of pages contains no tables, Transkribus will show you an error message and stop the export process.
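For further processing, it can help to picture the exported table as a grid of text values addressed by row and column. The sketch below builds such a grid from hypothetical `(row, col, text)` cells and writes it as CSV using only the Python standard library. It is an illustration of the data layout, not the Transkribus Excel export, and the sample values are invented.

```python
import csv
import io


def cells_to_rows(cells):
    """Arrange (row, col, text) cells in a rectangular grid of strings."""
    if not cells:
        # Mirrors the idea that an export without tables cannot proceed.
        raise ValueError("no table cells to export")
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, text in cells:
        grid[row][col] = text
    return grid


def rows_to_csv(rows):
    """Serialise the grid as CSV text that spreadsheet tools can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


cells = [(0, 0, "Name"), (0, 1, "Year"), (1, 0, "Anna"), (1, 1, "1850")]
print(rows_to_csv(cells_to_rows(cells)))
```

Cells that were never transcribed simply stay empty in the grid, so the row and column structure is preserved.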
We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.
Transkribus is made available to the public as part of the H2020 e-Infrastructure Project READ (Recognition and Enrichment of Archival Documents) which received funding from the European Commission under grant agreement No. 674943.