Last update of this guide: 04/09/2020
This is a short introduction to marking up and exporting tables as well as to the semi-automatic table processing using Transkribus and nomacs. Segmenting printed or hand-drawn tables using the Table Editor in Transkribus will add graphical lines into your image and assign a tabular structure to the layout of your documents. It also allows you to export your transcriptions as a Microsoft Excel spreadsheet. This guide applies to images in one Transkribus document following the same table print or template.
Download the Transkribus Expert Client, or make sure you are using the latest version:
Transkribus and the technology behind it are made available via the following projects and sites:
Printed and hand-drawn tables are common in historical documents of all types. Such tables can be marked up in Transkribus, either as the first stage in creating training data for Automated Text Recognition or simply to ready the documents for manual transcription.
Currently, tables must be manually drawn using the Table Editor in Transkribus. Technology which will allow the automated recognition of tables is under development and will be made available to users soon.
Often, multiple pages follow the same printed table or table template. In that case, the table mark-up only has to be created once, for the first occurrence of the template, and can then be transferred to the remaining pages using the nomacs toolkit.
The first section of this guide describes the manual creation of a table structure in Transkribus and the transcription of the text it contains. The second section explains how to work with table templates created in Transkribus and how to apply them to several pages using batch processing in the nomacs tool.
Finally, this document also explains how tables can be exported for further data processing in standard spreadsheet tools.
You can now segment your table into rows and columns
Note: Depending on the layout of your table, you might want to treat the spine of the book as an extra column (as in Figure 1). You can also mark up this column at table-cell level using the “book-binding” tag in the “Metadata/structural” tab.
In some cases, it may be necessary to merge cells together in order to reflect cells spanning multiple rows or columns.
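The effect of merging can be pictured with a small data model. The following is only an illustrative sketch, not the internal representation used by Transkribus: the `Cell` class, its `row_span`/`col_span` fields, and the `to_grid` helper are hypothetical names introduced here. It shows how a merged cell occupies several positions of the underlying row/column grid.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical table cell: top-left position plus the span it covers."""
    row: int
    col: int
    text: str = ""
    row_span: int = 1
    col_span: int = 1


def to_grid(cells, n_rows, n_cols):
    """Expand cells (including merged ones) into a plain row/column grid."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        # A merged cell fills every grid position inside its span.
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                grid[r][c] = cell.text
    return grid


# A header cell merged across two columns, above two ordinary cells.
cells = [Cell(0, 0, "Name", col_span=2), Cell(1, 0, "a"), Cell(1, 1, "b")]
grid = to_grid(cells, n_rows=2, n_cols=2)
```

In this sketch the merged header appears in both columns of the first row, which is one simple way to think about a cell that spans several columns.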
If you need a very precise table segmentation, it may also be necessary to correct the shapes of some of the cells in your table, so that the green segmentation lines correspond to the lines of your table as closely as possible. In order to do so,
Note: For export and automatic processing, having straight, rectangular lines close to the original table borders is perfectly sufficient.
Cell borders (graphical lines) need to be marked when they are visible.
Note: You can select multiple cells at once by choosing “Select all cells” or “Select row cells”. To add individual cells to the selection or remove them from it, hold down the Ctrl (command) key and click on the cell.
The next step is to add baselines to your table. The baselines should reflect the logical flow of text and can therefore run over the cell borders if necessary.
Note: The line finding tool created by the Computational Intelligence Technology Lab at Universität Rostock is currently the most effective for the automatic recognition of baselines in tables. In the “Layout Analysis” section of the “Tools” tab, select “Method: CITlab Advanced”.
You may find that the automatic layout tool strictly obeys the cell borders when it is run on table cells, so baselines stretching across multiple cells are divided. You can use the merging tool to combine those partial baselines. If you want to merge baselines that stretch across more than one cell, first move them into the same cell, then select them and use the merge tool.
In pre-printed forms and tables especially, the headers often remain the same over several pages. Any transcribed information contained in the table template will automatically be carried over by the table-matching tool.
Once you have segmented and transcribed a page, you can export the results of your transcribed tables into XLS format.
Note: Only tables and their contents are exported; text regions are ignored. If your selection of pages contains no tables, Transkribus will show an error message and stop the export process.
Books typically contain many consecutive pages with the same layout. You can use the nomacs toolkit to transfer the segmentation of one template page (table borders and header baselines, including header transcriptions) to subsequent pages.
In order to use the semi-automated methods, follow the segmentation section of this document to mark up the basic structure of your table on the first page.
Your template page (usually the first page showing the typical table layout and structure) should contain the following elements:
Note: With the existing version of the table matching tool, only one table per page is allowed. This restriction will be discarded in the future.
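Before running the table matching, it can be useful to check that a page really contains only one table. The sketch below counts table elements in the XML of a single page. It rests on an assumption not stated in this guide: that the exported page XML marks each table with an element named `TableRegion`. The function name `count_tables` and the sample XML (including its namespace URL) are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_tables(page_xml_text, table_tag="TableRegion"):
    """Count elements named `table_tag` in a page XML string.

    Assumption: tables are stored as `TableRegion` elements. The XML
    namespace prefix is stripped so the check works with any namespace.
    """
    root = ET.fromstring(page_xml_text)
    return sum(
        1 for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == table_tag
    )


# Made-up sample page with one table and one ordinary text region.
sample = (
    '<PcGts xmlns="http://example.org/page">'
    '<Page><TableRegion id="t1"/><TextRegion id="r1"/></Page>'
    '</PcGts>'
)
n_tables = count_tables(sample)
```

A page for which this count is greater than one would not fit the one-table-per-page restriction described in the note above.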
Note: The template matching is based on the recurrent layout of the table. Thus, e.g. manually drawn row separators can result in a different number of rows and should not be marked. You can run automatic layout analysis only on the header cells. You can do so by right-clicking on the first header and selecting “Select row cells” from the context menu. Untick the “Find Text Regions” in the “Layout Analysis” section of the “Tools” tab and the baselines of the selected cells will be found.
Note: If your document contains hand-drawn table cell borders, we recommend leaving the non-header table rows in the template as one large, artificial table row. The number of rows on each page might vary, and this would affect the table matching process and its quality. Instead, add horizontal lines to your tables after the batch processing has finished.
In order to work with the nomacs tool, you need to export your Transkribus document first.
Note: In order to synchronize the results of your batch processing back to Transkribus, the filenames must remain unchanged.
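A quick way to confirm that no filenames have changed is to compare the exported folder with the folder holding the batch-processing results. This is a minimal sketch using only the Python standard library. The function name `changed_names`, the two folder arguments, and the `.xml` suffix are assumptions made for illustration; adapt them to wherever your export and results actually live.

```python
from pathlib import Path


def changed_names(exported_dir, processed_dir, suffix=".xml"):
    """Compare file names in two folders.

    Returns (missing, unexpected): names present only in the export,
    and names present only in the processed results.
    """
    exported = {p.name for p in Path(exported_dir).glob(f"*{suffix}")}
    processed = {p.name for p in Path(processed_dir).glob(f"*{suffix}")}
    missing = sorted(exported - processed)
    unexpected = sorted(processed - exported)
    return missing, unexpected
```

If both returned lists are empty, every exported file has a result with the same name, which is the condition the note above asks for.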
Note: The threshold parameters allow the table matching to be more “flexible”; this means that rows or columns can have varying width, etc. A value of 0 means that the row/column width and height must match perfectly with the template. Higher values allow a greater variation.
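The idea behind such a threshold can be illustrated with a small comparison function. This is not the algorithm used by nomacs, and the units and exact meaning of its parameters are not described in this guide. The sketch simply assumes a relative tolerance: with `threshold=0` the widths must be identical, and larger values accept larger deviations. The name `widths_match` is hypothetical.

```python
def widths_match(template_widths, page_widths, threshold):
    """Check column widths against a template with a relative tolerance.

    Assumption: `threshold` is a fraction of the template width
    (0 means exact match, 0.1 allows a deviation of up to 10 %).
    """
    if len(template_widths) != len(page_widths):
        return False
    return all(
        abs(page - template) <= threshold * template
        for template, page in zip(template_widths, page_widths)
    )


exact = widths_match([100, 200], [105, 200], threshold=0)
flexible = widths_match([100, 200], [105, 200], threshold=0.1)
```

Here the first call rejects the page because one column is 5 units wider than the template, while the second call accepts it because the deviation stays within the allowed variation.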
At the time of writing, nomacs and Transkribus are not directly connected, so you have to synchronize your new XML files back to Transkribus.
You can easily do so in Transkribus by clicking ☰ in the main menu and choosing “Document” and then “Sync local transcriptions with doc…”.
Once you have finished synchronizing your table layout, you can add the remaining layout structures and correct table borders where necessary.
Once your table segmentation and table cell segmentation are finished, you can add baselines and transcriptions. You can also export your table into Excel format. For further details, check the corresponding sections of this guide.
We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.
Transkribus is made available to the public as part of the H2020 e-Infrastructure Project READ (Recognition and Enrichment of Archival Documents) which received funding from the European Commission under grant agreement No. 674943.