Newspapers are an invaluable source of information for historians. Not only do they provide a chronicle of events described by people living at the time, but they also allow researchers to study long-term trends, from the number of cholera outbreaks in a certain city to public opinion on a certain topic.
Previously, a researcher who wanted to access a collection of newspapers first had to go to the library or archive where it was stored and search the collection manually for the information they needed. Technology such as Transkribus has revolutionised this process. Transkribus uses AI to automatically transcribe documents such as newspapers and create digital versions of them. These digital versions can then be searched for specific keywords, making it much easier for historians to find what they are looking for.
However, unlike other materials such as letters or books, newspapers present certain challenges for text recognition platforms. In this post, we would like to look at how best to transcribe newspapers with Transkribus, so that you can achieve the most accurate transcriptions possible.
Why are newspapers so difficult for Transkribus?
In general, text recognition platforms find it easier to transcribe printed texts than handwritten texts. So in theory, the printed text in newspapers shouldn’t be too difficult to transcribe.
However, it is not the text that makes newspapers a challenge, but the layout. Before Transkribus can start with the Text Recognition, it first performs a Layout Recognition: in other words, it detects which parts of the page contain text and where those individual lines of text start and end. It is these blocks and lines of text that are then transcribed. That means that if the Layout Recognition is done incorrectly, then Transkribus won’t know which parts of the page to transcribe and will therefore produce an inaccurate transcription.
Let’s take some real-life examples. In the document below, the text is organised in one big block and into regular lines. This kind of layout is pretty easy for Transkribus to recognise and so the Layout Recognition looks like this:
© Diary of Marjory Fleming, National Library of Scotland, Public domain, via National Library of Scotland
As you can see, each line of text has been correctly underlined with a blue line. Because the Layout Recognition is accurate, Transkribus knows exactly which parts of the text to transcribe and is, therefore, able to produce an accurate transcription:
© Diary of Marjory Fleming, National Library of Scotland, Public domain, via National Library of Scotland
A newspaper, however, has a much more complicated layout. The text isn’t just in one block but is divided into several blocks in several columns, along with headlines, the price and other irregular elements. This kind of layout is much harder for Transkribus to detect, and so the Layout Recognition can end up looking like this:
© Berliner Tageblatt – 1927-04-05, Staatsbibliothek zu Berlin, Public domain, via Europeana
Instead of neatly underlining each line of text, the blue lines are spread haphazardly across the text at strange angles. It is clear that Transkribus does not know where the text actually is on the page, and it therefore won’t be able to provide an accurate transcription, as the image below shows:
© Berliner Tageblatt – 1927-04-05, Staatsbibliothek zu Berlin, Public domain, via Europeana
This transcription is of little use to a historian. However, it was created using Transkribus’ default Layout Recognition settings. By changing those settings, we can achieve much better results.
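The “strange angles” described above can also be checked numerically. The following is a minimal sketch, not part of Transkribus itself. It assumes you have the baseline coordinates of each line as a PAGE XML-style points string (`x1,y1 x2,y2 ...`) and that the text runs horizontally from left to right. The function names, the line IDs and the 10-degree threshold are illustrative choices.

```python
import math


def parse_points(points):
    """Parse a points string such as "100,200 900,204" into (x, y) tuples."""
    coords = []
    for pair in points.split():
        x, y = pair.split(",")
        coords.append((int(x), int(y)))
    return coords


def baseline_angle(points):
    """Angle in degrees between the horizontal and the line joining
    the first and last point of a baseline."""
    coords = parse_points(points)
    (x1, y1), (x2, y2) = coords[0], coords[-1]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def suspicious_baselines(baselines, max_angle=10.0):
    """Return the IDs of baselines that are tilted by more than max_angle."""
    return [
        line_id
        for line_id, points in baselines.items()
        if abs(baseline_angle(points)) > max_angle
    ]


# Hypothetical example data: one nearly horizontal line, one steep one.
example = {
    "line_1": "100,200 900,204",
    "line_2": "100,260 400,700",
}
print(suspicious_baselines(example))  # ['line_2']
```

A page on which many lines are flagged is a good candidate for re-running the Layout Recognition with different settings. Pages that legitimately contain vertical or rotated text would also be flagged by this simple check.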
How to improve Layout Recognition with newspapers
Making the Layout Recognition more suitable for newspapers is a two-step process. First, you need to detect the page structure with the Printed Block Detection method. Then you need to manually configure the Layout Recognition settings so that they can recognise newspapers more effectively.
Please note: normally, the Layout Recognition is done automatically as part of the Text Recognition process. To do both of the steps above, you need to run the Layout Recognition as a separate step before doing the Text Recognition, as described in the instructions below.
Printed Block Detection
The Printed Block Detection method is a way of manually showing Transkribus where the individual blocks of text are on a page. In the case of a newspaper, each block normally contains one article. To run the Printed Block Detection method, you need to:
- Select the page(s) you want to transcribe.
- Click “Layout Recognition” on the left-side menu.
- Select the Printed Block Detection method and start the recognition. This will divide your page into several blocks, as shown in the video below.
- You can then manually adjust the blocks to ensure that they exactly fit the layout of the page.
Changing the Layout Recognition settings
Once Transkribus has successfully detected the blocks of text on the page, you can then run the Layout Recognition in full:
- Select the page(s) you want to transcribe.
- Select “Text Recognition” from the left-side menu.
- Select “Layout” from the drop-down menu at the top.
- Click “Public Models” and select “Mixed Text Line Orientation”.
- Click “Configure” and change the settings as shown below.
- You can then manually adjust the lines so that they correctly underline each line of text.
Setting | Value |
Generation of Text Regions (Layout Blocks) | Keep existing |
Image Scaling | Upscale |
(click on Baseline Options) | |
Minimal Baseline Length | Low |
Baseline Accuracy Threshold | High |
Use Trained Separators | No |
Max distance for merging baselines | Medium |
Split Lines on Regions border | Yes |
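If you experiment with these settings, it can help to keep a record of what you tried. The sketch below simply stores the values from the table above in a Python dictionary and reports how a trial configuration differs from them. It does not communicate with Transkribus in any way; the names `RECOMMENDED_SETTINGS` and `diff_settings` are made up for illustration, and the alternative value in the example is hypothetical.

```python
# The settings from the table above, kept as plain text labels.
RECOMMENDED_SETTINGS = {
    "Generation of Text Regions (Layout Blocks)": "Keep existing",
    "Image Scaling": "Upscale",
    "Minimal Baseline Length": "Low",
    "Baseline Accuracy Threshold": "High",
    "Use Trained Separators": "No",
    "Max distance for merging baselines": "Medium",
    "Split Lines on Regions border": "Yes",
}


def diff_settings(trial):
    """Return {setting: (recommended, trial)} for every value that differs."""
    return {
        name: (recommended, trial[name])
        for name, recommended in RECOMMENDED_SETTINGS.items()
        if name in trial and trial[name] != recommended
    }


# Hypothetical trial: the same settings with one value changed.
trial = {**RECOMMENDED_SETTINGS, "Minimal Baseline Length": "Medium"}
print(diff_settings(trial))
# {'Minimal Baseline Length': ('Low', 'Medium')}
```

Noting the difference alongside the test pages you ran it on makes it easier to return to a combination that worked.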
The video below shows these steps in full:
How your newspaper should look after the improved Layout Recognition
Using Transkribus’ default Layout Recognition settings with a newspaper produced blue squiggles all over the page. However, by following the steps above, Transkribus was able to recognise the layout of the newspaper and mark each block and line of text correctly:
© Berliner Tageblatt – 1927-04-05, Staatsbibliothek zu Berlin, Public domain, via Europeana
Now that Transkribus actually knows where the text is on the page, it can transcribe it properly. Remember this transcription?
© Berliner Tageblatt – 1927-04-05, Staatsbibliothek zu Berlin, Public domain, via Europeana
It now looks like this:
© Berliner Tageblatt – 1927-04-05, Staatsbibliothek zu Berlin, Public domain, via Europeana
Of course, some post-editing may be required depending on the material. But in general, following these steps should give you an automatic transcription of a high enough standard for most research purposes.
Additional tips and tricks
There are a few other things you can do to make newspaper transcriptions easier.
- Ensure you have good-quality images. In general, the better the quality of the image, the better the quality of the transcription. If your scans are blurry or have marks or other “noise” on them, we recommend making new scans in good lighting conditions.
- In some cases, it can also help to double the size of your scans before uploading them to Transkribus.
- The Layout Recognition settings described above were the ones that we found to be most effective for the majority of newspapers. However, depending on your specific newspaper, it may be worth trying different configurations of settings to see what works best for your particular layout.
- If you do decide to try out different settings, we recommend doing this on just a few test pages first. Once you have found a combination of settings that works for you, you can then run the Layout Recognition on the whole document or collection.
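Two of the tips above, doubling the size of the scans and trying settings on a few test pages first, can be prepared outside Transkribus. The following is a minimal sketch under stated assumptions: it uses the third-party Pillow library for resizing (an assumption; any image tool would do), and the function names and file names are hypothetical.

```python
from pathlib import Path


def doubled_size(size):
    """Return (width, height) doubled in both dimensions."""
    width, height = size
    return (width * 2, height * 2)


def pick_test_pages(pages, count=3):
    """Pick up to `count` pages spread evenly across the document."""
    pages = list(pages)
    if count <= 0 or not pages:
        return []
    if count >= len(pages):
        return pages
    if count == 1:
        return [pages[0]]
    step = (len(pages) - 1) / (count - 1)
    return [pages[round(i * step)] for i in range(count)]


def upscale_scan(source: Path, target: Path) -> None:
    """Save a copy of `source` at twice the width and height.

    Assumes Pillow is installed; it is imported here so that the
    helpers above can be used without it.
    """
    from PIL import Image

    with Image.open(source) as img:
        img.resize(doubled_size(img.size), Image.LANCZOS).save(target)


# Hypothetical file names for a nine-page document.
pages = [f"page_{n:03d}.jpg" for n in range(1, 10)]
print(pick_test_pages(pages))      # ['page_001.jpg', 'page_005.jpg', 'page_009.jpg']
print(doubled_size((1200, 1800)))  # (2400, 3600)
# upscale_scan(Path("page_001.jpg"), Path("page_001_2x.jpg"))
```

Spreading the test pages across the document, rather than taking the first few, makes it more likely that the sample includes the different layouts in the collection, such as front pages and advertisement pages.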
Further resources
We hope this guide gives you a good insight into how to transcribe newspapers effectively with Transkribus. For more information, check out our page on transcribing newspapers in the Transkribus Help Center.