Printed vs. handwritten text lines – automatically separated

September 26, 2019
Events

The Transkribus team collaborates with the Pattern Recognition team of the University of Erlangen-Nürnberg (also a member of READ-COOP SCE), and our colleagues there kindly ran an interesting experiment: training their classifier to discriminate printed from handwritten text lines automatically. There are two main use cases: (1) Improving recognition results by applying specific HTR models to specific script types. In our experience, however, HTR engines usually cope rather well with a large number of scripts internally, so the actual benefit may be smaller than one would expect. (2) Finding handwritten lines in printed books. For example, if famous persons made notes in their private books, the tool described below will find them with amazing accuracy!

The following text was provided by Matthias Seuret and Vincent Christlein from the Pattern Recognition team and slightly adapted for this post:

The difficulty in classifying text lines as printed or handwritten lies not so much in the use of convolutional neural networks (CNN) or the design of their architecture as in the acquisition and preparation of the data. Indeed, modern artificial neural networks (ANN) are now able to deal with highly complex data (such as ImageNet, which includes 90 different dog breeds to discriminate), and for a large variety of tasks, presenting enough examples to the ANN is sufficient for it to reach a fair accuracy.

It is important to note that ANNs (and other artificial intelligence systems) are extremely biased by the data used to train them. Because of this, the training data should be chosen carefully to ensure that the easiest way to classify the images correctly is to actually solve the task. For example, if all (or most) handwritten text lines are on yellowish paper while printed material is on white paper, then the ANN will simply learn to separate yellow from white, and will answer that any text line printed on yellowish paper is handwritten. Of course, an ANN can learn various other undesired data properties, such as the image resolution and quality, the texture of the paper, or the colour or contrast of the ink. Thus, it is of utmost importance to use training data as similar as possible to what the ANN will have to deal with.

The system we developed for this task is based on the (printed) font group classifier developed for the OCR-D project (http://www.ocr-d.de/). It consists of a DenseNet-121 wrapped in some utility classes, and has been adapted for the binary classification of handwritten and printed text. The DenseNet-121 is a convolutional neural network with 121 layers, most of them grouped in four densely connected blocks. It has, however, a relatively small number of parameters for a network of its size, and thus requires less training data than architectures with more parameters.
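The figure of 121 layers can be checked with a little arithmetic. The sketch below assumes the standard DenseNet-121 configuration (dense blocks of 6, 12, 24 and 16 layers, each layer holding two convolutions); the post itself does not list these numbers, and the function and constant names are ours, not taken from the authors' code.

```python
# Layer count of DenseNet-121, assuming the standard configuration
# (not stated in the post): four dense blocks of 6, 12, 24 and 16 layers.
BLOCK_CONFIG = (6, 12, 24, 16)
CONVS_PER_DENSE_LAYER = 2  # a 1x1 bottleneck convolution followed by a 3x3 convolution


def densenet_depth(block_config=BLOCK_CONFIG):
    """Count the weighted layers: convolutions plus the final classifier."""
    initial_conv = 1  # first convolution applied to the input image
    dense_convs = CONVS_PER_DENSE_LAYER * sum(block_config)
    transition_convs = len(block_config) - 1  # one 1x1 convolution between consecutive blocks
    classifier = 1  # final fully connected layer
    return initial_conv + dense_convs + transition_convs + classifier


print(densenet_depth())  # 121
```

Of the 121 layers, 116 sit inside the four dense blocks, which matches the statement that most of the network is densely connected.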

Machine learning schema for printed vs. handwritten text lines

Text lines are pre-processed in two ways. First, all of them are resized to a height of 150 pixels, with their aspect ratio preserved. This is helpful for the ANN, as it will not have to learn to deal with a large diversity of text sizes. Second, data augmentation methods are applied to the training images. This means that some small modifications, such as shearing or hue modification, are applied to the training images every time they are shown to the neural network during training. The goal is to make the network learn to ignore these variations and perform well on unseen data.
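The two pre-processing steps can be sketched in plain Python. This is an illustration only, not the authors' implementation: the 150-pixel height comes from the text, while the function names and the shear and hue ranges are assumptions chosen for the example.

```python
import random

TARGET_HEIGHT = 150  # line height in pixels, as stated in the text


def resized_size(width, height, target_height=TARGET_HEIGHT):
    """Return (new_width, new_height) with the aspect ratio preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    return max(1, round(width * scale)), target_height


def sample_augmentation(rng, max_shear_deg=5.0, max_hue_shift=0.05):
    """Draw fresh augmentation parameters; the ranges are illustrative assumptions."""
    return {
        "shear_deg": rng.uniform(-max_shear_deg, max_shear_deg),
        "hue_shift": rng.uniform(-max_hue_shift, max_hue_shift),
    }


rng = random.Random(0)
print(resized_size(1200, 60))  # (3000, 150)
print(resized_size(800, 200))  # (600, 150)
for epoch in range(3):
    # New parameters are drawn on every pass, so the same line image
    # looks slightly different each time it is shown to the network.
    print(epoch, sample_augmentation(rng))
```

Only the height is fixed; the width follows from the aspect ratio, so long and short lines keep their natural proportions.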

We trained our network on text lines coming from two different sources. Approximately 40’000 printed text lines were extracted automatically from the dataset presented in “Dataset of Pages from Early Printed Books with Multiple Font Groups” (https://doi.org/10.5281/zenodo.3366685), and 9’577 handwritten samples were provided by READ. In addition, 1’562 text lines from each class were used for testing; none of them came from a page used for the training data. While our network reached a classification accuracy of 97.5% on the test data, one has to keep in mind that this holds true only for this specific data. The source code of our method and the trained CNN, as well as code allowing anybody to easily retrain the CNN on their own data, are available at the following address: https://github.com/seuretm/printed-vs-handwritten
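Keeping test lines on pages that were never used for training is what makes the reported accuracy meaningful. A minimal sketch of such a page-level split and of the accuracy computation is given below; the data layout, the function names and the test fraction are hypothetical, and this is not the code from the repository above.

```python
import random
from collections import defaultdict


def split_by_page(lines, test_fraction=0.2, seed=0):
    """Split (page_id, line_id) pairs so that no page contributes to both sets."""
    by_page = defaultdict(list)
    for page_id, line_id in lines:
        by_page[page_id].append(line_id)
    pages = sorted(by_page)
    random.Random(seed).shuffle(pages)
    n_test = max(1, round(len(pages) * test_fraction))
    test_pages = set(pages[:n_test])
    train = [(p, l) for p in pages if p not in test_pages for l in by_page[p]]
    test = [(p, l) for p in pages if p in test_pages for l in by_page[p]]
    return train, test


def accuracy(predictions, labels):
    """Fraction of lines whose predicted class matches the true class."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Toy data: 10 pages with 5 lines each (hypothetical identifiers).
lines = [(f"page{p}", f"line{i}") for p in range(10) for i in range(5)]
train, test = split_by_page(lines)
print(len(train), len(test))  # 40 10
print({p for p, _ in train} & {p for p, _ in test})  # set()
print(accuracy(["printed", "handwritten", "printed", "printed"],
               ["printed", "handwritten", "handwritten", "printed"]))  # 0.75
```

Splitting by page rather than by line prevents the network from being rewarded for recognising a page's paper colour or scan quality instead of the script type.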

Note: If you are interested in creating training data for this purpose in Transkribus, you can use the “Structural tagging” feature and mark lines as “handwritten” or “printed” in your documents. The classifier itself needs to run outside of Transkribus; however, if there is strong support from the user community, we are happy to include the tool in the Transkribus platform as well.

