In this article, we will take a closer look at one approach to automated handwritten text recognition: PyLaia. It is a successor to Laia, built in 2018 by Joan Puigcerver and Carlos Mocholí. Put simply, it takes an image of text as input and generates the corresponding characters as output. At the core of this algorithm lies a deep neural network, but its components are slightly more sophisticated than those of a simple perceptron (see our intro to neural networks here). The entire model architecture is summarised in the image below, but to understand it we first need to take a closer look at some of the biggest advancements in deep learning in the past decade.
Convolutional Networks
The first stage consists of so-called Conv. Blocks, or convolutional blocks. We already talked about these in another article, but let us take a closer look at what is actually happening. In a convolutional layer, the input image is convolved with a kernel, which is essentially just a small matrix of numbers. Since the raw image data can be thought of as a matrix of numeric pixel values as well, the whole operation boils down to a huge number of multiplications and additions: the kernel slides over the entire image, column by column and row by row, and at each position the overlapping pixel values are multiplied element-wise with the kernel entries and summed to produce one new pixel value. In the old days of image processing, people would manually design these kernels to achieve specific tasks, for example edge detection. The kernel below does exactly that: it detects vertical edges by sweeping over all 3×3 pixel sub-images contained in the original image. Traditionally, this could be the first step in a complicated, manually coded algorithm that tries to find objects in an image. However, this used to be so difficult that even the most sophisticated algorithms could not reliably tell images of cats and dogs apart.
Image source: Wikimedia
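To make the sliding-kernel idea concrete, here is a minimal NumPy sketch. The kernel values are an assumption on our part (a Sobel-style vertical edge detector, which may differ from the one pictured), and the function name `convolve2d` and the toy image are purely illustrative. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    out = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # Take the sub-image currently under the kernel,
            # multiply element-wise and sum up the result.
            patch = image[row:row + kernel_h, col:col + kernel_w]
            out[row, col] = np.sum(patch * kernel)
    return out

# Assumed example kernel: a Sobel-style vertical edge detector.
vertical_edge_kernel = np.array([[-1.0, 0.0, 1.0],
                                 [-2.0, 0.0, 2.0],
                                 [-1.0, 0.0, 1.0]])

# Toy 5x6 "image": dark left half (0), bright right half (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

response = convolve2d(image, vertical_edge_kernel)
print(response)
# Every row of the 3x4 response is [0, 4, 4, 0]: the kernel only
# responds where its window straddles the dark-to-bright edge.
```

In flat regions the positive and negative kernel entries cancel out, so the response is zero; only the windows that contain the vertical edge produce a non-zero value.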
Then, approximately 10 years ago, everything changed. The ImageNet project, which runs one of the largest image recognition contests, was suddenly seeing enormous progress among its top contenders. In this contest, algorithms must assign each image to one of 1,000 object classes, using a dataset of more than a million hand-labelled images. One of the most famous winners, AlexNet, which won in 2012, is a deep neural network that gives probability estimates for every single class. It famously managed to correctly classify 63% of all images (i.e. giving the correct class the highest probability), and for nearly 85% of all images, the true class was at least among the top 5 of AlexNet’s predictions. Such numbers were deemed completely impossible just a few years before. So how was this possible? It turned out that you can actually treat the kernel entries as tunable parameters inside a neural network, essentially allowing the model to learn the right matrices by itself when given sufficient training data.
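The idea of a learnable kernel can be sketched in a few lines of NumPy. This is a toy illustration under our own assumptions, not how AlexNet or PyLaia are trained: we generate random images, compute their responses to a known Sobel-style edge kernel, and then let plain gradient descent recover that kernel from the data alone. All names (`convolve2d`, `kernel_gradient`, `target_kernel`) and hyperparameters are hypothetical.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    out = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            out[row, col] = np.sum(image[row:row + kernel_h, col:col + kernel_w] * kernel)
    return out

def kernel_gradient(image, error, kernel_shape):
    """Gradient of the mean squared error with respect to each kernel entry."""
    kernel_h, kernel_w = kernel_shape
    out_h, out_w = error.shape
    grad = np.zeros(kernel_shape)
    for i in range(kernel_h):
        for j in range(kernel_w):
            # Kernel entry (i, j) touches exactly this shifted view of the image.
            grad[i, j] = 2.0 * np.mean(error * image[i:i + out_h, j:j + out_w])
    return grad

rng = np.random.default_rng(0)

# The "unknown" kernel the model is supposed to discover (assumed for this demo).
target_kernel = np.array([[-1.0, 0.0, 1.0],
                          [-2.0, 0.0, 2.0],
                          [-1.0, 0.0, 1.0]])

# Training data: random 8x8 images and their responses to the target kernel.
images = rng.random((20, 8, 8))
targets = [convolve2d(img, target_kernel) for img in images]

# Start from an all-zero kernel and tune its entries by gradient descent.
learned_kernel = np.zeros((3, 3))
learning_rate = 0.1
for epoch in range(200):
    for img, target in zip(images, targets):
        error = convolve2d(img, learned_kernel) - target
        learned_kernel -= learning_rate * kernel_gradient(img, error, learned_kernel.shape)

print(np.round(learned_kernel, 2))  # should end up close to target_kernel
```

A real convolutional network learns many such kernels at once, stacked in several layers, and is trained on class labels rather than on known filter responses, but the principle of adjusting kernel entries to reduce an error is the same.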
By using GPU-accelerated computing, these kinds of neural networks could suddenly learn in a few hours what humans couldn’t figure out in many decades. Over the next couple of years, convnet architectures became more and more refined. One big step was the introduction of so-called residual nets, in which the input to a group of layers also skips past those layers and is added back to their output. This made training much more effective as networks became deeper and deeper in order to capture higher levels of abstraction. Eventually, these developments gave rise to models that could beat average humans in a wide range of image recognition tasks. Today, thanks to modern software libraries and GPU vendors jumping on the bandwagon, accurate image classification with convolutional nets can almost be seen as a solved problem. Additionally, thanks to concepts like fine-tuning and transfer learning, it is even possible to solve problems with comparatively tiny amounts of training data.
So what are we still trying to achieve when it comes to text recognition? If we only had to deal with pictures of individual letters or numbers, our story would end here. Correctly labelling them is indeed trivial. But handwritten text is more complicated. First of all, there are almost infinitely many handwriting styles, and models usually don’t transfer well between texts from different authors. On top of that, even to the trained human eye, certain characters might only make sense in the context of a word or perhaps a whole sentence. Thus, to get more accurate HTR predictions, we must go beyond mere image classification and also look at natural language processing. In part 2 of this series, we will take a closer look at the other great advancement in deep learning: recurrent neural networks.