This is a project report from the course “Neural Networks”.
Authors: Shan Wu, Joonas Lõmps
Optical character recognition (OCR) is the conversion of images of typed, handwritten, or printed text into machine-encoded text. Automating the digitization of handwritten characters offers many advantages when processing large volumes of papers and documents, and when transferring data into computers.
1. Idea
We do OCR on one of the Japanese syllabaries, Hiragana. It consists of 71 different characters, almost three times as many as the English alphabet. In addition to having more characters, the characters themselves are more complex, making them harder to distinguish from one another. We have a dataset of 12211 images of the Japanese Hiragana alphabet, approximately 172 of each character, written in a total of 28 different handwritings.
We explored two different directions:
- A simple single-character model, aiming for as high an accuracy as we could manage on individual characters. (left)
- A model trained to recognize characters in an image containing multiple characters. (right)
2. Single character recognition
The idea is to teach the machine what each character looks like. We wanted to test some of the state-of-the-art predefined networks used in optical character recognition, as well as try to come up with a network ourselves. Our search led us to VGG-19 with batch normalization [1] and Inception v3 [2].
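For reference, both off-the-shelf networks are available in torchvision; the snippet below is a minimal sketch of how they can be instantiated with a 71-way output head, not our exact training configuration.

```python
import torch
import torchvision.models as models

NUM_CLASSES = 71  # number of Hiragana characters in our dataset

# VGG-19 with batch normalization, randomly initialized, 71-way output head.
vgg19_bn = models.vgg19_bn(num_classes=NUM_CLASSES)

# Inception v3 (expects 299x299 inputs); its auxiliary classifier is only
# active in training mode.
inception = models.inception_v3(num_classes=NUM_CLASSES, aux_logits=True)

# Quick sanity check with a dummy batch (grayscale images replicated to 3 channels).
x = torch.randn(1, 3, 256, 256)
print(vgg19_bn(x).shape)  # torch.Size([1, 71])
```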
2.1 Preprocessing
We did a very simple preprocessing that consists of the following (a minimal code sketch follows the input-size list below):
- Resize the input image to the network’s input size and, if needed, replicate the single grayscale channel, since our data is in 50×50-pixel grayscale format.
- Normalize the pixel values (zero mean, unit standard deviation).
The input sizes we used for each network (Height, Width, Channel):
- VGG-19: 256x256x3
- Inception v3: 299x299x3
- Customized: 50x50x1
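A minimal sketch of this preprocessing with torchvision transforms is shown below; the normalization constants here are illustrative placeholders, whereas in practice the mean and standard deviation would be estimated from the training set.

```python
import torchvision.transforms as T

def make_transform(size, channels):
    """Build a simple resize + channel-replicate + normalize pipeline (sketch)."""
    return T.Compose([
        T.ToPILImage(),                              # raw 50x50 grayscale array -> PIL image
        T.Resize((size, size)),                      # resize to the network's input size
        T.Grayscale(num_output_channels=channels),   # replicate the channel if the net expects RGB
        T.ToTensor(),                                # float tensor in [0, 1]
        T.Normalize(mean=[0.5] * channels, std=[0.5] * channels),  # rough zero-mean, unit-std
    ])

vgg_tf = make_transform(256, 3)        # VGG-19: 256x256x3
inception_tf = make_transform(299, 3)  # Inception v3: 299x299x3
custom_tf = make_transform(50, 1)      # customized net: 50x50x1
```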
2.2 Experiments and results
To make our lives a bit easier, we use the PyTorch library to handle the neural-network side.
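The single-character models were trained with a standard supervised classification loop along the following lines; this is a generic sketch, and the hyperparameters shown are placeholders rather than our exact settings.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cuda"):
    """Generic classification training loop (sketch); hyperparameters are illustrative."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            # Inception v3 returns (logits, aux_logits) in training mode.
            if isinstance(outputs, tuple):
                outputs = outputs[0]
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        # Validation accuracy.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch}: validation accuracy {correct / total:.4f}")
```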
A quick overview of the parameters used in training and the results achieved is as follows:
All the models achieve validation accuracy over 98%, which is great, but a closer look reveals the differences. The first thing to notice is the number of parameters in each model. More is not always better: Inception v3 achieved significantly better results with more than five times fewer trainable parameters. On the other hand, if inference speed is the priority, a customized network may be the better choice.
3. Multi Character Recognition
In this part, we focus on teaching the network how to recognize characters from images with multiple characters on them and then classify the recognized characters.
To achieve this we are using an object detection system called Faster R-CNN [3]. It is composed of two modules:
- A deep fully convolutional network, the region proposal network (RPN), which proposes regions of interest (RoIs).
- The Fast R-CNN detector [4], which uses the proposed regions.
The output of the system is a list of RoI’s and an object classification for those RoI’s.
We used the same logic to train the system to recognize characters in images containing multiple characters. We use the ResNet-50 [5] network as our backbone.
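In torchvision this setup can be assembled roughly as follows; the snippet is a sketch, and the class count (71 characters plus a background class) and the dummy target are assumptions for illustration, not our exact configuration.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# 71 Hiragana characters + 1 background class expected by Faster R-CNN.
NUM_CLASSES = 72

# ResNet-50 + FPN backbone, randomly initialized (trained from scratch).
model = fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES)

# In training mode the model expects a list of image tensors and a list of
# target dicts with "boxes" (N, 4) and "labels" (N,), and returns a loss dict.
images = [torch.rand(3, 300, 300)]
targets = [{
    "boxes": torch.tensor([[30.0, 40.0, 80.0, 90.0]]),
    "labels": torch.tensor([5]),
}]
model.train()
loss_dict = model(images, targets)
print({name: float(value) for name, value in loss_dict.items()})
```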
3.1 Dataset
To do this we had to generate a new dataset with multiple characters on a single image, with their locations known to us. We came up with the following logic to generate new images:
Our characters are 50×50 (H, W), so if we want n characters on the same image we need a canvas of size (H*(n+1), W*(n+1)) to be sure that all the characters fit with no overlap. The locations of the characters were chosen randomly and checked for overlap with the already placed characters. The number of characters (between 3 and 10) and the characters themselves were also chosen randomly. A minimal sketch of this generation logic is shown below.
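The sketch below illustrates the placement logic under simple assumptions: `characters` is assumed to be an array of 50×50 character images, and the label stored for each placed character is just its index in that array.

```python
import numpy as np

H = W = 50  # size of a single character image

def make_sample(characters, rng, n_min=3, n_max=10):
    """Place n randomly chosen 50x50 characters on a blank canvas without overlap.

    Returns the canvas plus the bounding boxes and (hypothetical) labels of the
    placed characters.
    """
    n = rng.integers(n_min, n_max + 1)
    canvas = np.zeros((H * (n + 1), W * (n + 1)), dtype=np.uint8)
    boxes, labels = [], []

    for _ in range(n):
        idx = rng.integers(len(characters))
        # Retry random positions until the new box does not overlap existing ones.
        while True:
            y = rng.integers(0, canvas.shape[0] - H + 1)
            x = rng.integers(0, canvas.shape[1] - W + 1)
            box = (x, y, x + W, y + H)
            if all(box[2] <= b[0] or b[2] <= box[0] or
                   box[3] <= b[1] or b[3] <= box[1] for b in boxes):
                break
        canvas[y:y + H, x:x + W] = characters[idx]
        boxes.append(box)
        labels.append(idx)

    return canvas, np.array(boxes), np.array(labels)

# Example usage:
# rng = np.random.default_rng(0)
# canvas, boxes, labels = make_sample(characters, rng)
```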
3.2 Experiments and results
We tried to train the Faster R-CNN with ResNet-50 as the backbone and a feature pyramid network (FPN) from scratch. The number of epochs was set to 2, the learning rate of the optimizer was set to 1e-3, and all other hyperparameters were left unchanged. As shown in the loss figures below, training went smoothly at the beginning: the total loss was decreasing, like all the other criteria, but the model suddenly failed after around 5000 iterations. All loss values rose dramatically to a peak, and the model could not detect characters afterwards.
Next, we continued training after the model failed. The sum of the losses decreased again, as did the classifier loss. However, the objectness loss stayed high, which means the model could no longer detect characters. The RoI box regression loss and the RPN box regression loss also failed to converge, and the bounding-box predictions failed completely.
Still, we managed to visualize some of the early predictions from the model. In the figures shown in the validation results, the ground truths are the green boxes, while the red ones are the model’s predictions. At this early stage the model predicts many bounding boxes, and the accuracy is not very high for most of them, because the model had not yet converged.
In the second training run, after the loss explosion, the model lost the ability to produce bounding boxes: only the ground-truth boxes appear in the plots.
In the above experiment, we tried to train the Faster R-CNN from scratch in an end-to-end way (given inputs and targets, train the whole model iteratively). However, the authors of Faster R-CNN suggest using a four-stage method instead, which is more involved. It can be summarized in the following steps:
- Train the RPN independently with a batch size of 1. Initialize the weights of the backbone from a pre-trained model and fine-tune it for the region proposal task. The loss is based on the IoU of the predicted bounding boxes and the ground-truth boxes: an IoU greater than 0.7 counts as a positive prediction, an IoU under 0.3 as negative, and all remaining proposals are discarded (see the IoU sketch after this list).
- Train the Fast R-CNN detector, also independently. Initialize the weights of its backbone from the pre-trained model and fine-tune it for the object detection task. Next, fix the backbone as well as the RPN, and train the Fast R-CNN part using the RPN’s proposals.
- Initialize the weights of the RPN’s backbone from the Fast R-CNN backbone trained in the previous step. All layers shared between the two backbones are fixed, except the layers that are unique to the RPN. This is the finalized RPN.
- Using the RPN trained in the previous step, train the Fast R-CNN detector while keeping all shared layers fixed. Only the layers unique to the detector are trained.
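The IoU thresholding rule from the first step can be expressed as a small helper; this is a generic sketch of the standard IoU computation and anchor labeling scheme, not code from our project.

```python
import torch

def box_iou(boxes_a, boxes_b):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format; returns an (A, B) matrix."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])  # top-left of intersection
    rb = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor as positive (1), negative (0), or ignored (-1) by its best IoU."""
    best_iou = box_iou(anchors, gt_boxes).max(dim=1).values
    labels = torch.full((len(anchors),), -1, dtype=torch.long)  # ignored by default
    labels[best_iou < neg_thresh] = 0
    labels[best_iou > pos_thresh] = 1
    return labels
```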
In the end, this method gives a Faster R-CNN with a weight-shared backbone. Due to time limitations we did not manage to train the model this way, but it is worth trying in the future.
4. Conclusion
In this project, we trained different state-of-the-art neural network models, as well as a customized one, to do optical handwritten character recognition. For single-character recognition, we achieved the desired results with accuracy over 99%.
For multi-character recognition, we generated a new dataset and tried to train a Faster R-CNN model to predict both the bounding boxes of the characters and the characters themselves. Unfortunately, we ran into issues during training, but still managed to get some results out. We propose a way to fix the issue we ran into, but have not managed to implement it yet.
5. Code & Links
Our git repository: https://github.com/simonwu53/Hiragana-Recognition
References
[1] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
[2] Szegedy, Christian, et al. “Rethinking the Inception architecture for computer vision.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[3] Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in Neural Information Processing Systems. 2015.
[4] Girshick, Ross. “Fast R-CNN.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
[5] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.