This is a project report from the course “Neural Networks”.
Authors: Shan Wu, Joonas Lõmps
Optical character recognition (OCR) is the conversion of images of typed, handwritten, or printed text into machine-encoded text. Automating the digitization of handwritten characters offers many advantages when processing large volumes of papers and documents, and when transferring data into computers.
1. Idea
We do OCR on one of the Japanese syllabaries, Hiragana. It consists of 71 different characters, almost three times as many as the English alphabet. In addition to having more characters, the characters themselves are more complex, making them harder to distinguish from one another. We have a dataset of 12211 images of the Japanese Hiragana alphabet, approximately 172 of each character, written in a total of 28 different handwritings.
We explored two different directions:
- A simple single-character model, aiming for as high an accuracy as we could manage on individual characters. (left)
- A model trained to recognize characters in an image containing multiple characters. (right)
2. Single character recognition
The idea is to teach the machine what each character looks like. We wanted to test some of the state-of-the-art predefined networks used in optical character recognition, as well as try to come up with a network ourselves. Our search led us to VGG-19 with batch normalization [1] and Inception v3 [2].
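For reference, both off-the-shelf networks are available in torchvision; the snippet below is a minimal sketch of how they can be instantiated with a 71-way output head, not our exact training configuration.

```python
import torch
import torchvision.models as models

NUM_CLASSES = 71  # number of Hiragana characters in our dataset

# VGG-19 with batch normalization, randomly initialized, 71-way output head.
vgg19_bn = models.vgg19_bn(num_classes=NUM_CLASSES)

# Inception v3 (expects 299x299 inputs); its auxiliary classifier is only
# active in training mode.
inception = models.inception_v3(num_classes=NUM_CLASSES, aux_logits=True)

# Quick sanity check with a dummy batch (grayscale images replicated to 3 channels).
x = torch.randn(1, 3, 256, 256)
print(vgg19_bn(x).shape)  # torch.Size([1, 71])
```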
2.1 Preprocessing
We did a very simple preprocessing that consists of the following (a minimal code sketch follows the input-size list below):
- Resize the input image to the network’s input size and, if needed, replicate the single grayscale channel, since our data is in 50×50-pixel grayscale format.
- Normalize the pixel values (zero mean, unit standard deviation).
The input sizes we used for each network (Height, Width, Channel):
- VGG-19: 256x256x3
- Inception v3: 299x299x3
- Customized: 50x50x1
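A minimal sketch of this preprocessing with torchvision transforms is shown below; the normalization constants here are illustrative placeholders, whereas in practice the mean and standard deviation would be estimated from the training set.

```python
import torchvision.transforms as T

def make_transform(size, channels):
    """Build a simple resize + channel-replicate + normalize pipeline (sketch)."""
    return T.Compose([
        T.ToPILImage(),                              # raw 50x50 grayscale array -> PIL image
        T.Resize((size, size)),                      # resize to the network's input size
        T.Grayscale(num_output_channels=channels),   # replicate the channel if the net expects RGB
        T.ToTensor(),                                # float tensor in [0, 1]
        T.Normalize(mean=[0.5] * channels, std=[0.5] * channels),  # rough zero-mean, unit-std
    ])

vgg_tf = make_transform(256, 3)        # VGG-19: 256x256x3
inception_tf = make_transform(299, 3)  # Inception v3: 299x299x3
custom_tf = make_transform(50, 1)      # customized net: 50x50x1
```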
2.2 Experiments and results
To make our lives a bit easier, we use the PyTorch library to handle the neural-network side.
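The single-character models were trained with a standard supervised classification loop along the following lines; this is a generic sketch, and the hyperparameters shown are placeholders rather than our exact settings.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cuda"):
    """Generic classification training loop (sketch); hyperparameters are illustrative."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            # Inception v3 returns (logits, aux_logits) in training mode.
            if isinstance(outputs, tuple):
                outputs = outputs[0]
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        # Validation accuracy.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch}: validation accuracy {correct / total:.4f}")
```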
A quick overview of the parameters used in training and the results achieved is as follows:
All the models achieve validation accuracy over 98%, which is great, but a closer look reveals the differences. The first thing to notice is the number of parameters in each model. More is not always better: Inception v3 achieved significantly better results with more than five times fewer trainable parameters. On the other hand, if inference speed is the priority, a customized network may be the better choice.
3. Multi Character Recognition
In this part, we focus on teaching the network how to recognize characters from images with multiple characters on them and then classify the recognized characters.
To achieve this we are using an object detection system called Faster R-CNN [3]. It is composed of two modules:
- A deep fully convolutional network, the region proposal network (RPN), which proposes regions of interest (RoIs).
- The Fast R-CNN detector [4], which uses the proposed regions.
The output of the system is a list of RoI’s and an object classification for those RoI’s.
We used the same logic to train the system to recognize characters in images containing multiple characters. We use the ResNet-50 [5] network as our backbone.
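In torchvision this setup can be assembled roughly as follows; the snippet is a sketch, and the class count (71 characters plus a background class) and the dummy target are assumptions for illustration, not our exact configuration.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# 71 Hiragana characters + 1 background class expected by Faster R-CNN.
NUM_CLASSES = 72

# ResNet-50 + FPN backbone, randomly initialized (trained from scratch).
model = fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES)

# In training mode the model expects a list of image tensors and a list of
# target dicts with "boxes" (N, 4) and "labels" (N,), and returns a loss dict.
images = [torch.rand(3, 300, 300)]
targets = [{
    "boxes": torch.tensor([[30.0, 40.0, 80.0, 90.0]]),
    "labels": torch.tensor([5]),
}]
model.train()
loss_dict = model(images, targets)
print({name: float(value) for name, value in loss_dict.items()})
```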
3.1 Dataset
To do this we had to generate a new dataset with multiple characters on a single image, with their locations known to us. We came up with the following logic to generate new images:
Our characters are 50×50 (H, W), so if we want n characters on the same image we need a canvas of size (H*(n+1), W*(n+1)) to be sure that all the characters fit with no overlap. The locations of the characters were chosen randomly and checked for overlap with the already placed characters. The number of characters (between 3 and 10) and the characters themselves were also chosen randomly. A minimal sketch of this generation logic is shown below.
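The sketch below illustrates the placement logic under simple assumptions: `characters` is assumed to be an array of 50×50 character images, and the label stored for each placed character is just its index in that array.

```python
import numpy as np

H = W = 50  # size of a single character image

def make_sample(characters, rng, n_min=3, n_max=10):
    """Place n randomly chosen 50x50 characters on a blank canvas without overlap.

    Returns the canvas plus the bounding boxes and (hypothetical) labels of the
    placed characters.
    """
    n = rng.integers(n_min, n_max + 1)
    canvas = np.zeros((H * (n + 1), W * (n + 1)), dtype=np.uint8)
    boxes, labels = [], []

    for _ in range(n):
        idx = rng.integers(len(characters))
        # Retry random positions until the new box does not overlap existing ones.
        while True:
            y = rng.integers(0, canvas.shape[0] - H + 1)
            x = rng.integers(0, canvas.shape[1] - W + 1)
            box = (x, y, x + W, y + H)
            if all(box[2] <= b[0] or b[2] <= box[0] or
                   box[3] <= b[1] or b[3] <= box[1] for b in boxes):
                break
        canvas[y:y + H, x:x + W] = characters[idx]
        boxes.append(box)
        labels.append(idx)

    return canvas, np.array(boxes), np.array(labels)

# Example usage:
# rng = np.random.default_rng(0)
# canvas, boxes, labels = make_sample(characters, rng)
```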
3.2 Experiments and results
We tried to train the Faster R-CNN with ResNet-50 as the backbone and a feature pyramid network (FPN) from scratch. The number of epochs was set to 2, the learning rate of the optimizer was set to 1e-3, and all other hyperparameters were left unchanged. As shown in the loss figures below, training went smoothly at the beginning: the total loss was decreasing, like all the other criteria, but the model suddenly failed after around 5000 iterations. All loss values rose dramatically to a peak, and the model could not detect characters afterwards.
Next, we continued training after the model failed. The sum of the losses decreased again, as did the classifier loss. However, the objectness loss stayed high, which means the model could no longer detect characters. The RoI box regression loss and the RPN box regression loss also failed to converge, and the bounding-box predictions failed completely.
Still, we managed to visualize some of the early predictions from the model. In the figures shown in the validation results, the ground truths are the green boxes, while the red ones are the model’s predictions. At this early stage the model predicts many bounding boxes, and the accuracy is not very high for most of them, because the model had not yet converged.
In the second training run, after the loss explosion, the model lost the ability to produce bounding boxes: only the ground-truth boxes appear in the plots.
In the above experiment, we tried to train the Faster R-CNN from scratch in an end-to-end way (given inputs and targets, train the whole model iteratively). However, the authors of Faster R-CNN suggest using a four-stage method instead, which is more involved. It can be summarized in the following steps:
- Train the RPN independently with a batch size of 1. Initialize the weights of the backbone from a pre-trained model and fine-tune it for the region proposal task. The loss is based on the IoU of the predicted bounding boxes and the ground-truth boxes: an IoU greater than 0.7 counts as a positive prediction, an IoU under 0.3 as negative, and all remaining proposals are discarded (see the IoU sketch after this list).
- Train the Fast R-CNN detector, also independently. Initialize the weights of its backbone from the pre-trained model and fine-tune it for the object detection task. Next, fix the backbone as well as the RPN, and train the Fast R-CNN part using the RPN’s proposals.
- Initialize the weights of the RPN’s backbone from the Fast R-CNN backbone trained in the previous step. All layers shared between the two backbones are fixed, except the layers that are unique to the RPN. This is the finalized RPN.
- Using the RPN trained in the previous step, train the Fast R-CNN detector while keeping all shared layers fixed. Only the layers unique to the detector are trained.
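The IoU thresholding rule from the first step can be expressed as a small helper; this is a generic sketch of the standard IoU computation and anchor labeling scheme, not code from our project.

```python
import torch

def box_iou(boxes_a, boxes_b):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format; returns an (A, B) matrix."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])  # top-left of intersection
    rb = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor as positive (1), negative (0), or ignored (-1) by its best IoU."""
    best_iou = box_iou(anchors, gt_boxes).max(dim=1).values
    labels = torch.full((len(anchors),), -1, dtype=torch.long)  # ignored by default
    labels[best_iou < neg_thresh] = 0
    labels[best_iou > pos_thresh] = 1
    return labels
```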
In the end, this method gives a Faster R-CNN with a weight-shared backbone. Due to time limitations we did not manage to train the model this way, but it is worth trying in the future.
4. Conclusion
In this project, we trained different state-of-the-art neural network models, as well as a customized one, to do optical handwritten character recognition. For single-character recognition, we achieved the desired results with accuracy over 99%.
For multi-character recognition, we generated a new dataset and tried to train a Faster R-CNN model to predict both the bounding boxes of the characters and the characters themselves. Unfortunately, we ran into issues during training, but still managed to get some results out. We propose a way to fix the issue we ran into, but have not managed to implement it yet.
5. Code & Links
Our git repository: https://github.com/simonwu53/Hiragana-Recognition
References
[1] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
[2] Szegedy, Christian, et al. “Rethinking the Inception architecture for computer vision.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[3] Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in Neural Information Processing Systems. 2015.
[4] Girshick, Ross. “Fast R-CNN.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
[5] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.