Thanks to increasingly capacious storage media, multimedia resources have become essential in many fields: information and broadcasting (news agencies, INA), culture (museums), transport (video surveillance), environment (satellite imagery), and medical imaging (hospital records). The challenge is then to find relevant information quickly, so research in multimedia increasingly focuses on indexing and retrieval techniques. For this task, the text embedded in images and videos can be a valuable key.

Recognizing text in images and videos raises many challenges: poor resolution, characters of varying sizes, compression artifacts, anti-aliasing effects, and complex, highly variable backgrounds. Text recognition proceeds in four steps: (1) detecting the presence of text, (2) localizing the text, (3) extracting and enhancing the text area, and finally (4) recognizing the content of the text. This work focuses on the last step; we assume that the text box has already been detected, localized, and extracted correctly. The recognition module itself can be divided into several sub-modules, such as binarization, text segmentation, and character recognition.

We focus on a particular machine learning model, the convolutional neural network (CNN): a neural network whose topology is inspired by the mammalian visual cortex. CNNs were initially used for handwritten digit recognition and have since been applied successfully to many pattern recognition problems.

This thesis proposes: a new binarization method for text images; a new segmentation method for text images; a study of a convolutional neural network for character recognition in images; a discussion of the relevance of the binarization step when text recognition in images relies on machine learning methods; and a new method for text recognition in images based on graph theory.
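To make the CNN building blocks concrete, the following is a minimal sketch, in numpy, of one forward pass through the typical layer sequence (convolution, ReLU, max pooling) on a toy "character" image. All names and the hand-set edge kernel are illustrative assumptions; in a real CNN such as the one studied in this thesis, the kernels are learned by backpropagation rather than set by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a grayscale image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Elementwise rectified linear activation."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Toy 6x6 image: a vertical stroke of ones in column 2
img = np.zeros((6, 6))
img[:, 2] = 1.0

# Hand-set vertical-edge detector (illustrative; a trained CNN learns these)
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = max_pool(relu(conv2d(img, kernel)))
print(feature_map)  # strong response where the stroke's left edge falls
```

The 6x6 input shrinks to a 4x4 map after the valid convolution, then to 2x2 after pooling; stacking such stages gives the translation tolerance that makes CNNs well suited to characters of varying size and position.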
Image and video text recognition using convolutional neural networks
© TELECOM ParisTech. Personal use of this material is permitted. The definitive version of this work was published as a thesis and is available at:
PERMALINK : https://www.eurecom.fr/publication/2707