Words Matter: Scene Text for Image Classification and Retrieval
ABSTRACT:
Text in natural images typically adds meaning to an object or scene. In particular, text specifies which business places serve drinks (e.g. cafe, teahouse) or food (e.g. restaurant, pizzeria), and what kind of service is provided (e.g. massage, repair). The mere presence of text, its words and its meaning are closely related to the semantics of the object or scene. This paper exploits the textual content of images for fine-grained business place classification and logo retrieval. There are four main contributions. First, we show that the textual cues extracted by the proposed method are effective for the two tasks: combining the proposed textual and visual cues outperforms visual-only classification and retrieval by a large margin. Second, to extract the textual cues, a generic and fully unsupervised word box proposal method is introduced. The method reaches state-of-the-art word detection recall with a limited number of proposals. Third, contrary to what is widely acknowledged in the text detection literature, we demonstrate that high recall in word detection is more important than a high f-score, at least for both tasks considered in this work. Last, this paper provides a large annotated text detection dataset with 10K images and 27,601 word boxes.
EXISTING SYSTEM:
- Most of the time, stores use text to indicate what type of food (pizzeria, diner), drink (tea, coffee) or service (dry cleaning, repair) they provide. This text information helps even human observers understand what type of business place it is. For instance, images of two different business places (a pizzeria and a bakery) can have a very similar appearance, yet they are different types of business places.
- Only the text information makes it possible to identify what type of business places these are. Moreover, text is also useful to distinguish similar products (logos) such as Heineken, Foster's and Carlsberg.
- The common approach to text recognition in images is to detect text before it can be recognized. State-of-the-art word detection methods focus on obtaining a high f-score by balancing precision and recall.
- Existing word detection methods usually follow a bottom-up approach: character candidates are computed by a connected component or a sliding window approach.
- Candidate character regions are further verified and combined to form word candidates. This is done using geometric, structural and appearance properties of text and is based on hand-crafted rules or learning schemes.
DISADVANTAGES OF EXISTING SYSTEM:
- There exists no single best method for detecting words with high recall, due to large variations in text style, size and orientation.
- Weak classifiers are used.
- Optimizing for a high f-score sacrifices recall, so words missed at the detection stage can never be recognized.
PROPOSED SYSTEM:
- In this paper, we focus on the classification of different business places, e.g., bakery, cafe and bookstore. Different business places often have only subtle differences in visual appearance.
- We exploit the recognized text in images for fine-grained classification of business places. Automatic recognition and indexing of business places is useful in many practical scenarios.
- We propose a multimodal approach which uses recognized text and visual cues for fine-grained classification and logo retrieval.
- We propose to combine character candidates generated by different state-of-the-art detection methods. To obtain robustness against varying imaging conditions, we use color spaces with photometric invariant properties such as robustness against shadows, highlights and specular reflections.
- The proposed method computes text lines and generates word box proposals based on the character candidates. The word box proposals are then used as input to a state-of-the-art word recognition method to yield textual cues. Finally, the textual cues are combined with visual cues for fine-grained classification and logo retrieval.
ADVANTAGES OF PROPOSED SYSTEM:
- The approach can be used, for instance, to extract information from Google Street View images, which Google Maps can then use to recommend bakeries or restaurants close to the location of the user.
- Instead of optimizing the f-score, our aim is to obtain a high recall. A high recall is required because textual cues that are not detected will not be considered in the subsequent (recognition) phase of the framework.
- The proposed method reaches state-of-the-art results on both tasks.
- To extract the word-level textual cues, a generic, efficient and fully unsupervised word proposal method is introduced, which reaches state-of-the-art word detection recall with a limited number of proposals.
- Contrary to what is widely acknowledged in the text detection literature, we experimentally show that high recall in word detection is more important than a high f-score, at least for both applications considered in this work.
MODULES:
- Word Level Textual Cue Encoding
- Visual Cue Encoding
- Classification and Retrieval
MODULES DESCRIPTION:
Word Level Textual Cue Encoding:
It involves the following steps:
- Image Acquisition
- Color Channel Generation
- Character Detection
- Word Proposal Generation & Word Recognition
1. Image Acquisition:
Images are acquired from the gallery.
2. Color Channel Generation:
In this stage, the RGB image is converted to the HSV color space, and the hue, saturation and value (intensity) channels are extracted for further processing. The value (intensity) channel, in particular, is used for character detection.
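A minimal MATLAB sketch of this step, assuming an RGB input image read from a hypothetical file storefront.jpg:

rgb = imread('storefront.jpg');   % hypothetical input image from the gallery
hsv = rgb2hsv(rgb);               % convert RGB to the HSV color space
H = hsv(:,:,1);                   % hue channel
S = hsv(:,:,2);                   % saturation channel
V = hsv(:,:,3);                   % value (intensity) channel, used for character detection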
3. Character Detection:
For character detection, two methods are used: MSER region detection and text saliency map generation. The value (V) channel is used for MSER region detection; however, MSER alone does not detect all text regions properly. Therefore, a text saliency map is generated as a complementary cue, and the text saliency is finally extracted.
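A minimal sketch of the MSER part of this step, run on the value channel V from the previous step (requires the Computer Vision System Toolbox; the region area limits are assumptions, not values from the paper):

Vgray = im2uint8(V);                       % MSER expects an integer grayscale image
regions = detectMSERFeatures(Vgray, ...
    'RegionAreaRange', [30 8000]);         % assumed size limits for character regions
figure; imshow(Vgray); hold on;
plot(regions, 'showPixelList', true, 'showEllipses', false);  % overlay candidates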
4. Word Proposal Generation & Word Recognition:
Word detection and recognition are done using morphological operations and an optical character recognition method. This involves the following stages:
Stage 1:
In this stage, text saliency image acquisition and word region detection are performed. First, a text saliency image, in RGB format, is taken as input. The image is then converted to a grayscale image for the next stage.
Stage 2:
In this stage, word extraction and word segmentation are performed. Morphological dilation and erosion operations are applied to fill holes. After the morphological operations, local thresholding is applied to convert the gray image into a binary image. To further enhance contrast, the pixel intensity values are scaled to the range 0 to 1. If unwanted gaps and holes remain in the word region, region growing segmentation is performed to segment the characters from the word region.
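A minimal sketch of Stages 1 and 2 under simplifying assumptions: the saliency image is read from a hypothetical file saliency.png, and global Otsu thresholding stands in for the local thresholding and region growing described above:

salRGB  = imread('saliency.png');              % Stage 1: hypothetical saliency image
salGray = mat2gray(rgb2gray(salRGB));          % grayscale, scaled to [0, 1]
se = strel('disk', 3);                         % assumed structuring element size
cleaned = imerode(imdilate(salGray, se), se);  % dilation then erosion to fill holes
bw = im2bw(cleaned, graythresh(cleaned));      % threshold to a binary word mask
bw = imfill(bw, 'holes');                      % close remaining gaps and holes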
Stage 3:
In this stage, word recognition is done using template matching. Each segmented character is matched against the character templates stored in the database, and the word is finally recognized from the best-matching characters.
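A minimal sketch of the template matching step, assuming a hypothetical cell array templates of binary character images with a corresponding character array labels, and a segmented binary character charImg from Stage 2:

charImg = imresize(charImg, size(templates{1}));  % resize to the template size
scores = zeros(1, numel(templates));
for k = 1:numel(templates)
    scores(k) = corr2(double(charImg), double(templates{k}));  % 2-D correlation
end
[~, best] = max(scores);
recognizedChar = labels(best);                    % character with the highest score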
Visual Cue Encoding:
This stage extracts visual features. The SURF feature descriptor is used for visual feature extraction, and the strongest key points are selected by SURF.
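A minimal sketch of this step (requires the Computer Vision System Toolbox; the number of key points kept is an assumption):

gray = rgb2gray(rgb);                          % SURF works on grayscale images
points = detectSURFFeatures(gray);             % detect interest points
strongest = points.selectStrongest(100);       % keep the 100 strongest key points
[features, validPts] = extractFeatures(gray, strongest);  % 64-D SURF descriptors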
Classification and Retrieval:
Classification is performed over the recognized words and the visual features. Based on the recognized words and visual features, both classification and similar-image retrieval are carried out.
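A minimal sketch of late fusion for retrieval, under assumed data structures: bowText (bag-of-words histograms over recognized words, one row per database image), bowVisual (aggregated SURF descriptors, one row per database image) and matching query vectors qText and qVisual are all hypothetical names, and the weight alpha is an assumption (pdist2 requires the Statistics Toolbox):

alpha = 0.5;                                   % assumed weight between the two cues
dVis = pdist2(qVisual, bowVisual, 'cosine');   % visual distance to each database image
dTxt = pdist2(qText,   bowText,   'cosine');   % textual distance to each database image
d = alpha * dVis + (1 - alpha) * dTxt;         % late fusion of textual and visual cues
[~, rank] = sort(d, 'ascend');                 % rank database images by fused distance
topK = rank(1:10);                             % indices of the top-10 retrieved images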
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
- System : Pentium Dual Core
- Hard Disk : 120 GB
- Monitor : 15'' LED
- Input Devices : Keyboard, Mouse
- RAM : 1 GB
SOFTWARE REQUIREMENTS:
- Operating System : Windows 7
- Coding Language : MATLAB
- Tool : MATLAB R2013a
REFERENCE:
Sezer Karaoglu, Ran Tao, Theo Gevers, and Arnold W. M. Smeulders, "Words Matter: Scene Text for Image Classification and Retrieval", IEEE Transactions on Multimedia, 2017.