Development of Hybrid Image Caption Generation Method using Deep Learning
IEEE BASE PAPER TITLE:
Development of Hybrid Image Caption Generation Method using Deep Learning
(or)
OUR PROPOSED PROJECT TITLE:
From Pixels to Text: Deep Learning Approach for Image Caption Generation
IEEE BASE PAPER ABSTRACT:
Image captioning is the process of generating a descriptive sentence for a given image in human-understandable language; such a sentence is known as the caption of the image. An automatically generated caption is the result of a deep analysis of the image that involves detecting the objects it contains as well as the relationships between them. The generated caption should be meaningful and related to the context of the image. Image captioning is a heavily researched area that draws on expertise in computer vision (CV), natural language processing (NLP), and artificial intelligence (AI). This paper proposes a novel hybrid approach for higher-accuracy image captioning. A detailed review of traditional methods and deep learning-based methods developed for image captioning is presented. Different methods achieve significantly different Bilingual Evaluation Understudy (BLEU) scores on similar images; the hybrid approach is therefore developed to obtain a high BLEU score for each input image. The dataset generation, the implementation of the hybrid approach, and the challenges along with future work are discussed.
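Since the base paper evaluates and compares captioning methods by BLEU score, the sketch below shows how BLEU-1 through BLEU-4 can be computed for one generated caption against its human reference captions. It is a minimal example assuming NLTK's `sentence_bleu`; the tokenized captions are illustrative and not taken from the paper.

```python
# Minimal sketch of BLEU-based evaluation, assuming NLTK is available and
# captions have already been tokenized into lists of lowercase words.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_scores(reference_captions, generated_caption):
    """Return BLEU-1 through BLEU-4 for one generated caption.

    reference_captions : list of token lists (the human-annotated captions)
    generated_caption  : token list produced by the captioning model
    """
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    weights = [(1, 0, 0, 0),
               (0.5, 0.5, 0, 0),
               (1 / 3, 1 / 3, 1 / 3, 0),
               (0.25, 0.25, 0.25, 0.25)]
    return [sentence_bleu(reference_captions, generated_caption,
                          weights=w, smoothing_function=smooth)
            for w in weights]

# Example usage with illustrative captions
refs = [["a", "dog", "runs", "across", "the", "grass"],
        ["a", "brown", "dog", "is", "running", "on", "grass"]]
hyp = ["a", "dog", "is", "running", "on", "the", "grass"]
print(bleu_scores(refs, hyp))
```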
ALGORITHM / MODEL USED:
ResNet50 Architecture + LSTM.
OUR PROPOSED ABSTRACT:
In this project, we present an innovative approach to generate descriptive captions for images using deep learning techniques. The project is implemented in Python, utilizing the powerful combination of the ResNet50 architecture for image feature extraction and LSTM (Long Short-Term Memory) for caption generation.
Our model achieved an overall loss of 2.57, with accuracy ranging between 67% and 70%. The project leverages the popular and widely used Flickr 8k Dataset, which consists of 8,000 images, each accompanied by human-annotated captions. This dataset provides a diverse and rich source of images and captions for training and evaluation purposes.
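As a rough illustration of how the Flickr 8k annotations can be read in, the sketch below parses the standard `Flickr8k.token.txt` caption file (the file name and "image#n<TAB>caption" layout are assumed from the public distribution) and wraps each caption with start/end tokens for the LSTM; the exact preprocessing used in this project may differ.

```python
# Minimal sketch of loading the Flickr 8k caption annotations, assuming the
# standard "Flickr8k.token.txt" layout: "<image>.jpg#<n>\t<caption>" per line.
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):
    """Map each image file name to its list of human-annotated captions."""
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t")
            image_name = image_id.split("#")[0]  # drop the "#0".."#4" suffix
            # Wrap with start/end tokens so the LSTM learns where captions begin and end.
            captions[image_name].append("startseq " + caption.lower() + " endseq")
    return captions

captions = load_captions()
print(len(captions))  # roughly 8,000 images, 5 captions each
```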
To begin, we employ the ResNet50 architecture, a deep convolutional neural network, to extract high-level visual features from the input images. These features serve as a crucial foundation for understanding the content and context of the images. Next, we utilize LSTM, a type of recurrent neural network, to generate captions based on the extracted image features. LSTM models excel at capturing the sequential dependencies and linguistic structures necessary for generating coherent and contextually relevant captions.
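The sketch below outlines how this encoder-decoder pairing can be wired together in TensorFlow/Keras: a pretrained ResNet50 produces a 2048-dimensional feature vector per image, and a decoder merges that vector with an embedded, LSTM-encoded caption prefix to predict the next word. `VOCAB_SIZE`, `MAX_LEN`, the 256-unit layer sizes, and the merge-style decoder are illustrative assumptions, not necessarily the exact configuration used in this project.

```python
# Minimal sketch of the encoder-decoder wiring, assuming TensorFlow/Keras.
# VOCAB_SIZE and MAX_LEN are placeholders derived from the caption vocabulary.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed vocabulary size after tokenizing the captions
MAX_LEN = 34        # assumed maximum caption length in tokens

# Encoder: ResNet50 pretrained on ImageNet, used only as a feature extractor.
encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    """Return a 2048-dim ResNet50 feature vector for one image."""
    img = img_to_array(load_img(image_path, target_size=(224, 224)))
    img = preprocess_input(np.expand_dims(img, axis=0))
    return encoder.predict(img, verbose=0)[0]

# Decoder: image features and the partial caption are merged, then the
# network predicts the next word of the caption.
img_input = Input(shape=(2048,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(img_input))

seq_input = Input(shape=(MAX_LEN,))
seq_embed = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_input)
seq_lstm = LSTM(256)(Dropout(0.5)(seq_embed))

merged = add([img_dense, seq_lstm])
output = Dense(VOCAB_SIZE, activation="softmax")(Dense(256, activation="relu")(merged))

caption_model = Model(inputs=[img_input, seq_input], outputs=output)
```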
Throughout the project, we meticulously train and fine-tune the model using the Flickr 8k Dataset, ensuring that it learns to associate the visual features with the corresponding textual descriptions. We employ a combination of loss functions and optimization techniques to guide the learning process, aiming to minimize the overall loss and enhance the accuracy of caption generation. Through extensive experimentation and iterative refinement, our model achieves remarkable results, generating captions that accurately capture the essence of the images.
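A minimal sketch of this training setup is given below, continuing from the sketches above (`captions`, `extract_features`, `caption_model`, `MAX_LEN`, `VOCAB_SIZE`). It expands each caption into teacher-forcing pairs and trains with categorical cross-entropy and the Adam optimizer; the image folder path, epoch count, and batch size are assumptions rather than the project's exact settings.

```python
# Minimal sketch of the training setup, building on the earlier sketches.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts([c for caps in captions.values() for c in caps])

# Pre-compute one ResNet50 feature vector per image (folder path is assumed).
features = {name: extract_features("Flicker8k_Dataset/" + name)
            for name in captions}

# Teacher forcing: every caption prefix becomes one training sample whose
# target is the next word of that caption.
X_img, X_seq, y_word = [], [], []
for name, caps in captions.items():
    for cap in caps:
        seq = tokenizer.texts_to_sequences([cap])[0]
        for i in range(1, len(seq)):
            X_img.append(features[name])
            X_seq.append(pad_sequences([seq[:i]], maxlen=MAX_LEN)[0])
            y_word.append(to_categorical(seq[i], num_classes=VOCAB_SIZE))

caption_model.compile(loss="categorical_crossentropy", optimizer="adam",
                      metrics=["accuracy"])
caption_model.fit([np.array(X_img), np.array(X_seq)], np.array(y_word),
                  epochs=20, batch_size=64, validation_split=0.1)
```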
The overall loss of 2.57 indicates the model’s ability to produce captions with high fidelity to the ground truth annotations. The successful implementation of this project demonstrates the potential of deep learning in the domain of image caption generation. The generated captions have the potential to enhance accessibility, aid visually impaired individuals in perceiving visual content, and enrich the user experience in various applications, including content indexing and retrieval, social media platforms, and autonomous systems.
By combining the ResNet50 architecture for image feature extraction with LSTM for caption generation, we achieve remarkable accuracy and overall loss metrics. This project contributes to the ever-evolving field of deep learning and opens avenues for further advancements in AI-driven image understanding and communication.
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
- System : Pentium i3 processor
- Hard Disk : 500 GB
- Monitor : 15" LED
- Input Devices : Keyboard, Mouse
- RAM : 6 GB
SOFTWARE REQUIREMENTS:
- Operating system : Windows 10 / 11.
- Coding Language : Python 3.8.
- Web Framework : Flask (a minimal serving sketch follows this list).
- Frontend : HTML, CSS, JavaScript.
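The sketch below illustrates how the trained model could sit behind the Flask web framework listed above: a single endpoint accepts an uploaded image, runs greedy decoding with the model, tokenizer, and feature extractor from the earlier sketches, and returns the caption as JSON. The route name, upload field, and file handling are assumptions for illustration only.

```python
# Minimal sketch of serving the trained captioning model through Flask,
# assuming caption_model, tokenizer, extract_features, and MAX_LEN from the
# sketches above have already been loaded into memory.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)

def generate_caption(image_path):
    """Greedy decoding: feed the growing caption back into the LSTM."""
    feature = extract_features(image_path).reshape(1, -1)
    caption = "startseq"
    for _ in range(MAX_LEN):
        seq = pad_sequences(tokenizer.texts_to_sequences([caption]), maxlen=MAX_LEN)
        probs = caption_model.predict([feature, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()

@app.route("/caption", methods=["POST"])
def caption_endpoint():
    uploaded = request.files["image"]      # upload field name is an assumption
    uploaded.save("uploaded.jpg")
    return jsonify({"caption": generate_caption("uploaded.jpg")})

if __name__ == "__main__":
    app.run(debug=True)
```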
REFERENCE:
Anuja Namdev and S. R. N. Reddy, "Development of Hybrid Image Caption Generation Method using Deep Learning," 2023 10th International Conference on Signal Processing and Integrated Networks (SPIN), IEEE, 2023.