Image Caption Generation and Comprehensive Comparison of Image Encoders

Shitiz Gupta 1*, Shubham Agnihotri 2, Deepasha Birla 3, Achin Jain 4, Thavavel Vaiyapuri 5, Puneet Singh Lamba 6

  • 1 Bharati Vidyapeeth’s College of Engineering, New Delhi, India - (guptashitiz17@gmail.com)
  • 2 Bharati Vidyapeeth’s College of Engineering, New Delhi, India - (skagnihotri1@gmail.com)
  • 3 Bharati Vidyapeeth’s College of Engineering, New Delhi, India - (birladeepasha99@gmail.com)
  • 4 Bharati Vidyapeeth’s College of Engineering, New Delhi, India - (achin.mails@gmail.com)
  • 5 College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Saudi Arabia - (t.thangam@psau.edu.sa)
  • 6 Bharati Vidyapeeth’s College of Engineering, New Delhi, India - (singhs.puneet@gmail.com)
  • DOI: https://doi.org/10.54216/FPA.040202

    Received: March 01, 2021; Accepted: July 27, 2021

    Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet human-generated captions are still considered better, which makes image captioning a challenging application for interactive machine learning. In this paper, we compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models; these vectors, along with embedded text, are fed into an Encoder-Decoder network based on stacked LSTMs with soft attention to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.

    Keywords:

    Image Captioning, Transfer Learning, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), and LSTM (Long Short-Term Memory).


    Cite This Article As:
