Image Captioning with Convolutional Neural Networks and Autoencoder-Transformer Model


  • Selvani Deepthi Kavila, Department of CSE (Artificial Intelligence & Machine Learning and Data Science), Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam, Andhra Pradesh, India
  • Moni Sushma Deep Kavila, Department of CSE (Artificial Intelligence & Machine Learning and Data Science), Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam, Andhra Pradesh, India
  • Kanaka Raghu Sreerama, Department of AI & ADS, GST, GITAM University, Visakhapatnam, Andhra Pradesh, India
  • Sai Harsha Vardhan Pittada, Department of CSE (Artificial Intelligence & Machine Learning and Data Science), Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam, Andhra Pradesh, India
  • Krishna Rupendra Singh, Department of Computer Science and Engineering, Vignan’s Institute of Engineering for Women, Visakhapatnam, Andhra Pradesh, India
  • Badugu Samatha, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Green Fields, Vaddeswaram, Andhra Pradesh, India
  • Mahanty Rashmita, Department of Basic Sciences and Humanities, Vignan’s Institute of Engineering for Women, Visakhapatnam, Andhra Pradesh, India



Keywords: Image Captioning, Deep Learning, Transformers, Autoencoders, Convolutional Neural Networks, Machine Learning


This study examines image captioning with deep learning, combining convolutional neural networks (CNNs) with Transformers that use encoder-decoder mechanisms. It aims to give a detailed account of the methodologies, algorithms and procedures involved in captioning images, and it explores and implements efficient technologies for producing relevant captions. The methods and utilities used are predefined CNN models, the COCO dataset, Transformers (a BERT encoder and a GPT decoder), and machine learning algorithms for visualizing and analyzing the model's performance, with the goal of contributing to the accuracy and effectiveness of image captioning models. The model's performance is reported through the evaluation and comparison of metrics applied to the generated captions.
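As an illustration of what a metric applied to a generated caption can look like, the sketch below computes clipped n-gram precision, the building block of BLEU-style scores. The abstract does not name the metrics used, so the choice of metric is an assumption made for illustration, and the function names and example captions are hypothetical rather than taken from the study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against reference captions.

    Illustrative only: this is one component of a BLEU-style score,
    not the evaluation code used in the study.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count found in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical captions, whitespace-tokenized for simplicity.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across a field".split(),
]
print(modified_precision(candidate, references, 1))
print(modified_precision(candidate, references, 2))
```

Clipping stops a caption from scoring well by repeating a common word: a candidate of "the the the" scored against the reference "the cat" gets a unigram precision of 1/3, not 1.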





How to Cite

Kavila, S. D., Kavila, M. S. D., Sreerama, K. R., Pittada, S. H. V., Singh, K. R., Samatha, B., & Rashmita, M. (2024). Image Captioning with Convolutional Neural Networks and Autoencoder-Transformer Model. International Journal of Experimental Research and Review, 46, 297–304.


