Image Captioning with Convolutional Neural Networks and Autoencoder-Transformer Model

Authors

  • Selvani Deepthi Kavila, Department of CSE (Artificial Intelligence & Machine Learning and Data Science), Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam, Andhra Pradesh, India. https://orcid.org/0000-0001-5307-3113
  • Moni Sushma Deep Kavila, Department of CSE (Artificial Intelligence & Machine Learning and Data Science), Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam, Andhra Pradesh, India. https://orcid.org/0009-0000-6457-3500
  • Kanaka Raghu Sreerama, Department of AI & ADS, GST, GITAM University, Visakhapatnam, Andhra Pradesh, India. https://orcid.org/0000-0003-1168-237X
  • Sai Harsha Vardhan Pittada, Department of CSE (Artificial Intelligence & Machine Learning and Data Science), Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam, Andhra Pradesh, India. https://orcid.org/0009-0004-5871-1427
  • Krishna Rupendra Singh, Department of Computer Science and Engineering, Vignan’s Institute of Engineering for Women, Visakhapatnam, Andhra Pradesh, India. https://orcid.org/0009-0007-6402-9194
  • Badugu Samatha, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Green Fields, Vaddeswaram, Andhra Pradesh, India. https://orcid.org/0000-0003-1353-2797
  • Mahanty Rashmita, Department of Basic Sciences and Humanities, Vignan’s Institute of Engineering for Women, Visakhapatnam, Andhra Pradesh, India. https://orcid.org/0000-0001-9247-8295

DOI:

https://doi.org/10.52756/ijerr.2024.v46.023

Keywords:

Image Captioning, Deep Learning, Transformers, Autoencoders, Convolutional Neural Networks, Machine Learning

Abstract

This study examines emerging machine learning technologies, deep learning, and Transformer architectures with autoencoder (encoder-decoder) mechanisms for image captioning. It provides in-depth and detailed information about the methodologies, algorithms, and procedures involved in the image captioning task, and it explores and implements efficient techniques for producing relevant captions. The research aims to build a detailed understanding of image captioning with Transformers and convolutional neural networks, drawing on the range of available algorithms. The methods and utilities used in this study include predefined CNN models, the COCO dataset, Transformers (a BERT encoder and a GPT decoder), and machine learning algorithms for visualization and analysis of model performance, which together contribute to advances in the accuracy and effectiveness of image captioning models and technologies. Evaluation metrics applied to the generated captions, and their comparison, indicate the model's performance.
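To make the pipeline described in the abstract concrete, the sketch below shows one plausible way to wire a pretrained CNN feature extractor to a GPT-style Transformer decoder and score the output with BLEU, one of the metrics discussed. It is a minimal illustration under assumed choices (ResNet-50 features, a learned linear projection used as a visual prefix, greedy decoding with GPT-2, the hypothetical file example.jpg), not the paper's exact architecture, BERT-encoder coupling, or training procedure.

```python
# Minimal sketch of a CNN-encoder / Transformer-decoder captioning pipeline.
# Model names, the projection, and the decoding loop are illustrative assumptions.
import torch
from torchvision import models, transforms          # torchvision >= 0.13 assumed
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from PIL import Image

# 1. Extract image features with a pretrained CNN (penultimate ResNet-50 layer).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()                         # drop the classification head
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")      # any COCO-style image (hypothetical path)
with torch.no_grad():
    features = cnn(preprocess(image).unsqueeze(0))    # shape: (1, 2048)

# 2. Project CNN features into the decoder's embedding space (768 = GPT-2 hidden size).
#    In practice this projection (and possibly the decoder) would be trained on
#    COCO image-caption pairs; here it is randomly initialized for illustration.
proj = torch.nn.Linear(2048, 768)
prefix = proj(features).unsqueeze(1)                  # (1, 1, 768) visual "prefix" token

# 3. Decode a caption with GPT-2, conditioning on the visual prefix (greedy decoding).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
decoder = GPT2LMHeadModel.from_pretrained("gpt2")
decoder.eval()

generated = tokenizer.encode("A photo of", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):
        tok_emb = decoder.transformer.wte(generated)            # token embeddings
        inputs_embeds = torch.cat([prefix, tok_emb], dim=1)     # prepend visual prefix
        logits = decoder(inputs_embeds=inputs_embeds).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break

caption = tokenizer.decode(generated[0], skip_special_tokens=True)
print("Caption:", caption)

# 4. Score the caption against reference captions with BLEU; METEOR, ROUGE,
#    and CIDEr follow the same compare-against-references pattern.
from nltk.translate.bleu_score import sentence_bleu
references = [["a", "dog", "runs", "on", "the", "grass"]]       # tokenized reference caption
print("BLEU:", sentence_bleu(references, caption.lower().split()))
```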

References

Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. Proceedings of the European Conference on Computer Vision (ECCV), pp. 382-398. https://doi.org/10.48550/arXiv.1607.08822

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. ICLR 2015. https://doi.org/10.48550/arXiv.1409.0473

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166. https://doi.org/10.1109/72.279181

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. Association for Computational Linguistics, pp. 1724–1734. https://doi.org/10.48550/arXiv.1406.1078

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171-4186. https://doi.org/10.48550/arXiv.1810.04805

Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12, 2451-2471. https://doi.org/10.1162/089976600300015015

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735

Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4565-4574. https://doi.org/10.1109/CVPR.2016.494.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning, 37, 2048-2057. https://dl.acm.org/doi/10.5555/3045118.3045336

Keerthana, B., Vamsinath, J., Kumari, C. S., Appaji, S. V. S., Rani, P. P., & Chilukuri, S. (2024). Machine Learning Techniques for Medicinal Leaf Prediction and Disease Identification. International Journal of Experimental Research and Review, 42, 320–327. https://doi.org/10.52756/ijerr.2024.v42.028

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ICLR 2015. https://doi.org/10.48550/arXiv.1412.6980

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

Lavie, A., & Agarwal, A. (2007). METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. Proceedings of the Second Workshop on Statistical Machine Translation, Association for Computational Linguistics, pp. 228-231. https://dl.acm.org/doi/10.5555/1626355.1626389

Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. https://aclanthology.org/W04-1013.pdf

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119. https://doi.org/10.48550/arXiv.1310.4546

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. https://doi.org/10.3115/1073083.1073135

Rao, M. S., Sekhar, C., & Bhattacharyya, D. (2021). Comparative analysis of machine learning models on loan risk analysis. In Advances in Intelligent Systems and Computing, pp. 81–90. https://doi.org/10.1007/978-981-15-9516-5_7

Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., & Goel, V. (2017). Self-critical sequence training for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1179-1195. https://doi.org/10.1109/CVPR.2017.131

Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 4278-4284. https://dl.acm.org/doi/10.5555/3298023.3298188

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 1-11. https://doi.org/10.48550/arXiv.1706.03762

Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566-4575. https://doi.org/10.1109/CVPR.2015.7299087

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164. https://doi.org/10.48550/arXiv.1411.4555

Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J. J., & Gao, J. (2020). Unified vision-language pre-training for image captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 13041-13049. https://doi.org/10.1609/aaai.v34i07.7005

Published

2024-12-30

How to Cite

Kavila, S. D., Kavila, M. S. D., Sreerama, K. R., Pittada, S. H. V., Singh, K. R., Samatha, B., & Rashmita, M. (2024). Image Captioning with Convolutional Neural Networks and Autoencoder-Transformer Model. International Journal of Experimental Research and Review, 46, 297–304. https://doi.org/10.52756/ijerr.2024.v46.023
