Image Captioning with Convolutional Neural Networks and Autoencoder-Transformer Model
DOI: https://doi.org/10.52756/ijerr.2024.v46.023

Keywords: Image Captioning, Deep Learning, Transformers, Autoencoders, Convolutional Neural Networks, Machine Learning

Abstract
This study deals with emerging machine learning technologies, deep learning, and Transformers with encoder-decoder mechanisms for image captioning. It provides in-depth, detailed information about the methodologies, algorithms, and procedures involved in the task of captioning images, and it explores and implements the most efficient technologies for producing relevant captions. The aim of this research is a detailed understanding of image captioning with Transformers and convolutional neural networks, achieved through the various algorithms available for the task. The methods and utilities used in this study include predefined CNN models, the COCO dataset, Transformers (a BERT encoder and a GPT decoder), and machine learning algorithms for visualization and analysis of model performance, contributing to advances in the accuracy and effectiveness of image captioning models and technologies. The evaluation and comparison of metrics applied to the generated captions indicate the model's performance.
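To make the described pipeline concrete, below is a minimal sketch, not the authors' implementation: a pretrained CNN encodes an image into spatial feature vectors, and a Transformer decoder attends to them to predict caption tokens. A generic Transformer decoder stands in for the paper's BERT-encoder/GPT-decoder pairing, and the vocabulary size, embedding width, and toy forward pass are illustrative assumptions rather than values reported in the study.

```python
# Minimal CNN-encoder + Transformer-decoder captioning sketch (assumed setup,
# not the paper's exact architecture or hyperparameters).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Pretrained CNN backbone; its spatial features serve as the decoder's memory."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # downloads weights on first use
        self.features = nn.Sequential(*list(backbone.children())[:-2])       # drop avgpool + fc
        self.project = nn.Linear(2048, embed_dim)                             # map to decoder width

    def forward(self, images):                     # images: (B, 3, 224, 224)
        f = self.features(images)                  # (B, 2048, 7, 7)
        f = f.flatten(2).permute(0, 2, 1)          # (B, 49, 2048): 49 spatial tokens
        return self.project(f)                     # (B, 49, embed_dim)

class CaptionDecoder(nn.Module):
    """Transformer decoder that attends to image features and predicts the next caption token."""
    def __init__(self, vocab_size=10000, embed_dim=512, heads=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, memory):             # tokens: (B, T), memory: (B, 49, D)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.out(x)                         # (B, T, vocab_size) next-token logits

# Toy forward pass: one random "image" and a two-token caption prefix.
encoder, decoder = CNNEncoder(), CaptionDecoder()
memory = encoder(torch.randn(1, 3, 224, 224))
logits = decoder(torch.tensor([[1, 42]]), memory)  # trained with cross-entropy against reference captions
print(logits.shape)                                # torch.Size([1, 2, 10000])
```

In a full experiment, captions generated this way would be scored against the COCO reference captions with standard metrics such as BLEU, METEOR, ROUGE, and CIDEr, which is the evaluation and comparison step the abstract refers to.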
License
Copyright (c) 2024 International Academic Publishing House (IAPH)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.