<?xml version="1.0"?>
<journal>
 <journal_metadata>
  <full_title>Journal of Intelligent Systems and Internet of Things</full_title>
  <abbrev_title>JISIoT</abbrev_title>
  <issn media_type="print">2690-6791</issn>
  <issn media_type="electronic">2769-786X</issn>
  <doi_data>
   <doi>10.54216/JISIoT</doi>
   <resource>https://www.americaspg.com/journals/show/4054</resource>
  </doi_data>
 </journal_metadata>
 <journal_issue>
  <publication_date media_type="print">
   <year>2019</year>
  </publication_date>
  <publication_date media_type="online">
   <year>2019</year>
  </publication_date>
 </journal_issue>
 <journal_article publication_type="full_text">
  <titles>
   <title>Advanced Deep Learning Model for Image Captioning Using Customized Vision Transformer with Global Optimization Algorithm</title>
  </titles>
  <contributors>
   <organization sequence="first" contributor_role="author">Department of Computer Engineering, College of Computer Engineering &amp; Sciences, Prince Sattam bin Abdulaziz University, Alkharj-11942, Saudi Arabia</organization>
   <person_name sequence="first" contributor_role="author">
    <given_name>Suleman</given_name>
    <surname>Suleman</surname>
   </person_name>
   <organization sequence="first" contributor_role="author">Department of Computer Engineering, College of Computer Engineering &amp; Sciences, Prince Sattam bin Abdulaziz University, Alkharj-11942, Saudi Arabia</organization>
   <person_name sequence="additional" contributor_role="author">
    <given_name>Mohammed Altaf</given_name>
    <surname>Ahmed</surname>
   </person_name>
  </contributors>
  <jats:abstract xml:lang="en">
   <jats:p>In image captioning, the quality of the generated captions is vital for the effective communication of visual content. Image captioning is a core task at the intersection of computer vision (CV) and natural language processing (NLP) that aims to generate descriptive captions for images. It is a two-fold process that depends on accurate image understanding and appropriate language generation, both semantically and syntactically. Keeping up with current research and results in image captioning is increasingly challenging owing to the growing body of knowledge on the topic. This study examines deep learning (DL) approaches to the challenges faced by individuals with visual impairments, aiming to improve their visual perception through advanced technologies. Traditionally, the visually impaired have relied on physical assistance and adaptive aids to understand and navigate visual content. With the advent of DL, there is a unique opportunity to transform this landscape. In this paper, we present an Advanced Deep Learning Model for Image Captioning Using a Customized Vision Transformer with a Global Optimization Algorithm (ADLIC-CTGOA). The main aim of the ADLIC-CTGOA model is to generate effective textual captions for an input image. Initially, the ADLIC-CTGOA method applies a preprocessing phase that enhances both image and text data: images undergo noise removal and contrast enhancement to improve quality, while text is processed by removing numbers, converting to lowercase, and applying text vectorization. Next, a customized Swin Transformer is employed for feature extraction to capture fine-grained visual features from the images. In addition, the BERT transformer model is deployed for the caption-generation process. Finally, the chaotic Aquila optimization (CAO) technique is applied for parameter tuning to further enhance the performance of the proposed technique. A wide range of simulation studies is executed to validate the improved performance of the ADLIC-CTGOA system. The comparative analysis demonstrates the superiority of the ADLIC-CTGOA model over recent approaches in terms of different evaluation measures.</jats:p>
  </jats:abstract>
  <publication_date media_type="print">
   <year>2026</year>
  </publication_date>
  <publication_date media_type="online">
   <year>2026</year>
  </publication_date>
  <pages>
   <first_page>273</first_page>
   <last_page>289</last_page>
  </pages>
  <doi_data>
   <doi>10.54216/JISIoT.180219</doi>
   <resource>https://www.americaspg.com/articleinfo/18/show/4054</resource>
  </doi_data>
 </journal_article>
</journal>
