An Explainable AI Fusion-Based Model for Enhanced Deepfake Detection Using Vision Transformer and InceptionResNetV1

Journal: Fusion: Practice and Applications (FPA)
ISSN: 2692-4048, 2770-0070
Journal DOI: 10.54216/FPA
Journal URL: https://www.americaspg.com/journals/show/4048

Authors: Murad Murad; Murad A. Rassam
Affiliation: Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia

Generative AI has advanced rapidly over the past few years, and this progress has accelerated the development of deepfake techniques, which can be used for harmful purposes. Detection methods must keep pace with these advances. In this paper, we present an explainable weighted-average fusion deepfake detection system that combines Vision Transformer (ViT) and InceptionResNetV1 to improve classification accuracy. ViT uses self-attention modules to extract features, whereas InceptionResNetV1 uses convolutional layers to extract spatial features. Our fusion technique combines the outputs of both models using weights that users can adjust interactively through the user interface. We also employ LIME and Grad-CAM++ to make the model's decisions interpretable: Grad-CAM++ highlights the image regions that most influence the classification, and LIME examines regional contributions. Together, these tools give a clearer view of how the model classifies, which improves the transparency and reliability of the system. We evaluate the fusion strategy using accuracy, precision, recall, and F1-score. The proposed model achieves a classification accuracy of 99.19%, surpassing both ViT and InceptionResNetV1 when each is evaluated individually.
To the best of our knowledge, this is the first deepfake detection model that combines Vision Transformer (ViT) and InceptionResNetV1 using a weighted-average fusion approach with dual explainability techniques.

Published: 2026, pp. 70-92
DOI: 10.54216/FPA.210205
URL: https://www.americaspg.com/articleinfo/3/show/4048
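
The weighted-average fusion and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each model outputs a per-image probability that the image is fake, that the two weights sum to one, and that a fixed decision threshold of 0.5 is applied to the fused score. None of these details is stated in the abstract. The function names (`fuse_probabilities`, `classify`, `binary_metrics`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def fuse_probabilities(p_vit: float, p_cnn: float, w_vit: float = 0.5) -> float:
    """Weighted average of two 'fake' probabilities.

    Assumption: the weights are w_vit and (1 - w_vit), so they sum to one.
    p_vit stands in for the ViT output, p_cnn for the InceptionResNetV1 output.
    """
    if not 0.0 <= w_vit <= 1.0:
        raise ValueError("w_vit must lie in [0, 1]")
    return w_vit * p_vit + (1.0 - w_vit) * p_cnn


def classify(p_fake: float, threshold: float = 0.5) -> int:
    """Return 1 (fake) if the fused probability reaches the threshold, else 0 (real)."""
    return 1 if p_fake >= threshold else 0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score with 'fake' (1) as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against empty denominators by reporting 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up scores for four images, purely to exercise the functions.
    vit_scores = [0.92, 0.40, 0.15, 0.30]
    cnn_scores = [0.80, 0.70, 0.05, 0.20]
    labels = [1, 1, 0, 0]

    fused = [fuse_probabilities(v, c, w_vit=0.6) for v, c in zip(vit_scores, cnn_scores)]
    predictions = [classify(p) for p in fused]
    print(binary_metrics(labels, predictions))
```

Changing `w_vit` in this sketch plays the role of the interactively adjustable weight described in the abstract: a value near 1 lets the ViT score dominate, and a value near 0 lets the InceptionResNetV1 score dominate.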