Volume 21, Issue 2, pp. 70-92, 2026 | Full Length Article
Yousef A. Alsamaani 1, Murad A. Rassam 2*
DOI: https://doi.org/10.54216/FPA.210205
Generative AI has advanced rapidly in recent years, and this progress has accelerated the development of deepfake techniques that can be misused for harmful purposes, making it essential for detection methods to keep pace. In this paper, we present an explainable weighted-average fusion deepfake detection system that combines a Vision Transformer (ViT) and InceptionResNetV1 to improve classification accuracy, and we employ LIME and Grad-CAM++ to provide interpretability for the model's decisions. ViT uses self-attention modules to extract global features, whereas InceptionResNetV1 employs convolutional layers to extract spatial features. Grad-CAM++ highlights the regions that most influence classification, and LIME examines regional contributions; together, these tools offer a deeper understanding of the model's decision-making process and improve its transparency and reliability. Our fusion technique combines the outputs of both models by assigning weights that users can adjust interactively through the user interface. The performance of the fusion strategy is evaluated using accuracy, precision, recall, and F1-score. The proposed model achieves a classification accuracy of 99.19%, surpassing both ViT and InceptionResNetV1 evaluated individually. To the best of our knowledge, this is the first deepfake detection model that combines a Vision Transformer and InceptionResNetV1 through weighted-average fusion with dual explainability techniques.
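The weighted-average fusion described above can be sketched as follows; this is a minimal illustration, not the authors' implementation — the function name, the example class probabilities, and the weight value of 0.7 are all assumptions for demonstration:

```python
import numpy as np

def fuse_predictions(p_vit: np.ndarray, p_cnn: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted average of two class-probability vectors.

    w is the weight given to the ViT branch; (1 - w) goes to the CNN branch.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * p_vit + (1.0 - w) * p_cnn

# Example: two-class (real, fake) probabilities from each branch.
p_vit = np.array([0.20, 0.80])   # ViT leans "fake"
p_cnn = np.array([0.40, 0.60])   # CNN agrees, but less confidently
fused = fuse_predictions(p_vit, p_cnn, w=0.7)   # -> [0.26, 0.74]
label = ["real", "fake"][int(np.argmax(fused))]  # -> "fake"
```

Because each input is a valid probability distribution, any convex combination of the two is as well, so the fused vector can be thresholded or argmax'd directly; adjusting w in the interface simply slides the decision between the two branches.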
Deepfake Detection, Machine Learning, Deep Learning, Detection Framework, Explainable Artificial Intelligence (XAI)