<?xml version="1.0" encoding="UTF-8"?>
<journal>
 <journal_metadata>
  <full_title>Journal of Intelligent Systems and Internet of Things</full_title>
  <abbrev_title>JISIoT</abbrev_title>
  <issn media_type="print">2690-6791</issn>
  <issn media_type="electronic">2769-786X</issn>
  <doi_data>
   <doi>10.54216/JISIoT</doi>
   <resource>https://www.americaspg.com/journals/show/4175</resource>
  </doi_data>
 </journal_metadata>
 <journal_issue>
  <publication_date media_type="print">
   <year>2025</year>
  </publication_date>
  <publication_date media_type="online">
   <year>2025</year>
  </publication_date>
 </journal_issue>
 <journal_article publication_type="full_text">
  <titles>
   <title>Real-Time Gesture Recognition Using Attention-Based CNN-RNN Framework for Human-Robot Interaction</title>
  </titles>
  <contributors>
   <person_name sequence="first" contributor_role="author">
    <given_name>R.</given_name>
    <surname>R.</surname>
    <affiliation>Assistant Professor, School of Computer Science Engineering, SRM Institute of Science and Technology, Ramapuram, Chennai, Tamil Nadu, India</affiliation>
   </person_name>
   <person_name sequence="additional" contributor_role="author">
    <given_name>Chinnathambi</given_name>
    <surname>Kamatchi</surname>
    <affiliation>Assistant Professor, Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&amp;D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India</affiliation>
   </person_name>
   <person_name sequence="additional" contributor_role="author">
    <given_name>Y.</given_name>
    <surname>Dharshan</surname>
    <affiliation>Assistant Professor, Department of Electronics and Instrumentation Engineering, Sri Ramakrishna Engineering College, Coimbatore, Tamil Nadu, India</affiliation>
   </person_name>
   <person_name sequence="additional" contributor_role="author">
    <given_name>K.</given_name>
    <surname>Kowsalya</surname>
    <affiliation>Assistant Professor, Department of Electronics and Communication Engineering, Hindusthan Institute of Technology, Coimbatore, Tamil Nadu, India</affiliation>
   </person_name>
   <person_name sequence="additional" contributor_role="author">
    <given_name>R.</given_name>
    <surname>Vijay</surname>
    <affiliation>Assistant Professor, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (Deemed to be University), Andhra Pradesh, India</affiliation>
   </person_name>
   <person_name sequence="additional" contributor_role="author">
    <given_name>M.</given_name>
    <surname>Balakrishnan</surname>
    <affiliation>Professor, Department of Artificial Intelligence and Data Science, Dr. Mahalingam College of Engineering and Technology, Pollachi, Coimbatore, Tamil Nadu, India</affiliation>
   </person_name>
  </contributors>
  <jats:abstract xmlns:jats="http://www.ncbi.nlm.nih.gov/JATS1" xml:lang="en">
   <jats:p>Gesture recognition serves as a key enabler for natural and intuitive human–robot interaction (HRI) in smart automation and assistive systems. However, achieving real-time performance with high recognition accuracy remains a significant challenge due to dynamic background variations, occlusion, and complex spatio-temporal dependencies in gesture sequences. This paper presents a real-time attention-based CNN-RNN framework for robust gesture recognition and adaptive HRI in dynamic environments. The proposed system utilizes Convolutional Neural Networks (CNNs) for spatial feature extraction from sequential video frames and Bidirectional Recurrent Neural Networks (BiRNNs)—integrated with an attention mechanism—for modeling temporal dependencies and focusing on discriminative motion cues. The attention layer enhances interpretability by prioritizing salient gestures and reducing background noise. A hybrid optimization strategy, combining adaptive learning rate scheduling and regularized dropout, ensures computational stability and generalization across gesture datasets. Experiments conducted on benchmark datasets such as NVIDIA Dynamic Gesture (NvGesture) and ChaLearn IsoGD demonstrate superior performance, achieving an accuracy of 97.8% and a real-time inference speed of 34 FPS, outperforming baseline CNN, 3D-CNN, and LSTM architectures. The proposed framework effectively balances accuracy, latency, and interpretability, making it suitable for real-world HRI applications, including service robotics, industrial automation, and assistive technologies.</jats:p>
  </jats:abstract>
  <publication_date media_type="print">
   <year>2025</year>
  </publication_date>
  <publication_date media_type="online">
   <year>2025</year>
  </publication_date>
  <pages>
   <first_page>398</first_page>
   <last_page>408</last_page>
  </pages>
  <doi_data>
   <doi>10.54216/JISIoT.170128</doi>
   <resource>https://www.americaspg.com/articleinfo/18/show/4175</resource>
  </doi_data>
 </journal_article>
</journal>
