Early Identification of At-Risk Students in Virtual Learning
Environments Using Ensemble Machine Learning and
Behavioural Analytics
Ahmed Abd El-Badie Abd Allah Kamel1,∗
1Associate Professor of Computer Science and the Director of the Monitoring and Technical Support Unit at
the Measurement and Evaluation Center, Mansoura University, Egypt
Email: ahmed abdelbadie@mans.edu.eg
Abstract
Identifying students who are
at risk of academic failure or course withdrawal at an early stage of their enrolment remains one of the most
pressing challenges in higher and distance education. This study evaluates seven machine learning
classifiers, namely Logistic Regression, Decision Tree, Random Forest, Gradient Boosting Decision
Tree (GBDT), AdaBoost, Naive Bayes, and Multilayer Perceptron, for predicting student risk at an early
stage based on a behavioural and demographic dataset derived from the Open University Learning Analytics
Dataset (OULAD). The dataset contains 7895 student records from a single module, each described by eight
demographic features and eight Virtual Learning Environment (VLE) usage features. All classifiers
were evaluated through five-fold stratified cross-validation. The GBDT model achieved the best results
with an AUC-ROC of 0.782 (± 0.003), an accuracy of 0.708 (± 0.005), an F1 score of 0.729 (± 0.006),
and a recall of 0.769 (± 0.006). The analysis of feature importance showed that
late submission count (I = 0.304), total VLE clicks (I = 0.150), and first assessment score (I = 0.135)
are the three most important predictors, because they capture student engagement patterns that are
evident in the VLE traces institutions collect from students during their first module.
Educational institutions can use learning management system data to
implement effective ensemble methods that support timely teaching interventions without the cost of
collecting additional data. The article also outlines design considerations for building early warning
systems and for the ethical use of predictive analytics in education.
Keywords: Learning analytics; Student at-risk prediction; Gradient boosting; Ensemble machine learning;
Virtual learning environment; Educational data mining; Early warning systems
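
The evaluation protocol summarised in the abstract (five-fold stratified cross-validation, with each metric reported as a mean ± spread across folds) can be illustrated with a minimal, standard-library sketch. Everything here is an assumption made for illustration: the data are synthetic, the names `stratified_folds`, `binary_metrics`, `fit_threshold`, and `cross_validate` are hypothetical, a single-feature threshold rule stands in for the GBDT model, and the spread is computed as the sample standard deviation, which the abstract does not specify.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, recall, and F1 for binary labels (1 = at risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "recall": recall, "f1": f1}


def fit_threshold(xs, ys):
    """Stand-in model (not GBDT): pick the cut-off maximising training accuracy."""
    candidates = sorted(set(xs))
    return max(
        candidates,
        key=lambda c: sum((x >= c) == bool(y) for x, y in zip(xs, ys)),
    )


def cross_validate(xs, ys, k=5, seed=0):
    """Return {metric: (mean, sample std)} over k stratified folds."""
    per_fold = []
    for test_idx in stratified_folds(ys, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(ys)) if i not in held_out]
        cut = fit_threshold([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])
        preds = [int(xs[i] >= cut) for i in test_idx]
        per_fold.append(binary_metrics([ys[i] for i in test_idx], preds))
    summary = {}
    for metric in per_fold[0]:
        values = [fold[metric] for fold in per_fold]
        summary[metric] = (statistics.mean(values), statistics.stdev(values))
    return summary


# Synthetic illustration only: 120 at-risk and 80 not-at-risk students,
# with a made-up "late submission count" as the single feature.
rng = random.Random(42)
labels = [1] * 120 + [0] * 80
late_submissions = [rng.randint(2, 6) if y else rng.randint(0, 3) for y in labels]

results = cross_validate(late_submissions, labels, k=5, seed=0)
for metric, (mean, std) in results.items():
    print(f"{metric}: {mean:.3f} (± {std:.3f})")
```

Stratifying the folds keeps the at-risk proportion the same in every held-out split, so fold-to-fold variation in recall and F1 reflects the model rather than shifts in class balance. In practice a library implementation of stratified cross-validation and gradient boosting would replace the hand-written pieces above.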