Machine Learning for At-Risk Student Identification in
Virtual Learning Environments: A Multi-Classifier Analysis
Using the Open University Learning Analytics Dataset
Emad Bashkail1,*, Nesrin Merhi2
1Food Industries Polytechnic, Al Kharj, KSA
2Jeddah International School, Jeddah, KSA
Emails: bashkail@gmail.com; Merhy81@yahoo.com
Abstract
Early detection of students who are likely to struggle academically or withdraw from their
studies gives universities a brief window in which to intervene effectively. This paper
presents a systematic evaluation of multiple machine learning classifiers on the Open
University Learning Analytics Dataset (OULAD), one of the most widely used public educational
datasets, which covers 32,593 students who studied 22 different courses through distance
learning. Four classifiers (logistic regression, decision tree, random forest, and gradient
boosting) are trained on a feature set that combines student demographic information with
clickstream-based engagement data from the virtual learning environment (VLE). The primary
finding is that VLE behavioral features are the most important predictors for Random Forest:
its top four features, which include total click volume, number of active VLE days, and
typical daily click volume, account for 92.8% of total feature importance, while demographic
information has less impact. Random Forest achieves the strongest held-out test performance
(AUC = 0.998, F1 = 0.978, accuracy = 98.2%), while Decision Tree scores lower (AUC = 0.959),
illustrating the performance cost of choosing a more interpretable model. At-risk students
record 75.8% fewer total VLE clicks than students who are not at risk, averaging 49.0 clicks
compared with 203.0 clicks (t = 104.0, p < 0.001). The paper documents the complete end-to-end
prediction pipeline, including the model evaluation framework and the dataset, so that future
researchers can reproduce the study. The results have direct implications for the design
of early-alert systems and the ethical deployment of predictive analytics in higher education.
Int. J. of AI and Education Technology (IJAIET) Vol. 04, No. 01, PP. 50–66, 2025
DOI: https://doi.org/10.54216/IJAIET.040105
Keywords: Learning analytics; Virtual learning environment; At-risk prediction; Random
forest; OULAD; Educational data mining; Student engagement; Early warning system
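
The multi-classifier comparison summarized in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn is available and uses a synthetic feature table from make_classification as a stand-in for the OULAD demographic and VLE features. The hyperparameters, the 80/20 split, and all variable names are illustrative assumptions and are not taken from the paper, so the printed scores will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature table (NOT OULAD data).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)

# Hold out a stratified test set so class balance is preserved (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The four classifier families named in the abstract; settings are assumptions.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    pred = model.predict(X_test)
    results[name] = {
        "auc": roc_auc_score(y_test, proba),
        "f1": f1_score(y_test, pred),
        "accuracy": accuracy_score(y_test, pred),
    }

for name, m in results.items():
    print(f"{name:20s} AUC={m['auc']:.3f} F1={m['f1']:.3f} acc={m['accuracy']:.3f}")

# Impurity-based importances from the fitted forest; they are normalized to sum to 1,
# so the share held by the top four features can be read off directly.
importances = models["Random Forest"].feature_importances_
top4 = sorted(enumerate(importances), key=lambda kv: kv[1], reverse=True)[:4]
top4_share = sum(value for _, value in top4)
print(f"Top-4 features hold {top4_share:.1%} of total importance")
```

Evaluating every model on the same held-out split keeps the comparison fair, and the top-four importance share computed at the end is the same kind of quantity as the 92.8% figure quoted in the abstract, here computed on synthetic features only.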