Early Detection of Student Dropout Risk in Higher
Education through Optimized Machine Learning
Aa Hubur1,∗, Aygul Z. Ibatova2
1Universitas Trisakti, Jakarta, Indonesia
2Tyumen Industrial University, Russia
Emails: aa.hubur@trisakti.ac.id; aigoul@rambler.ru
Abstract
Student dropout in higher education is a critical problem that imposes academic and financial costs on
individual students, institutions, and entire countries. Research on student retention is important because
it enables institutions to provide appropriate interventions. The present
study compares five machine learning classifiers (Linear Discriminant Analysis, K-Nearest Neighbours,
Support Vector Machine, Random Forest, and Gradient Boosting) on data for 4424 students drawn from the
Realinho et al. (2022) data set, which contains demographic, socioeconomic, macroeconomic, and academic
performance data collected at a Portuguese higher education institution over a decade. A mutual information
feature selection step reduces the 22-dimensional feature space before model training, retaining the 12
features with the highest discriminative power.
Five-fold stratified cross-validation shows that the best overall performance is achieved by an SVM with a
radial basis function kernel (accuracy 97.1%, F1 score 0.954), and all five models achieve an AUC greater
than 0.981. Feature importance analysis reveals that four measures of academic success from the first two
semesters together account for 87.6% of the signal that the Random Forest model uses for prediction. The
most important predictor is the number of curricular units that the student passes during the second
semester (importance = 0.335). The combined impact of all socioeconomic, demographic, and macroeconomic factors
is less than 13%. The findings yield three empirically grounded implications about risk factors in student
retention.
Keywords: Student dropout prediction; Machine learning; Educational data mining; Mutual information feature
selection; Higher education analytics; Support vector machine; Random Forest; Early warning systems
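
The workflow summarized in the abstract (mutual information selection of 12 out of 22 features, five classifiers compared under five-fold stratified cross-validation, and a Random Forest importance analysis) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. Several things here are assumptions: the data are a synthetic stand-in generated with make_classification rather than the Realinho et al. (2022) data set, the dropout label is treated as binary, all hyperparameters are library defaults, features are standardized, and feature selection is placed inside the cross-validation pipeline. The helper name mi_score is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption): 22 features and a binary dropout label.
X, y = make_classification(
    n_samples=400, n_features=22, n_informative=8, random_state=0
)


def mi_score(X, y):
    """Mutual information between each feature and the label (fixed seed)."""
    return mutual_info_classif(X, y, random_state=0)


# The five classifier families named in the abstract, with default settings.
classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "SVM-RBF": SVC(kernel="rbf"),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ("accuracy", "f1", "roc_auc")

results = {}
for name, clf in classifiers.items():
    # Scaling and selection sit inside the pipeline, so each fold selects
    # its 12 features from its own training split only.
    pipe = make_pipeline(StandardScaler(), SelectKBest(mi_score, k=12), clf)
    scores = cross_validate(pipe, X, y, cv=cv, scoring=list(metrics))
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

for name, row in results.items():
    print(name, {m: round(v, 3) for m, v in row.items()})

# Random Forest importance analysis on the 12 selected features:
# share of total importance carried by the four strongest predictors.
rf_pipe = make_pipeline(
    SelectKBest(mi_score, k=12),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
rf_pipe.fit(X, y)
importances = rf_pipe[-1].feature_importances_
top4_share = np.sort(importances)[::-1][:4].sum()
print("Top-4 importance share:", round(top4_share, 3))
```

Running feature selection inside each fold is a design choice of this sketch that avoids leaking held-out labels into the selection step; the abstract does not state whether the study selected features before or within cross-validation. The numbers printed for synthetic data are not comparable to the results reported above.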