Analytic Trait Scoring of English Language Learner
Essays Using Fine-Tuned DeBERTa-v3 on the ELLIPSE
Corpus
Andino Maseleno1,∗, Meinhaj Hussain2
1Institut Bakti Nusantara, Lampung, Indonesia
2Rennier University, Ireland
Emails: andino.maseleno@ibnus.ac.id; meinhaj@rennier.online
Abstract
Automated Essay Scoring (AES) has been researched extensively for holistic, topic-specific
scoring, but its use for predicting multiple analytic writing-quality traits in English Language Learner (ELL)
student essays has received less attention. This study addresses that gap by systematically
investigating multi-trait AES on the ELLIPSE corpus (Learning Agency Lab, 2022), a publicly accessible
dataset of 6,482 argumentative essays written by ELLs in grades 8-12 and scored by human raters on six analytic
traits: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. We compare five
models: Ridge regression, Support Vector Regression (SVR) with a radial basis function (RBF) kernel,
Random Forest, fine-tuned BERT-base-uncased, and fine-tuned DeBERTa-v3-base. DeBERTa-v3 achieves the highest mean
Quadratic Weighted Kappa (QWK) across the six traits (0.726), a 26.5-point improvement
over the Ridge baseline (0.461) and a 6-point improvement over BERT (0.666). Phraseology is the most
difficult trait to score automatically (DeBERTa QWK = 0.701) and cohesion the easiest (DeBERTa QWK
= 0.742). Analysis of inter-trait correlations reveals high co-variation between vocabulary and phraseology
(r = 0.79), which may reflect shared linguistic skills that multi-task learning could exploit. This
study establishes a replicable baseline for multi-trait AES on the ELLIPSE corpus and suggests that phraseology
scoring is the most pressing target for future architectural innovation.
Keywords: Automated essay scoring; English language learners; Multi-trait assessment; DeBERTa; Fine-tuning;
ELLIPSE corpus; Quadratic weighted kappa; Analytic writing rubric
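
Since the headline results above are reported as mean QWK across the six traits, a minimal sketch of how such a figure could be computed may be useful. It is an illustration under stated assumptions, not the authors' evaluation code: it assumes trait scores lie on a 1.0 to 5.0 scale in 0.5 steps and that continuous model outputs are clipped and rounded to the nearest band before agreement is measured. The function names (to_band, quadratic_weighted_kappa, mean_qwk) and the toy scores are hypothetical.

```python
# Illustrative sketch only; helper names and the toy data are hypothetical.
TRAITS = ["cohesion", "syntax", "vocabulary",
          "phraseology", "grammar", "conventions"]


def to_band(score, low=1.0, high=5.0, step=0.5):
    """Clip a score to [low, high] and map it to an integer band index.

    Assumption: scores use a 1.0-5.0 scale in 0.5 steps (9 bands, 0..8).
    """
    clipped = min(max(score, low), high)
    return round((clipped - low) / step)


def quadratic_weighted_kappa(y_true, y_pred, n_bands):
    """QWK between two equal-length sequences of band indices in 0..n_bands-1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed confusion matrix: rows are human bands, columns are predictions.
    observed = [[0] * n_bands for _ in range(n_bands)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_bands))
                 for j in range(n_bands)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_bands):
        for j in range(n_bands):
            weight = (i - j) ** 2 / (n_bands - 1) ** 2
            num += weight * observed[i][j]
            den += weight * true_hist[i] * pred_hist[j] / n
    # den is zero only when both raters use one identical band throughout.
    return 1.0 - num / den if den else 1.0


def mean_qwk(human, predicted, n_bands=9):
    """Return (per-trait QWK dict, mean QWK) for dicts mapping trait -> scores."""
    per_trait = {}
    for trait in TRAITS:
        y_true = [to_band(s) for s in human[trait]]
        y_pred = [to_band(s) for s in predicted[trait]]
        per_trait[trait] = quadratic_weighted_kappa(y_true, y_pred, n_bands)
    return per_trait, sum(per_trait.values()) / len(per_trait)


if __name__ == "__main__":
    # Toy scores invented for this example; they are not ELLIPSE data.
    human = {t: [2.0, 3.0, 3.5, 4.0, 2.5] for t in TRAITS}
    predicted = {t: [2.2, 2.9, 3.1, 4.3, 2.4] for t in TRAITS}
    per_trait, average = mean_qwk(human, predicted)
    for trait, kappa in per_trait.items():
        print(f"{trait:12s} QWK = {kappa:.3f}")
    print(f"mean QWK = {average:.3f}")
```

The quadratic weight penalizes a prediction two bands away from the human score four times as heavily as a prediction one band away, which is why QWK suits ordinal rubric scores better than plain accuracy.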