ChatGPT as an Assessment Design Tool in Higher Education:
Evaluating Item Quality, Bloom’s Taxonomy Coverage, and
Faculty Acceptance Across Academic Disciplines
Nadia Iftikhar¹,* · Rabia Muslu²
¹ UCAM Catholic University of Murcia, Spain
² Amity University Dubai, UAE
Emails: nadiaift@gmail.com · rabiamuslu.uae@gmail.com
Received: November 04, 2025 · Revised: December 11, 2025 · Accepted: January 18, 2026
* Corresponding author
ABSTRACT
The emergence of large language models capable of generating coherent, contextually grounded text at scale has
created a new and contested tool for higher education assessment design: instructors can now produce examination
questions, assignment prompts, and feedback rubrics in seconds rather than hours. Whether the items produced
by these systems meet the quality standards required for valid, reliable, and pedagogically appropriate higher
education assessment is an empirical question that the literature has only partially addressed. This paper reports
a three-study investigation of ChatGPT as an assessment design tool in higher education, covering item quality,
cognitive level coverage, student performance, and faculty acceptance. Study 1 presents an expert-panel evaluation of
360 assessment items—180 generated by ChatGPT and 180 created by experienced instructors—across six academic
disciplines and four item types, rated on seven quality dimensions including content accuracy, Bloom’s taxonomy
alignment, linguistic clarity, and originality. Study 2 reports a faculty survey of 186 instructors examining adoption
rates, perceived benefits, concerns, and the predictors of acceptance. Study 3 compares the performance of 412
students on counterbalanced ChatGPT-generated and instructor-created assessment items. ChatGPT-generated items
score significantly below instructor-created items on Bloom’s taxonomy alignment and originality, but perform
comparably or better on linguistic clarity and difficulty calibration. Student performance is modestly but significantly
higher on ChatGPT-generated items, a finding that challenges simple assumptions about AI-generated assessment
difficulty. Academic integrity and higher-order cognitive coverage are the dominant faculty concerns,
while time savings, averaging a 77% reduction in item-writing time, are the most consistently cited benefit. The
paper contributes a validated multi-dimensional item quality framework, a faculty acceptance model, and eight
evidence-based guidelines for the responsible integration of ChatGPT in assessment design workflows.
Keywords: ChatGPT · Assessment design · Exam questions · Bloom’s taxonomy · Item quality · Faculty acceptance ·
Generative AI · Higher education · Artificial intelligence in education
1. INTRODUCTION
Assessment design is among the most time-intensive components
of university teaching. Creating a single well-constructed
multiple-choice question that targets a specific
cognitive level, avoids item-writing flaws, and adequately
discriminates between students can take an experienced instructor
15–20 minutes [1]. Scaling that effort across a full