An interpretable framework for task response in IELTS writing: integrating thematic analysis, argumentation dimensions, and GPT-4 for educational feedback.

Bibliographic Details
Authors: Li, Weiqiang (AUTHOR) p115134@siswa.ukm.edu.my; Sulaiman, Nur Ainil (AUTHOR) nurainil@ukm.edu.my; Mohd Matore, Mohd Effendi Ewan (AUTHOR) effendi@ukm.edu.my; Ke, Xin (AUTHOR) p120530@siswa.ukm.edu.my
Source: Language Testing in Asia. 3/12/2026, Vol. 16 Issue 1, p1-24. 24p.
Subject Terms: *International English Language Testing System, *Formative evaluation, *Psychological feedback, Generative pre-trained transformers, Thematic analysis, Essays, Text mining
Abstract: The assessment of Task Response (TR) in high-stakes exams like IELTS is plagued by scorer subjectivity, conceptual vagueness, and limited pedagogical utility, which Automated Writing Evaluation (AWE) tools, including "black-box" Generative AI, have not consistently addressed in formative contexts. This challenge is further compounded by variable levels of feedback literacy among learners. To bridge these gaps, this exploratory, proof-of-concept study develops an interpretable, five-dimensional TR framework (Coverage, Opinion, Claim, Grounds, Penalty) through thematic analysis of 28 official IELTS essays (Bands 5.0–7.0). The proposed framework demonstrates promising inter-rater reliability (Cohen's kappa = 0.827) and strong internal content validity (S-CVI = 0.92) within the context of official descriptors. Preliminary results from Kruskal-Wallis (p < 0.001) and Spearman correlations (r > 0.90 for Claim and Grounds) suggest its potential to discriminate between proficiency levels. By mapping this framework onto GPT-4 via optimized prompts, the study shifts AI's role from an opaque scorer (Mean Absolute Error, MAE = 0.54; correlation with human scores, r = 0.39) toward a structured diagnostic feedback tool (MAE = 0.41; r = 0.61). While these results represent a promising proof-of-concept for formative use, the levels of agreement do not yet meet the requirements for high-stakes summative assessment. Given the limited sample size and exploratory design, these findings should be interpreted cautiously as preliminary evidence for framework viability. The study highlights the framework's potential to enhance pedagogical transparency and support self-regulation through ZPD-aligned interventions. Future research is needed to validate the framework in larger and more diverse samples and to examine its practical impact in classroom settings. [ABSTRACT FROM AUTHOR]
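Note on the agreement statistics: the abstract reports Mean Absolute Error (MAE) between GPT-4 and human band scores, and Cohen's kappa between raters. The sketch below shows how these two statistics are conventionally computed. It is a minimal illustration using the standard textbook definitions, not the authors' analysis code; the function names and all sample values are hypothetical and are not data from the study.

```python
from collections import Counter


def mean_absolute_error(human_scores, model_scores):
    """Average absolute gap between paired scores (standard MAE definition)."""
    if len(human_scores) != len(model_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    gaps = [abs(h - m) for h, m in zip(human_scores, model_scores)]
    return sum(gaps) / len(gaps)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical band scores and codes, invented purely for illustration.
example_human = [5.0, 6.0, 6.5, 7.0]
example_model = [5.5, 6.0, 6.0, 7.5]
example_rater_a = ["yes", "yes", "no", "no", "yes", "no"]
example_rater_b = ["yes", "no", "no", "no", "yes", "no"]

if __name__ == "__main__":
    print("MAE:", mean_absolute_error(example_human, example_model))
    print("kappa:", round(cohens_kappa(example_rater_a, example_rater_b), 3))
```

A lower MAE means the model's band scores sit closer to the human scores on average, while kappa corrects raw rater agreement for the agreement expected by chance.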
Copyright of Language Testing in Asia is the property of Springer Nature.
Database: Education Research Complete
ISSN: 2229-0443
DOI: 10.1186/s40468-026-00437-5