ChatGPT-4o as an Automated Scoring Tool for Writing Assessment: Strengths and Weaknesses
| Title: | ChatGPT-4o as an Automated Scoring Tool for Writing Assessment: Strengths and Weaknesses |
|---|---|
| Language: | English |
| Authors: | Rabia Damla Karaçeper (ORCID 0000-0001-6155-9973); Gülay Kiray (ORCID 0000-0003-2045-8636) |
| Source: | International Journal of Assessment Tools in Education. 2026 13(1):66-94. |
| Availability: | International Journal of Assessment Tools in Education. Pamukkale University, Faculty of Education, Kinikli Campus, Denizli 20070, Turkey. e-mail: ijate.editor@gmail.com; Web site: https://dergipark.org.tr/en/pub/ijate |
| Peer Reviewed: | Y |
| Page Count: | 29 |
| Publication Date: | 2026 |
| Document Type: | Journal Articles; Reports - Research; Tests/Questionnaires |
| Education Level: | Higher Education; Postsecondary Education; Secondary Education |
| Descriptors: | Automation, Scoring, Writing Evaluation, Technology Uses in Education, Artificial Intelligence, Evaluation Methods, English (Second Language), Second Language Learning, Persuasive Discourse, Essays, College Preparation, Second Language Instruction, Public Colleges, Foreign Countries, Attitudes, Barriers |
| Geographic Terms: | Turkey |
| ISSN: | 2148-7456 |
| Abstract: | ChatGPT is widely used for many educational purposes such as content generation and language translation; however, its role as an automated scoring tool requires further empirical investigation. This mixed-method study explores the effectiveness of ChatGPT-4o as an automated scoring tool for English as a Foreign Language (EFL) learners' written output. It particularly aims to discover to what extent ChatGPT-4o can produce reliable and accurate scores in writing assessment and whether it can serve as an alternative scoring tool to traditional human scoring. A total of 240 argumentative essays were first scored by 13 human raters working in pairs. Of these, 28 were selected as model essays, while the remaining 212 essays were thereafter scored by ChatGPT-4o only. Quantitative analysis employed the Quadratic Weighted Kappa statistic to measure inter-rater reliability, focusing on the agreement among human raters and ChatGPT-4o. Findings suggest that ChatGPT-4o demonstrates only fair agreement with human raters, producing significantly lower and inconsistent scores. To examine this discrepancy, five experienced human raters were interviewed about the strengths and weaknesses of ChatGPT as a scoring tool, with their perspectives and practices thematically analyzed to triangulate the quantitative findings. The key differences were classified under themes such as rubric adherence, scoring bias and sensitivity to nuances. Due to AI-enabled automation, ChatGPT exhibits pragmatic dualities in practicality, feedback provision and linguistic capacity. The notable strengths involve less manual effort, faster detailed scoring feedback and a broader linguistic dataset. However, human-driven optimization through constant supervision, care and pedagogical expertise is essential for more nuanced scoring. |
| Abstractor: | As Provided |
| Entry Date: | 2026 |
| Accession Number: | EJ1495977 |
| Database: | ERIC |
| Full Text: | Full Text from ERIC: https://eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=EJ1495977 |
|---|---|
| PLink | https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1495977 |
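
The abstract names the Quadratic Weighted Kappa statistic as the measure of agreement between human raters and ChatGPT-4o. For readers unfamiliar with it, the following is a minimal sketch of the standard formula in plain Python, not the authors' analysis code. The function name, the integer score scale and the sample scores are hypothetical; the record does not give the study's rubric scale or data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa for two lists of integer scores.

    Illustrative only: the score range is an assumption supplied by the
    caller, not a value taken from the study described above.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of essays")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("the score scale needs at least two categories")
    n = len(rater_a)

    # Observed agreement matrix: observed[i][j] counts essays that
    # rater A placed in category i and rater B placed in category j.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score distributions of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Quadratic weight: disagreement is penalized by squared distance.
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Count expected by chance if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on an assumed 1-5 scale.
human_scores = [4, 3, 5, 2, 4, 3]
model_scores = [3, 3, 4, 2, 3, 2]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 1, 5)
print(f"QWK = {kappa:.3f}")
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the weights grow with the squared distance between scores, a rater that is consistently one point lower is penalized far less than one that is erratic.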