ChatGPT-4o as an Automated Scoring Tool for Writing Assessment: Strengths and Weaknesses

Bibliographic Details
Title: ChatGPT-4o as an Automated Scoring Tool for Writing Assessment: Strengths and Weaknesses
Language: English
Authors: Rabia Damla Karaçeper (ORCID 0000-0001-6155-9973), Gülay Kiray (ORCID 0000-0003-2045-8636)
Source: International Journal of Assessment Tools in Education. 2026 13(1):66-94.
Availability: International Journal of Assessment Tools in Education. Pamukkale University, Faculty of Education, Kinikli Campus, Denizli 20070, Turkey. e-mail: ijate.editor@gmail.com; Web site: https://dergipark.org.tr/en/pub/ijate
Peer Reviewed: Y
Page Count: 29
Publication Date: 2026
Document Type: Journal Articles, Reports - Research, Tests/Questionnaires
Education Level: Higher Education, Postsecondary Education, Secondary Education
Descriptors: Automation, Scoring, Writing Evaluation, Technology Uses in Education, Artificial Intelligence, Evaluation Methods, English (Second Language), Second Language Learning, Persuasive Discourse, Essays, College Preparation, Second Language Instruction, Public Colleges, Foreign Countries, Attitudes, Barriers
Geographic Terms: Turkey
ISSN: 2148-7456
Abstract: ChatGPT is widely used for many educational purposes such as content generation and language translation; however, its role as an automated scoring tool requires further empirical investigation. This mixed-method study explores the effectiveness of ChatGPT-4o as an automated scoring tool for English as a Foreign Language (EFL) learners' written output. In particular, it aims to discover to what extent ChatGPT-4o can produce reliable and accurate scores in writing assessment and whether it can serve as an alternative to traditional human scoring. A total of 240 argumentative essays were first scored by 13 human raters working in pairs. Of these, 28 were selected as model essays, while the remaining 212 essays were thereafter scored by ChatGPT-4o only. Quantitative analysis employed the Quadratic Weighted Kappa statistic to measure inter-rater reliability, focusing on the agreement between human raters and ChatGPT-4o. Findings suggest that ChatGPT-4o demonstrates only fair agreement with human raters, producing significantly lower and inconsistent scores. To examine this discrepancy, five experienced human raters were interviewed about the strengths and weaknesses of ChatGPT as a scoring tool, and their perspectives and practices were thematically analyzed to triangulate the quantitative findings. The key differences were classified under themes such as rubric adherence, scoring bias, and sensitivity to nuances. Due to AI-enabled automation, ChatGPT exhibits pragmatic dualities in practicality, feedback provision, and linguistic capacity. Its notable strengths include less manual effort, faster and more detailed scoring feedback, and a broader linguistic dataset. However, human-driven optimization through constant supervision, care, and pedagogical expertise is essential for more nuanced scoring.
Abstractor: As Provided
Entry Date: 2026
Accession Number: EJ1495977
Database: ERIC
ERIC Full Text: https://eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=EJ1495977
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1495977
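Note on the agreement statistic: the abstract names the Quadratic Weighted Kappa (QWK) as its measure of inter-rater reliability, but this record does not show how it is computed. The following is a minimal sketch of the standard QWK formula in plain Python, not the authors' actual analysis. The function name, the integer score scale, and the sample scores are illustrative assumptions and are not data from the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic Weighted Kappa for two lists of integer scores.

    Hypothetical helper for illustration; scores are assumed to be
    integers on the closed scale [min_score, max_score].
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of scores")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score points")

    k = max_score - min_score + 1   # number of score categories
    n = len(rater_a)                # number of essays scored by both raters

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: disagreements are penalized by squared distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Count expected in this cell if the raters were independent.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on an assumed 1-5 scale, for illustration only.
human_scores = [3, 4, 2, 5, 4, 3]
model_scores = [2, 4, 2, 4, 3, 3]
print(round(quadratic_weighted_kappa(human_scores, model_scores, 1, 5), 3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the weights grow with the squared distance between the two scores, a two-point gap counts four times as heavily as a one-point gap.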