ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring. EdWorkingPaper No. 25-1335

Bibliographic Details
Title:	ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring. EdWorkingPaper No. 25-1335
Language:	English
Authors:	Youngwon Kim, Reagan Mozer, Shireen Al-Adeimi, Luke Miratrix, Annenberg Institute for School Reform at Brown University
Source:	Annenberg Institute for School Reform at Brown University. 2025.
Availability:	Annenberg Institute for School Reform at Brown University. Brown University Box 1985, Providence, RI 02912. Tel: 401-863-7990; Fax: 401-863-1290; e-mail: annenberg@brown.edu; Web site: https://annenberg.brown.edu/
Peer Reviewed:	N
Page Count:	28
Publication Date:	2025
Sponsoring Agency:	Institute of Education Sciences (ED)
Contract Number:	R305D220032
Document Type:	Reports - Research
Education Level:	Elementary Education, Grade 4, Intermediate Grades, Grade 5, Middle Schools, Grade 6, Grade 7, Junior High Schools, Secondary Education, Grade 10, High Schools
Descriptors:	Automation, Essays, Writing Evaluation, Scoring, Artificial Intelligence, Natural Language Processing, Algorithms, Grade 4, Grade 5, Grade 6, Grade 7, Persuasive Discourse, Grade 10, Classification, Prediction, Performance, Expository Writing, Evaluation Methods, Technology Uses in Education, Writing Skills
Abstract:	Automated Essay Scoring (AES) is a critical tool in education that aims to enhance the efficiency and objectivity of educational assessments. Recent advancements in Large Language Models (LLMs), such as ChatGPT, have sparked interest in their potential for AES. However, comprehensive comparisons of LLM-based methods with traditional machine learning (ML) methods across different assessment contexts remain limited. This study compares the efficacy of LLMs with supervised ML algorithms in assessing both categorical essay opinions and continuous writing quality scores. Using two distinct datasets--argumentative essays from 4th-7th graders about iPad usage in schools, and persuasive essays from 10th graders on censorship in libraries--we systematically assess the performance of ChatGPT compared to four tree-based ML algorithms trained on extensive statistical text features. Our findings show that while LLMs perform well in essay classification tasks, ML methods consistently outperform LLMs in predicting writing quality. We highlight the importance of prompting and fine-tuning techniques in LLM-based scoring, along with the strengths and limitations of both approaches. We also discuss the potential of LLMs to enhance AES in educational settings while underscoring the continued importance of human oversight in evaluating complex writing skills. Overall, this study demonstrates the complementary strengths of different approaches to AES, providing guidance for researchers and educators interested in leveraging LLMs in educational assessment.
Abstractor:	As Provided
IES Funded:	Yes
Entry Date:	2026
Accession Number:	ED678295
Database:	ERIC
