ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring. EdWorkingPaper No. 25-1335
| Title: | ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring. EdWorkingPaper No. 25-1335 |
|---|---|
| Language: | English |
| Authors: | Youngwon Kim, Reagan Mozer, Shireen Al-Adeimi, Luke Miratrix, Annenberg Institute for School Reform at Brown University |
| Source: | Annenberg Institute for School Reform at Brown University. 2025. |
| Availability: | Annenberg Institute for School Reform at Brown University. Brown University Box 1985, Providence, RI 02912. Tel: 401-863-7990; Fax: 401-863-1290; e-mail: annenberg@brown.edu; Web site: https://annenberg.brown.edu/ |
| Peer Reviewed: | N |
| Page Count: | 28 |
| Publication Date: | 2025 |
| Sponsoring Agency: | Institute of Education Sciences (ED) |
| Contract Number: | R305D220032 |
| Document Type: | Reports - Research |
| Education Level: | Elementary Education, Grade 4, Intermediate Grades, Grade 5, Middle Schools, Grade 6, Grade 7, Junior High Schools, Secondary Education, Grade 10, High Schools |
| Descriptors: | Automation, Essays, Writing Evaluation, Scoring, Artificial Intelligence, Natural Language Processing, Algorithms, Grade 4, Grade 5, Grade 6, Grade 7, Persuasive Discourse, Grade 10, Classification, Prediction, Performance, Expository Writing, Evaluation Methods, Technology Uses in Education, Writing Skills |
| Abstract: | Automated Essay Scoring (AES) is a critical tool in education that aims to enhance the efficiency and objectivity of educational assessments. Recent advancements in Large Language Models (LLMs), such as ChatGPT, have sparked interest in their potential for AES. However, comprehensive comparisons of LLM-based methods with traditional machine learning (ML) methods across different assessment contexts remain limited. This study compares the efficacy of LLMs with supervised ML algorithms in assessing both categorical essay opinions and continuous writing quality scores. Using two distinct datasets--argumentative essays from 4th-7th graders about iPad usage in schools, and persuasive essays from 10th graders on censorship in libraries--we systematically assess the performance of ChatGPT compared to four tree-based ML algorithms trained on extensive statistical text features. Our findings show that while LLMs perform well in essay classification tasks, ML methods consistently outperform LLMs in predicting writing quality. We highlight the importance of prompting and fine tuning techniques in LLM-based scoring, along with the strengths and limitations of both approaches. We also discuss the potential of LLMs to enhance AES in educational settings while underscoring the continued importance of human oversight in evaluating complex writing skills. Overall, this study demonstrates the complementary strengths of different approaches to AES, providing guidance for researchers and educators interested in leveraging LLMs in educational assessment. |
| Abstractor: | As Provided |
| IES Funded: | Yes |
| Entry Date: | 2026 |
| Accession Number: | ED678295 |
| Database: | ERIC |