Graders of the Future: Comparing the Consistency and Accuracy of GPT4 and Pre-Service Teachers in Physics Essay Question Assessments

Bibliographic Details
Title:	Graders of the Future: Comparing the Consistency and Accuracy of GPT4 and Pre-Service Teachers in Physics Essay Question Assessments
Language:	English
Authors:	Yubin Xu (ORCID 0009-0009-5967-3270), Lin Liu, Jianwen Xiong, Guangtian Zhu (ORCID 0000-0002-3677-8143)
Source:	Journal of Baltic Science Education. 2025 24(1):187-207.
Availability:	Scientia Socialis Ltd. 29 K. Donelaicio Street, LT-78115 Siauliai, Republic of Lithuania. e-mail: scientia@scientiasocialis.lt; e-mail: mail.jbse@gmail.com; Web site: http://www.scientiasocialis.lt/jbse/
Peer Reviewed:	Y
Page Count:	21
Publication Date:	2025
Document Type:	Journal Articles; Reports - Research
Education Level:	Secondary Education; Higher Education; Postsecondary Education
Descriptors:	Physics, Artificial Intelligence, Computer Software, Accuracy, Evaluators, Computational Linguistics, Science Education, Educational Assessment, Secondary School Students, Essays, Writing Evaluation, Interrater Reliability, Correlation, Scores, Preservice Teachers, Comparative Analysis, Foreign Countries
Geographic Terms:	China
ISSN:	1648-3898; 2538-7138
Abstract:	As the development and application of large language models (LLMs) in physics education progress, the well-known AI-based chatbot ChatGPT4 has presented numerous opportunities for educational assessment. Investigating the potential of AI tools in practical educational assessment carries profound significance. This study explored the comparative performance of ChatGPT4 and human graders in scoring upper-secondary physics essay questions. Eighty upper-secondary students' responses to two essay questions were evaluated by 30 pre-service teachers and ChatGPT4. The analysis highlighted their scoring consistency and accuracy, including intra-human comparisons, GPT grading at different times, human-GPT comparisons, and grading variations across cognitive categories. The intraclass correlation coefficient (ICC) was used to assess consistency, while accuracy was illustrated through Pearson correlation coefficient analysis with expert scores. The findings reveal that while ChatGPT4 demonstrated higher consistency in scoring, human scorers showed superior accuracy in most instances. These results underscore the strengths and limitations of using LLMs in educational assessments. The high consistency of LLMs can be valuable in standardizing assessments across diverse educational contexts, while the nuanced understanding and flexibility of human graders are irreplaceable in handling complex subjective evaluations.
Abstractor:	As Provided
Entry Date:	2025
Accession Number:	EJ1464000
Database:	ERIC
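
The abstract names two statistics: the intraclass correlation coefficient (ICC) for consistency among graders, and the Pearson correlation with expert scores for accuracy. The record does not say which ICC form the authors used, so the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater). The function names, the toy score matrix, and the expert scores are all hypothetical and are not data from the study.

```python
from math import sqrt


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one row per student response,
    with one score per grader in each row. The ICC form is an
    assumption; the record does not state which one was used.
    """
    n = len(scores)       # number of responses (targets)
    k = len(scores[0])    # number of graders (raters)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: 6 responses scored by 4 graders.
toy_scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
# Hypothetical expert scores for the same 6 responses.
toy_expert = [7, 3, 7, 4, 8, 5]

# Consistency: agreement among the graders.
consistency = icc_2_1(toy_scores)

# Accuracy: correlation of the graders' mean score with the expert score.
mean_scores = [sum(row) / len(row) for row in toy_scores]
accuracy = pearson_r(mean_scores, toy_expert)

print(f"ICC(2,1) = {consistency:.3f}, Pearson r = {accuracy:.3f}")
```

Under this reading, a higher ICC means graders agree more closely with each other, while a higher Pearson r means their scores track the expert scores more closely. The two can diverge, which is the pattern the abstract reports for ChatGPT4 (higher consistency) and the pre-service teachers (higher accuracy in most instances).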
