Graders of the Future: Comparing the Consistency and Accuracy of GPT4 and Pre-Service Teachers in Physics Essay Question Assessments
| Title: | Graders of the Future: Comparing the Consistency and Accuracy of GPT4 and Pre-Service Teachers in Physics Essay Question Assessments |
|---|---|
| Language: | English |
| Authors: | Yubin Xu (ORCID |
| Source: | Journal of Baltic Science Education. 2025 24(1):187-207. |
| Availability: | Scientia Socialis Ltd. 29 K. Donelaicio Street, LT-78115 Siauliai, Republic of Lithuania. e-mail: scientia@scientiasocialis.lt; e-mail: mail.jbse@gmail.com; Web site: http://www.scientiasocialis.lt/jbse/ |
| Peer Reviewed: | Y |
| Page Count: | 21 |
| Publication Date: | 2025 |
| Document Type: | Journal Articles; Reports - Research |
| Education Level: | Secondary Education; Higher Education; Postsecondary Education |
| Descriptors: | Physics, Artificial Intelligence, Computer Software, Accuracy, Evaluators, Computational Linguistics, Science Education, Educational Assessment, Secondary School Students, Essays, Writing Evaluation, Interrater Reliability, Correlation, Scores, Preservice Teachers, Comparative Analysis, Foreign Countries |
| Geographic Terms: | China |
| ISSN: | 1648-3898; 2538-7138 |
| Abstract: | As the development and application of large language models (LLMs) in physics education progress, the well-known AI-based chatbot ChatGPT4 has presented numerous opportunities for educational assessment. Investigating the potential of AI tools in practical educational assessment carries profound significance. This study explored the comparative performance of ChatGPT4 and human graders in scoring upper-secondary physics essay questions. Eighty upper-secondary students' responses to two essay questions were evaluated by 30 pre-service teachers and ChatGPT4. The analysis highlighted their scoring consistency and accuracy, including intra-human comparisons, GPT grading at different times, human-GPT comparisons, and grading variations across cognitive categories. The intraclass correlation coefficient (ICC) was used to assess consistency, while accuracy was illustrated through Pearson correlation coefficient analysis with expert scores. The findings reveal that while ChatGPT4 demonstrated higher consistency in scoring, human scorers showed superior accuracy in most instances. These results underscore the strengths and limitations of using LLMs in educational assessments. The high consistency of LLMs can be valuable in standardizing assessments across diverse educational contexts, while the nuanced understanding and flexibility of human graders are irreplaceable in handling complex subjective evaluations. |
| Abstractor: | As Provided |
| Entry Date: | 2025 |
| Accession Number: | EJ1464000 |
| Database: | ERIC |
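
The abstract names two statistics: the intraclass correlation coefficient (ICC) for consistency among graders, and the Pearson correlation with expert scores for accuracy. The record does not say which ICC form the authors used, so the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater). The function names and the score values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def icc2_1(scores):
    """ICC(2,1) for a table with one row per response, one column per grader.

    Assumption: two-way random effects, absolute agreement, single rater.
    The study may have used a different ICC form.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


if __name__ == "__main__":
    # Hypothetical data: 5 student responses scored by 3 graders.
    grader_scores = [
        [7, 8, 7],
        [4, 5, 3],
        [9, 9, 8],
        [5, 6, 6],
        [2, 3, 2],
    ]
    expert_scores = [8, 4, 9, 6, 2]  # hypothetical expert reference scores

    mean_scores = [sum(row) / len(row) for row in grader_scores]
    print("consistency, ICC(2,1):", round(icc2_1(grader_scores), 3))
    print("accuracy, Pearson r:", round(pearson(mean_scores, expert_scores), 3))
```

A higher ICC means the graders agree more closely with one another. A higher Pearson r means their scores track the expert scores more closely. The two can diverge, which is the pattern the abstract reports for ChatGPT4 versus the pre-service teachers.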