Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset
| Title: | Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset |
|---|---|
| Language: | English |
| Authors: | Owen Henkel (ORCID |
| Source: | Journal of Learning Analytics. 2025 12(1):50-64. |
| Availability: | Society for Learning Analytics Research. 121 Pointe Marsan, Beaumont, AB T4X 0A2, Canada. Tel: +61-429-920-838; e-mail: info@solaresearch.org; Web site: https://learning-analytics.info/index.php/JLA/index |
| Peer Reviewed: | Y |
| Page Count: | 15 |
| Publication Date: | 2025 |
| Document Type: | Journal Articles; Reports - Research |
| Education Level: | Junior High Schools; Middle Schools; Secondary Education; High Schools |
| Descriptors: | Learning Analytics, Learning Management Systems, Mathematics Instruction, Middle School Students, High School Students, Computational Linguistics, Grading, Test Items, Mathematics Tests, Artificial Intelligence, Computer Software, Bayesian Statistics, Classification, Foreign Countries, Cues |
| Geographic Terms: | Nigeria, South Africa, Ghana, Africa |
| ISSN: | 1929-7750 |
| Abstract: | This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a mathematics learning platform used by middle and high school students in several African countries. Using this dataset, we conducted two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. In experiment 1, we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach, chain-of-thought prompting, accurately scored 97% of these edge cases, effectively boosting the overall accuracy of the grading from 96% to 97%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that modest improvements in model accuracy can lead to significant changes in mastery estimation. Where the rule-based classifier misclassified the mastery status of 6.9% of students across completed lessons, using the LLM chain-of-thought approach reduced this to 2.6%. These findings suggest that LLMs could be valuable for grading fill-in questions in mathematics education, potentially enabling wider adoption of open-response questions in learning systems. |
| Abstractor: | As Provided |
| Entry Date: | 2025 |
| Accession Number: | EJ1465703 |
| Database: | ERIC |
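
The abstract's claim that small grading errors can shift mastery estimates can be illustrated with a minimal sketch of the standard Bayesian Knowledge Tracing update. This is not the paper's implementation: the function names and the slip, guess, learn, and initial-knowledge parameters below are hypothetical values chosen only for illustration.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One standard BKT step (parameter values are illustrative assumptions).

    First condition the mastery probability on the graded answer,
    then apply the chance of learning from the practice opportunity.
    """
    if correct:
        numerator = p_known * (1 - p_slip)
        denominator = numerator + (1 - p_known) * p_guess
    else:
        numerator = p_known * p_slip
        denominator = numerator + (1 - p_known) * (1 - p_guess)
    posterior = numerator / denominator
    return posterior + (1 - posterior) * p_learn


def estimate_mastery(grades, p_init=0.3, **params):
    """Run the update over a sequence of graded answers (True = correct)."""
    p_known = p_init
    for grade in grades:
        p_known = bkt_update(p_known, grade, **params)
    return p_known


# Hypothetical student who answered every item correctly, but whose
# last answer a grader wrongly marked as incorrect.
true_grades = [True, True, True, True]
misgraded = [True, True, True, False]

mastery_true = estimate_mastery(true_grades)
mastery_misgraded = estimate_mastery(misgraded)
print(f"true grades: {mastery_true:.3f}  misgraded: {mastery_misgraded:.3f}")
```

Under these assumed parameters, a single misgraded answer pulls the mastery estimate down noticeably, which is the kind of downstream effect experiment 2 measures.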