Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset

Bibliographic Details
Title: Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset
Language: English
Authors: Owen Henkel (ORCID 0009-0001-8850-067X), Hannah Horne-Robinson, Maria Dyshel, Greg Thompson, Ralph Abboud (ORCID 0000-0002-2332-0504), Nabil Al Nahin Ch (ORCID 0000-0002-0202-1724), Baptiste Moreau-Pernet (ORCID 0009-0006-9424-455X), Kirk Vanacore (ORCID 0000-0003-0673-5721)
Source: Journal of Learning Analytics. 2025 12(1):50-64.
Availability: Society for Learning Analytics Research. 121 Pointe Marsan, Beaumont, AB T4X 0A2, Canada. Tel: +61-429-920-838; e-mail: info@solaresearch.org; Web site: https://learning-analytics.info/index.php/JLA/index
Peer Reviewed: Y
Page Count: 15
Publication Date: 2025
Document Type: Journal Articles; Reports - Research
Education Level: Junior High Schools; Middle Schools; Secondary Education; High Schools
Descriptors: Learning Analytics, Learning Management Systems, Mathematics Instruction, Middle School Students, High School Students, Computational Linguistics, Grading, Test Items, Mathematics Tests, Artificial Intelligence, Computer Software, Bayesian Statistics, Classification, Foreign Countries, Cues
Geographic Terms: Nigeria, South Africa, Ghana, Africa
ISSN: 1929-7750
Abstract: This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a mathematics learning platform used by middle and high school students in several African countries. Using this dataset, we conduct two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. In experiment 1, we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach, chain-of-thought prompting, accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimates student mastery of specific lessons. We find that modest improvements in model accuracy can lead to significant changes in mastery estimation. Where the rule-based classifier misclassified the mastery status of 6.9% of students across completed lessons, using the LLM chain-of-thought approach reduced this to 2.6%. These findings suggest that LLMs could be valuable for grading fill-in questions in mathematics education, potentially enabling wider adoption of open-response questions in learning systems.
Abstractor: As Provided
Entry Date: 2025
Accession Number: EJ1465703
Database: ERIC
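
Note: The abstract reports that small grading errors can noticeably shift Bayesian Knowledge Tracing (BKT) mastery estimates. The following is a minimal sketch of the standard BKT update, included only to illustrate that mechanism. It is not the authors' implementation: the function names, the parameter values (p_init, p_slip, p_guess, p_learn), and the two grade sequences are all hypothetical.

```python
def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One standard BKT step: Bayesian posterior given the grade, then a learning transition.

    All default parameter values are illustrative assumptions, not values from the paper.
    """
    if correct:
        # Evidence for mastery: the student knew the skill and did not slip.
        numerator = p_mastery * (1 - p_slip)
        denominator = numerator + (1 - p_mastery) * p_guess
    else:
        # Evidence against mastery: the student knew the skill but slipped.
        numerator = p_mastery * p_slip
        denominator = numerator + (1 - p_mastery) * (1 - p_guess)
    posterior = numerator / denominator
    # Chance that an unmastered skill is learned before the next question.
    return posterior + (1 - posterior) * p_learn


def estimate_mastery(grades, p_init=0.3):
    """Fold a sequence of graded answers (True = correct) into a mastery estimate."""
    p = p_init
    for grade in grades:
        p = bkt_update(p, grade)
    return p


# Hypothetical lesson: the fourth answer is actually correct, but a grader marks it wrong.
true_grades = [True, True, False, True, True]
misgraded = [True, True, False, False, True]

p_true = estimate_mastery(true_grades)
p_misgraded = estimate_mastery(misgraded)
print(f"mastery with accurate grades: {p_true:.3f}")
print(f"mastery with one misgrade:    {p_misgraded:.3f}")
```

Under these assumed parameters, a single misgraded answer lowers the final estimate, so a student near a mastery cutoff could be classified differently depending on grading accuracy.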