Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset
| Title: | Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset |
|---|---|
| Language: | English |
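| Authors: | Owen Henkel; Hannah Horne-Robinson; Maria Dyshel; Greg Thompson; Ralph Abboud; Nabil Al Nahin Ch; Baptiste Moreau-Pernet; Kirk Vanacore |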
| Source: | Journal of Learning Analytics. 2025 12(1):50-64. |
| Availability: | Society for Learning Analytics Research. 121 Pointe Marsan, Beaumont, AB T4X 0A2, Canada. Tel: +61-429-920-838; e-mail: info@solaresearch.org; Web site: https://learning-analytics.info/index.php/JLA/index |
| Peer Reviewed: | Y |
| Page Count: | 15 |
| Publication Date: | 2025 |
| Document Type: | Journal Articles Reports - Research |
| Education Level: | Junior High Schools Middle Schools Secondary Education High Schools |
| Descriptors: | Learning Analytics, Learning Management Systems, Mathematics Instruction, Middle School Students, High School Students, Computational Linguistics, Grading, Test Items, Mathematics Tests, Artificial Intelligence, Computer Software, Bayesian Statistics, Classification, Foreign Countries, Cues |
| Geographic Terms: | Nigeria, South Africa, Ghana, Africa |
| ISSN: | 1929-7750 |
| Abstract: | This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a mathematics learning platform used by middle and high school students in several African countries. Using this dataset, we conducted two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. In experiment 1, we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach -- chain-of-thought prompting -- accurately scored 97% of these edge cases, effectively boosting the overall accuracy of the grading from 96% to 97%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that modest improvements in model accuracy can lead to significant changes in mastery estimation. Where the rule-based classifier misclassified the mastery status of 6.9% of students across completed lessons, using the LLM chain-of-thought approach reduced this to 2.6%. These findings suggest that LLMs could be valuable for grading fill-in questions in mathematics education, potentially enabling wider adoption of open-response questions in learning systems. |
| Abstractor: | As Provided |
| Entry Date: | 2025 |
| Accession Number: | EJ1465703 |
| Database: | ERIC |
| Full Text: | https://eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=EJ1465703 (Full Text from ERIC) |
| Authors (ORCID): | Owen Henkel (0009-0001-8850-067X); Hannah Horne-Robinson; Maria Dyshel; Greg Thompson; Ralph Abboud (0000-0002-2332-0504); Nabil Al Nahin Ch (0000-0002-0202-1724); Baptiste Moreau-Pernet (0009-0006-9424-455X); Kirk Vanacore (0000-0003-0673-5721) |
| Publication Type: | Academic Journal |
| Permalink: | https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=eric&AN=EJ1465703 |