Generating In-Context, Personalized Feedback for Intelligent Tutors with Large Language Models

Bibliographic Details
Title: Generating In-Context, Personalized Feedback for Intelligent Tutors with Large Language Models
Language: English
Authors: Jennifer M. Reddig, Arav Arora, Christopher J. MacLellan
Source: International Journal of Artificial Intelligence in Education. 2025 35(6):3459-3500.
Availability: Springer. Available from: Springer Nature. One New York Plaza, Suite 4600, New York, NY 10004. Tel: 800-777-4643; Tel: 212-460-1500; Fax: 212-460-1700; e-mail: customerservice@springernature.com; Web site: https://link.springer.com/
Peer Reviewed: Y
Page Count: 42
Publication Date: 2025
Sponsoring Agency: National Science Foundation (NSF)
Contract Number: 2112532
Document Type: Journal Articles; Reports - Research
Education Level: Higher Education; Postsecondary Education
Descriptors: Intelligent Tutoring Systems, Artificial Intelligence, Feedback (Response), Error Correction, Accuracy, Evaluation, College Mathematics, Algebra, Models
DOI: 10.1007/s40593-025-00505-6
ISSN: 1560-4292; 1560-4306
Abstract: This study explores how large language models (LLMs), specifically GPT-4, could be used to generate personalized feedback within an Intelligent Tutoring System (ITS). The research focuses on evaluating the model's ability to (1) diagnose student errors, (2) generate personalized corrective feedback, and (3) assess the accuracy of diagnoses and helpfulness of the feedback. We analyze student errors from the Apprentice Tutor College Algebra ITS and prompt GPT-4 to give targeted feedback on those errors. The findings suggest that while this model can effectively diagnose a range of student errors, its feedback varies in effectiveness based on the complexity of the problem and the type of error. While GPT-4 generates relevant, specific feedback a majority of the time, 35% of the hints were too general, were incorrect, or gave away the correct answer. The study also explores methods for using an LLM to automatically evaluate the validity of generated feedback, and finds that only 35% of feedback passes automated helpfulness evaluations.
Abstractor: As Provided
Entry Date: 2026
Accession Number: EJ1500144
Database: ERIC