Generating In-Context, Personalized Feedback for Intelligent Tutors with Large Language Models
| Title: | Generating In-Context, Personalized Feedback for Intelligent Tutors with Large Language Models |
|---|---|
| Language: | English |
| Authors: | Jennifer M. Reddig, Arav Arora, Christopher J. MacLellan |
| Source: | International Journal of Artificial Intelligence in Education. 2025 35(6):3459-3500. |
| Availability: | Springer. Available from: Springer Nature. One New York Plaza, Suite 4600, New York, NY 10004. Tel: 800-777-4643; Tel: 212-460-1500; Fax: 212-460-1700; e-mail: customerservice@springernature.com; Web site: https://link.springer.com/ |
| Peer Reviewed: | Y |
| Page Count: | 42 |
| Publication Date: | 2025 |
| Sponsoring Agency: | National Science Foundation (NSF) |
| Contract Number: | 2112532 |
| Document Type: | Journal Articles; Reports - Research |
| Education Level: | Higher Education; Postsecondary Education |
| Descriptors: | Intelligent Tutoring Systems, Artificial Intelligence, Feedback (Response), Error Correction, Accuracy, Evaluation, College Mathematics, Algebra, Models |
| DOI: | 10.1007/s40593-025-00505-6 |
| ISSN: | 1560-4292; 1560-4306 |
| Abstract: | This study explores how large language models (LLMs), specifically GPT-4, could be used to generate personalized feedback within an Intelligent Tutoring System (ITS). The research focuses on evaluating the model's ability to (1) diagnose student errors, (2) generate personalized corrective feedback, and (3) assess the accuracy of diagnoses and helpfulness of the feedback. We analyze student errors from the Apprentice Tutor College Algebra ITS and prompt GPT-4 to give targeted feedback on those errors. The findings suggest that while this model can effectively diagnose a range of student errors, its feedback varies in effectiveness based on the complexity of the problem and the type of error. While GPT-4 generates relevant, specific feedback a majority of the time, 35% of the hints were too general, incorrect, or gave away the correct answer. The study also explores methods for using an LLM to automatically evaluate the validity of generated feedback, and finds that only 35% of feedback passes automated helpfulness evaluations. |
| Abstractor: | As Provided |
| Entry Date: | 2026 |
| Accession Number: | EJ1500144 |
| Database: | ERIC |