Performance of Large Language Models in Neurology Multiple‐Choice Questions.
| Title: | Performance of Large Language Models in Neurology Multiple‐Choice Questions. |
|---|---|
| Authors: | Habibi, Gholamreza (AUTHOR), Gargari, Omid Kohandel (AUTHOR), Hosseini, Mostafa (AUTHOR), Afchangi, Kasra (AUTHOR), Saleem, Suraiya (AUTHOR) |
| Source: | Acta Neurologica Scandinavica. 6/1/2026, Vol. 2026, p1-7. 7p. |
| Subjects: | Generative pre-trained transformers, Neurology, Artificial intelligence in medicine, Language models, Multiple choice examinations |
| Abstract: | Introduction: Navigating neurological disorders is complex due to overlapping symptoms and diverse diagnostic requirements. Large language models (LLMs) offer potential support for clinicians by processing vast amounts of textual data. This study evaluates the performance of four advanced LLMs (GPT‐4, GPT‐3.5, Clinical Camel, and MedALPACA) in answering neurology‐based multiple‐choice questions (MCQs). Methods: The study utilized 170 MCQs from the Comprehensive Review in Clinical Neurology book. Questions were divided into the body and answer choices, stored separately in an Excel spreadsheet. The models were prompted to select the correct answers. Generative Pretrained Transformer (GPT) models were accessed via OpenAI's API, whereas Clinical Camel and MedALPACA were downloaded from Hugging Face. Accuracy was calculated by the number of correct answers over total questions, with subgroup analysis based on subject headings. Results: GPT‐4 achieved the highest accuracy at 84.7%, significantly outperforming GPT‐3.5 (58.8%), Clinical Camel (52.9%), and MedALPACA (40%). GPT‐4's performance was significantly better (p < 0.001). The difference between GPT‐3.5 and Clinical Camel was not significant (p = 0.27), but both outperformed MedALPACA. Response similarity was highest between GPT‐4 and GPT‐3.5 (64.1%) and lowest between GPT‐4 and MedALPACA (41.2%). In subgroup analysis, GPT‐4 was superior across all topics, achieving full scores in six topics and its lowest in vascular neurology (40%). GPT‐3.5 performed best in eight topics, Clinical Camel in three topics, and MedALPACA was the weakest in all but two topics. Conclusion: GPT‐4 demonstrated the highest accuracy in answering neurology MCQs, outperforming GPT‐3.5, Clinical Camel, and MedALPACA. Clinical Camel's comparable performance to GPT‐3.5 highlights the potential of specialized medical models. External evaluation datasets remain essential to avoid data leakage and ensure fair benchmarking. Further research is needed to expand topic coverage, assess reasoning processes, and include human comparison to support safe and effective clinical integration of medical LLMs. [ABSTRACT FROM AUTHOR] |
| Copyright: | Copyright of Acta Neurologica Scandinavica is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: | Psychology and Behavioral Sciences Collection |
| ISSN: | 00016314 |
| DOI: | 10.1155/ane/5623086 |
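
The abstract's Methods define accuracy as correct answers over total questions and its Results report pairwise response similarity. The following is a minimal Python sketch of those two calculations, not the authors' code. It assumes that "response similarity" means the fraction of questions on which two models chose the same option, which the abstract does not define. The names `accuracy`, `agreement`, `model_x`, and `model_y` and the five-question toy data are hypothetical. The last step only converts the reported percentages into the whole-number counts they imply for 170 questions.

```python
from itertools import combinations


def accuracy(predicted, answer_key):
    """Fraction of questions where the chosen option matches the key."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    return sum(p == k for p, k in zip(predicted, answer_key)) / len(answer_key)


def agreement(answers_a, answers_b):
    """Fraction of questions where two models chose the same option.

    Assumption: this is one plausible reading of "response similarity";
    the abstract does not define the measure.
    """
    if len(answers_a) != len(answers_b):
        raise ValueError("both answer lists must have the same length")
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)


# Hypothetical toy data (5 questions); not taken from the study.
answer_key = ["A", "C", "B", "D", "A"]
responses = {
    "model_x": ["A", "C", "B", "D", "B"],
    "model_y": ["A", "B", "B", "D", "C"],
}

accuracies = {name: accuracy(ans, answer_key) for name, ans in responses.items()}
pairwise = {
    (a, b): agreement(responses[a], responses[b])
    for a, b in combinations(responses, 2)
}

# Whole-number correct counts implied by the percentages in the abstract.
# These are derived by rounding, not reported by the authors.
N_QUESTIONS = 170
reported = {
    "GPT-4": 0.847,
    "GPT-3.5": 0.588,
    "Clinical Camel": 0.529,
    "MedALPACA": 0.40,
}
implied_correct = {m: round(p * N_QUESTIONS) for m, p in reported.items()}

print(accuracies)
print(pairwise)
print(implied_correct)
```

Under these assumptions, the reported percentages correspond to 144, 100, 90, and 68 correct answers out of 170 for GPT‐4, GPT‐3.5, Clinical Camel, and MedALPACA respectively.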