Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data.

Bibliographic Details
Title:	Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data.
Authors:	Bhagat, Sudesh¹ (AUTHOR) bhagat@iastate.edu, Kandiboina, Raghupathi² (AUTHOR) Raghupathi.Kandiboina@uwyo.edu, Shihab, Ibne Farabi³ (AUTHOR) ishihab@iastate.edu, Knickerbocker, Skylar⁴ (AUTHOR) sknick@iastate.edu, Hawkins, Neal⁴ (AUTHOR) hawkins@iastate.edu, Sharma, Anuj¹ (AUTHOR) anujs@iastate.edu
Source:	Journal of Safety Research. Jun 2026, Vol. 97, p777-791. 15p.
Subjects:	Natural language processing, Data quality, Drinking & traffic accidents, Geographic spatial analysis, Classification, Regression analysis, Language models, Traffic safety
Geographic Terms:	Iowa
Abstract:	Introduction: Road traffic crashes remain a leading cause of fatalities worldwide, underscoring the need for accurate data to guide prevention strategies and evidence-based policymaking. However, crash databases often suffer from misclassification, underreporting, and inconsistencies, particularly in alcohol-involved cases, which limits the reliability of safety analyses. Method: This study addresses this issue by identifying and quantifying Misclassified Alcohol-Involved Crashes (MAICs) using a Natural Language Processing (NLP) framework based on the BERT model. The framework analyzed 371,062 crash records from Iowa (2016-2022) and identified 3,895 MAICs out of 19,177 alcohol-involved cases predicted by the model, corresponding to an overall misclassification rate of 20.35% and a confidence interval of 18.86%-21.85%. To examine the factors contributing to these errors, a mixed-effects Probit Logit regression model was applied, incorporating behavioral, environmental, and roadway attributes. Results: Fatal and nighttime crashes were less likely to be misclassified, whereas crashes involving older or younger drivers, heavy trucks, and vulnerable road users showed higher odds of misclassification. A Local Indicators of Spatial Association (LISA) analysis revealed significant county-level clusters of misclassifications, suggesting regional differences in enforcement and reporting practices. [ABSTRACT FROM AUTHOR]
	Copyright of Journal of Safety Research is the property of Pergamon Press - An Imprint of Elsevier Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

Description
ISSN:	0022-4375
DOI:	10.1016/j.jsr.2026.05.007