Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data.
| Title: | Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data. |
|---|---|
| Authors: | Bhagat, Sudesh¹ (bhagat@iastate.edu); Kandiboina, Raghupathi² (Raghupathi.Kandiboina@uwyo.edu); Shihab, Ibne Farabi³ (ishihab@iastate.edu); Knickerbocker, Skylar⁴ (sknick@iastate.edu); Hawkins, Neal⁴ (hawkins@iastate.edu); Sharma, Anuj¹ (anujs@iastate.edu) |
| Source: | Journal of Safety Research. Jun 2026, Vol. 97, pp. 777-791 (15 pages). |
| Subjects: | Natural language processing, Data quality, Drinking & traffic accidents, Geographic spatial analysis, Classification, Regression analysis, Language models, Traffic safety |
| Geographic Terms: | Iowa |
| Abstract: | Introduction: Road traffic crashes remain a leading cause of fatalities worldwide, underscoring the need for accurate data to guide prevention strategies and evidence-based policymaking. However, crash databases often suffer from misclassification, underreporting, and inconsistencies, particularly in alcohol-involved cases, which limits the reliability of safety analyses. Method: This study addresses this issue by identifying and quantifying Misclassified Alcohol-Involved Crashes (MAICs) using a Natural Language Processing (NLP) framework based on the BERT model. The framework analyzed 371,062 crash records from Iowa (2016-2022) and identified 3,895 MAICs out of 19,177 alcohol-involved cases predicted by the model, corresponding to an overall misclassification rate of 20.35% (confidence interval 18.86%-21.85%). To examine the factors contributing to these errors, a mixed-effects Probit Logit regression model was applied, incorporating behavioral, environmental, and roadway attributes. Results: Fatal and nighttime crashes were less likely to be misclassified, whereas crashes involving older or younger drivers, heavy trucks, and vulnerable road users showed higher odds of misclassification. A Local Indicators of Spatial Association (LISA) analysis revealed significant county-level clusters of misclassifications, suggesting regional differences in enforcement and reporting practices. [ABSTRACT FROM AUTHOR] |
| Copyright of Journal of Safety Research is the property of Pergamon Press - An Imprint of Elsevier Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Engineering Source |
| Publication Type: | Academic Journal |
| Language: | English |
| DOI: | 10.1016/j.jsr.2026.05.007 |
| ISSN (print): | 0022-4375 |
| Accession Number: | 194574421 |
| Persistent Link: | https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=194574421 |