Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data.

Bibliographic Details
Title: Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data.
Authors: Bhagat, Sudesh (AUTHOR) bhagat@iastate.edu; Kandiboina, Raghupathi (AUTHOR) Raghupathi.Kandiboina@uwyo.edu; Shihab, Ibne Farabi (AUTHOR) ishihab@iastate.edu; Knickerbocker, Skylar (AUTHOR) sknick@iastate.edu; Hawkins, Neal (AUTHOR) hawkins@iastate.edu; Sharma, Anuj (AUTHOR) anujs@iastate.edu
Source: Journal of Safety Research. Jun2026, Vol. 97, p777-791. 15p.
Subjects: Natural language processing, Data quality, Drinking & traffic accidents, Geographic spatial analysis, Classification, Regression analysis, Language models, Traffic safety
Geographic Terms: Iowa
Abstract: Introduction: Road traffic crashes remain a leading cause of fatalities worldwide, underscoring the need for accurate data to guide prevention strategies and evidence-based policymaking. However, crash databases often suffer from misclassification, underreporting, and inconsistencies, particularly in alcohol-involved cases, which limits the reliability of safety analyses. Method: This study addresses the issue by identifying and quantifying Misclassified Alcohol-Involved Crashes (MAICs) using a Natural Language Processing (NLP) framework based on the BERT model. The framework analyzed 371,062 crash records from Iowa (2016-2022) and identified 3,895 MAICs out of 19,177 alcohol-involved cases predicted by the model, corresponding to an overall misclassification rate of 20.35%, with a confidence interval of 18.86%-21.85%. To examine the factors contributing to these errors, a mixed-effects Probit Logit regression model was applied, incorporating behavioral, environmental, and roadway attributes. Results: Fatal and nighttime crashes were less likely to be misclassified, whereas crashes involving older or younger drivers, heavy trucks, and vulnerable road users showed higher odds of misclassification. A Local Indicators of Spatial Association (LISA) analysis revealed significant county-level clusters of misclassifications, suggesting regional differences in enforcement and reporting practices. [ABSTRACT FROM AUTHOR]
Copyright of Journal of Safety Research is the property of Pergamon Press - An Imprint of Elsevier Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
FullText Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 194574421
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PreciseRelevancyScore: 0
IllustrationInfo
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=194574421
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1016/j.jsr.2026.05.007
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 15
        StartPage: 777
    Subjects:
      – SubjectFull: Natural language processing
        Type: general
      – SubjectFull: Data quality
        Type: general
      – SubjectFull: Drinking & traffic accidents
        Type: general
      – SubjectFull: Geographic spatial analysis
        Type: general
      – SubjectFull: Classification
        Type: general
      – SubjectFull: Regression analysis
        Type: general
      – SubjectFull: Language models
        Type: general
      – SubjectFull: Traffic safety
        Type: general
      – SubjectFull: Iowa
        Type: general
    Titles:
      – TitleFull: Improving crash data quality by identifying misclassified alcohol-involved crashes using NLP on narrative data.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Bhagat, Sudesh
      – PersonEntity:
          Name:
            NameFull: Kandiboina, Raghupathi
      – PersonEntity:
          Name:
            NameFull: Shihab, Ibne Farabi
      – PersonEntity:
          Name:
            NameFull: Knickerbocker, Skylar
      – PersonEntity:
          Name:
            NameFull: Hawkins, Neal
      – PersonEntity:
          Name:
            NameFull: Sharma, Anuj
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 06
              Text: Jun2026
              Type: published
              Y: 2026
          Identifiers:
            – Type: issn-print
              Value: 00224375
          Numbering:
            – Type: volume
              Value: 97
          Titles:
            – TitleFull: Journal of Safety Research
              Type: main