Speech recognition in adverse conditions: synthetic transformations and real environmental noise.

Bibliographic Details
Title: Speech recognition in adverse conditions: synthetic transformations and real environmental noise.
Authors: Katkov, Sergei (sergei.katkov@student.unibz.it); Liotta, Antonio (antonio.liotta@unibz.it); Vietti, Alessandro (alessandro.vietti@unibz.it)
Source: EURASIP Journal on Audio Speech & Music Processing. 10 April 2026, Vol. 2026, Issue 1, pp. 1-26 (26 pages).
Subjects: Automatic speech recognition, Noise pollution, Data augmentation, Speech perception, Robust statistics
Abstract: While automatic speech recognition (ASR) systems have achieved impressive performance under clean conditions, their reliability in acoustically challenging environments remains an open concern—particularly in multilingual settings. This work presents a comparative evaluation of several modern ASR models, including Whisper variants, QuartzNet, and Conformer-based architectures, under a range of controlled synthetic transformations and real-world environmental noises. Using the Common Voice 17.0 dataset in English, Italian, and German, we assess recognition robustness under additive white noise, pitch shifts, time-stretching, and ecologically valid background recordings (office, cafe, traffic) from the DEMAND dataset. Word error rate (WER) is computed across a spectrum of signal-to-noise ratios, with confidence intervals derived via bootstrap resampling to estimate variability. Unlike many studies that evaluate complete speech pipelines (enhancement front-ends followed by ASR or task-specific fine-tuning), we deliberately focus on off-the-shelf pretrained models without additional front-end processing or adaptation. This design isolates the intrinsic robustness of the ASR architectures themselves and provides a clean baseline against which future enhancement or fine-tuning strategies can be compared. To support the interpretation of extreme-noise regimes, we additionally incorporate a perceptually motivated glimpse proportion analysis, which quantifies the amount of locally audible speech under different noise types and signal-to-noise ratios. This auxiliary analysis is used to contextualize recognition failures in terms of acoustic masking rather than model performance alone. Finally, we include a limited supervised fine-tuning study on English speech for a subset of models, not as a primary contribution, but to illustrate how standard adaptation shifts robustness trends relative to the inference-only baseline. 
Our analysis highlights model- and language-specific responses to distortion: larger models are generally more robust, yet they remain susceptible to temporal and spectral changes. Notably, models were more stable on Italian and German, which we hypothesize may be due to more regular phoneme-to-grapheme mappings in these languages. The findings identify failure modes under distortion and can inform the design of more robust ASR systems for deployment in diverse auditory scenarios. [ABSTRACT FROM AUTHOR]
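Illustrative note (not part of the published abstract): the evaluation protocol described above, mixing noise at a target signal-to-noise ratio and reporting word error rate with bootstrap confidence intervals, could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. All function names are hypothetical, the noise is Gaussian white noise only, and the interval is a percentile bootstrap over utterances, which the abstract does not specify.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Mix Gaussian white noise into `signal` at the requested SNR in dB.

    Hypothetical helper; assumes `signal` is a non-empty list of floats.
    """
    signal_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale noise so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference word count) for one utterance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    return sum(e for e, _ in pairs) / sum(n for _, n in pairs)


def bootstrap_wer_ci(refs, hyps, n_resamples=1000, alpha=0.05, seed=0):
    """Return (WER, lower, upper) using a percentile bootstrap over utterances.

    Assumption: utterances are resampled with replacement; the source does
    not state which bootstrap variant was used.
    """
    rng = random.Random(seed)
    pairs = [word_errors(r, h) for r, h in zip(refs, hyps)]
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return corpus_wer(pairs), lower, upper
```

In such a setup, each test utterance would be corrupted with `add_white_noise` at every SNR of interest, transcribed by the model under test, and scored with `bootstrap_wer_ci` against the reference transcripts. Real environmental recordings would replace the Gaussian noise with a scaled noise clip of the same length.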
Copyright of EURASIP Journal on Audio Speech & Music Processing is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
Header DbId: egs
DbLabel: Engineering Source
An: 194004647
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=194004647
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1186/s13636-026-00458-1
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 26
        StartPage: 1
    Subjects:
      – SubjectFull: Automatic speech recognition
        Type: general
      – SubjectFull: Noise pollution
        Type: general
      – SubjectFull: Data augmentation
        Type: general
      – SubjectFull: Speech perception
        Type: general
      – SubjectFull: Robust statistics
        Type: general
    Titles:
      – TitleFull: Speech recognition in adverse conditions: synthetic transformations and real environmental noise.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Katkov, Sergei
      – PersonEntity:
          Name:
            NameFull: Liotta, Antonio
      – PersonEntity:
          Name:
            NameFull: Vietti, Alessandro
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 10
              M: 04
              Text: 4/10/2026
              Type: published
              Y: 2026
          Identifiers:
            – Type: issn-print
              Value: 16874714
          Numbering:
            – Type: volume
              Value: 2026
            – Type: issue
              Value: 1
          Titles:
            – TitleFull: EURASIP Journal on Audio Speech & Music Processing
              Type: main