FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement.

Bibliographic Details
Title: FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement.
Authors: Xu, Shiyun (AUTHOR), Zhang, Wenjie (AUTHOR), Cao, Yinghan (AUTHOR), Zhang, Zehua (AUTHOR), He, Changjun (AUTHOR), Wang, Mingjiang (AUTHOR) mjwang@hit.edu.cn
Source: Applied Acoustics. Dec2025, Vol. 240, pN.PAG-N.PAG. 1p.
Subjects: Speech enhancement, Transformer models, Multichannel communication, Noise control, Feature extraction
Abstract: Noise and reverberation can significantly degrade the quality and intelligibility of speech. Therefore, multi-channel speech enhancement models that effectively leverage spatial information have garnered widespread attention. The Transformer architecture has demonstrated impressive performance in multi-channel speech enhancement. However, the redundant features extracted by the self-attention mechanism hinder the network's ability to capture local characteristics, resulting in the loss of speech details. To address these issues, we propose the fused sparse transformer (FSformer) to assist the network in learning key features sparsely and effectively. We introduce the fused sparse self-attention (FSSA) module, which selects only the top-k features with the highest contribution scores when computing the self-attention map and employs a fusion strategy to adaptively retain the most valuable features. Furthermore, the local feature refinement extractor (L-FRE) and global feature refinement extractor (G-FRE) are introduced in FSSA to enhance the interaction between global and local features. Additionally, we propose the partial gated feed-forward network (PGFN), which utilizes partial convolution to further enhance the feature extraction capability of the network and employs the gating mechanism to reduce redundancy within channels, thereby compensating for the shortcomings of FSSA. The experimental results indicate that FSformer demonstrates a significant advantage in terms of speech enhancement performance, effectively and naturally improving speech quality and intelligibility, thereby providing a pleasant experience for listeners. Specifically, on the spatialized DNS dataset, FSformer achieves PESQ, STOI, and SI-SDR scores of 3.40, 0.952, and 10.9, respectively. FSformer also demonstrates exceptional performance in suppressing noise and reverberation across various levels of noise and reverberation environments. On the test set containing noise and reverberation, FSformer achieves a PESQ score of 3.41, a STOI score of 0.959, an SI-SDR score of 10.9, a DNSMOS score of 3.525, a CD of 2.527, an LLR of 0.27, and an SNR_fw of 13.434. Furthermore, FSformer demonstrates superior generalization capabilities, achieving a DNSMOS of 3.163, a P.808 MOS of 3.762, and an NISQA of 3.779 on real datasets.
• We propose FSformer based on Transformer, which can sparsely and effectively learn the most critical features to enhance speech.
• We introduce FSSA, which enhances global-local feature interaction and eliminates redundant features.
• We present PGFN, which enhances feature extraction and reduces channel redundancy, addressing the limitations of FSSA.
• The experimental results show that the denoising and dereverberation performance of FSformer is superior to other multi-channel SOTA models.
[ABSTRACT FROM AUTHOR]
Copyright of Applied Acoustics is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
FullText Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 187264663
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PreciseRelevancyScore: 0
IllustrationInfo
Items – Name: Title
  Label: Title
  Group: Ti
  Data: FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement.
– Name: Author
  Label: Authors
  Group: Au
  Data: <searchLink fieldCode="AR" term="%22Xu%2C+Shiyun%22">Xu, Shiyun</searchLink><relatesTo>1</relatesTo> (AUTHOR)<br /><searchLink fieldCode="AR" term="%22Zhang%2C+Wenjie%22">Zhang, Wenjie</searchLink><relatesTo>1</relatesTo> (AUTHOR)<br /><searchLink fieldCode="AR" term="%22Cao%2C+Yinghan%22">Cao, Yinghan</searchLink><relatesTo>1</relatesTo> (AUTHOR)<br /><searchLink fieldCode="AR" term="%22Zhang%2C+Zehua%22">Zhang, Zehua</searchLink><relatesTo>1</relatesTo> (AUTHOR)<br /><searchLink fieldCode="AR" term="%22He%2C+Changjun%22">He, Changjun</searchLink><relatesTo>1</relatesTo> (AUTHOR)<br /><searchLink fieldCode="AR" term="%22Wang%2C+Mingjiang%22">Wang, Mingjiang</searchLink><relatesTo>1</relatesTo> (AUTHOR)<i> mjwang@hit.edu.cn</i>
– Name: TitleSource
  Label: Source
  Group: Src
  Data: <searchLink fieldCode="JN" term="%22Applied+Acoustics%22">Applied Acoustics</searchLink>. Dec2025, Vol. 240, pN.PAG-N.PAG. 1p.
– Name: Subject
  Label: Subjects
  Group: Su
  Data: <searchLink fieldCode="DE" term="%22Speech+enhancement%22">Speech enhancement</searchLink><br /><searchLink fieldCode="DE" term="%22Transformer+models%22">Transformer models</searchLink><br /><searchLink fieldCode="DE" term="%22Multichannel+communication%22">Multichannel communication</searchLink><br /><searchLink fieldCode="DE" term="%22Noise+control%22">Noise control</searchLink><br /><searchLink fieldCode="DE" term="%22Feature+extraction%22">Feature extraction</searchLink>
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: Noise and reverberation can significantly degrade the quality and intelligibility of speech. Therefore, multi-channel speech enhancement models that effectively leverage spatial information have garnered widespread attention. The Transformer architecture has demonstrated impressive performance in multi-channel speech enhancement. However, the redundant features extracted by the self-attention mechanism hinder the network's ability to capture local characteristics, resulting in the loss of speech details. To address these issues, we propose the fused sparse transformer (FSformer) to assist the network in learning key features sparsely and effectively. We introduce the fused sparse self-attention (FSSA) module, which selects only the top-k features with the highest contribution scores when computing the self-attention map and employs a fusion strategy to adaptively retain the most valuable features. Furthermore, the local feature refinement extractor (L-FRE) and global feature refinement extractor (G-FRE) are introduced in FSSA to enhance the interaction between global and local features. Additionally, we propose the partial gated feed-forward network (PGFN), which utilizes partial convolution to further enhance the feature extraction capability of the network and employs the gating mechanism to reduce redundancy within channels, thereby compensating for the shortcomings of FSSA. The experimental results indicate that FSformer demonstrates a significant advantage in terms of speech enhancement performance, effectively and naturally improving speech quality and intelligibility, thereby providing a pleasant experience for listeners. Specifically, on the spatialized DNS dataset, FSformer achieves PESQ, STOI, and SI-SDR scores of 3.40, 0.952, and 10.9, respectively. FSformer also demonstrates exceptional performance in suppressing noise and reverberation across various levels of noise and reverberation environments.
On the test set containing noise and reverberation, FSformer achieves a PESQ score of 3.41, a STOI score of 0.959, an SI-SDR score of 10.9, a DNSMOS score of 3.525, a CD of 2.527, an LLR of 0.27, and an SNR_fw of 13.434. Furthermore, FSformer demonstrates superior generalization capabilities, achieving a DNSMOS of 3.163, a P.808 MOS of 3.762, and an NISQA of 3.779 on real datasets. • We propose FSformer based on Transformer, which can sparsely and effectively learn the most critical features to enhance speech. • We introduce FSSA, which enhances global-local feature interaction and eliminates redundant features. • We present PGFN, which enhances feature extraction and reduces channel redundancy, addressing the limitations of FSSA. • The experimental results show that the denoising and dereverberation performance of FSformer is superior to other multi-channel SOTA models. [ABSTRACT FROM AUTHOR]
– Name: AbstractSuppliedCopyright
  Label:
  Group: Ab
  Data: <i>Copyright of Applied Acoustics is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract.</i> (Copyright applies to all Abstracts.)
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=187264663
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1016/j.apacoust.2025.110858
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 1
        StartPage: N.PAG
    Subjects:
      – SubjectFull: Speech enhancement
        Type: general
      – SubjectFull: Transformer models
        Type: general
      – SubjectFull: Multichannel communication
        Type: general
      – SubjectFull: Noise control
        Type: general
      – SubjectFull: Feature extraction
        Type: general
    Titles:
      – TitleFull: FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Xu, Shiyun
      – PersonEntity:
          Name:
            NameFull: Zhang, Wenjie
      – PersonEntity:
          Name:
            NameFull: Cao, Yinghan
      – PersonEntity:
          Name:
            NameFull: Zhang, Zehua
      – PersonEntity:
          Name:
            NameFull: He, Changjun
      – PersonEntity:
          Name:
            NameFull: Wang, Mingjiang
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 05
              M: 12
              Text: Dec2025
              Type: published
              Y: 2025
          Identifiers:
            – Type: issn-print
              Value: 0003682X
          Numbering:
            – Type: volume
              Value: 240
          Titles:
            – TitleFull: Applied Acoustics
              Type: main
ResultId 1