FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement.

Bibliographic Details
Title:	FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement.
Authors:	Xu, Shiyun¹ (AUTHOR), Zhang, Wenjie¹ (AUTHOR), Cao, Yinghan¹ (AUTHOR), Zhang, Zehua¹ (AUTHOR), He, Changjun¹ (AUTHOR), Wang, Mingjiang¹ (AUTHOR) mjwang@hit.edu.cn
Source:	Applied Acoustics. Dec2025, Vol. 240, pN.PAG-N.PAG. 1p.
Subjects:	Speech enhancement, Transformer models, Multichannel communication, Noise control, Feature extraction
Abstract:	Noise and reverberation can significantly degrade the quality and intelligibility of speech. Therefore, multi-channel speech enhancement models that effectively leverage spatial information have garnered widespread attention. The Transformer architecture has demonstrated impressive performance in multi-channel speech enhancement. However, the redundant features extracted by the self-attention mechanism hinder the network's ability to capture local characteristics, resulting in the loss of speech details. To address these issues, we propose the fused sparse transformer (FSformer) to assist the network in learning key features sparsely and effectively. We introduce the fused sparse self-attention (FSSA) module, which selects only the top-k features with the highest contribution scores when computing the self-attention map and employs a fusion strategy to adaptively retain the most valuable features. Furthermore, the local feature refinement extractor (L-FRE) and global feature refinement extractor (G-FRE) are introduced in FSSA to enhance the interaction between global and local features. Additionally, we propose the partial gated feed-forward network (PGFN), which utilizes partial convolution to further enhance the feature extraction capability of the network and employs the gating mechanism to reduce redundancy within channels, thereby compensating for the shortcomings of FSSA. The experimental results indicate that FSformer demonstrates a significant advantage in terms of speech enhancement performance, effectively and naturally improving speech quality and intelligibility, thereby providing a pleasant experience for listeners. Specifically, on the spatialized DNS dataset, FSformer achieves PESQ, STOI, and SI-SDR scores of 3.40, 0.952, and 10.9, respectively. FSformer also demonstrates exceptional performance in suppressing noise and reverberation across various levels of noise and reverberation. On the test set containing noise and reverberation, FSformer achieves a PESQ score of 3.41, a STOI score of 0.959, an SI-SDR score of 10.9, a DNSMOS score of 3.525, a CD of 2.527, an LLR of 0.27, and an SNR_fw of 13.434. Furthermore, FSformer demonstrates superior generalization capabilities, achieving a DNSMOS of 3.163, a P.808 MOS of 3.762, and an NISQA of 3.779 on real datasets.
	• We propose FSformer based on Transformer, which can sparsely and effectively learn the most critical features to enhance speech.
	• We introduce FSSA, which enhances global-local feature interaction and eliminates redundant features.
	• We present PGFN, which enhances feature extraction and reduces channel redundancy, addressing the limitations of FSSA.
	• The experimental results show that the denoising and dereverberation performance of FSformer is superior to other multi-channel SOTA models.
	[ABSTRACT FROM AUTHOR]
	Copyright of Applied Acoustics is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

Publication Type:	Academic Journal
Accession Number:	187264663
PLink	https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=187264663
DOI:	10.1016/j.apacoust.2025.110858
ISSN:	0003-682X
Language:	English
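
Illustrative note: the abstract describes a self-attention map in which only the top-k highest-scoring features are kept. The sketch below shows one common way such a top-k mask can be applied to scaled dot-product attention. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `topk_sparse_attention`, the per-query row-wise selection, and the tensor shapes are all hypothetical, and the fusion strategy, L-FRE, G-FRE, and PGFN mentioned in the abstract are not modeled.

```python
import numpy as np


def topk_sparse_attention(q, k, v, top_k):
    """Scaled dot-product attention keeping only the top_k scores per query.

    Hypothetical sketch: q is (n_q, d), k and v are (n_k, d), 1 <= top_k <= n_k.
    Returns the attended values and the sparse attention weights.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (n_q, n_k) similarity scores

    # Per-row threshold: the top_k-th largest score in each row.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]

    # Scores below the threshold are set to -inf so they get zero weight.
    masked = np.where(scores >= kth, scores, -np.inf)

    # Numerically stable softmax over the surviving entries.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage with random inputs (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out, w = topk_sparse_attention(q, k, v, top_k=2)
```

With `top_k` equal to the number of keys, the mask keeps every score and the function reduces to ordinary dense attention, which makes the sparsity level easy to compare against a dense baseline.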