MLFormer: Linear-Attention Transformer with Multi-Scale Feature Enhancement for Real-Time Monocular 3D Human Pose Estimation.

Bibliographic Details
Title:	MLFormer: Linear-Attention Transformer with Multi-Scale Feature Enhancement for Real-Time Monocular 3D Human Pose Estimation.
Authors:	Wang, Bowen¹ 670144302@qq.com, Wang, Shiwen² 2042798189@qq.com, Zhou, Ziwei³ 381431970@qq.com
Source:	Engineering Letters. Apr2026, Vol. 34 Issue 4, p1385-1394. 10p.
Subjects:	Transformer models, Real-time computing, Feature extraction, Computer vision, Artificial intelligence
Abstract:	With the rapid development of artificial intelligence, monocular human pose estimation has become increasingly prominent in the field of computer vision. It holds broad application prospects in areas such as intelligent sports, medical rehabilitation, and human-computer interaction. Nevertheless, existing monocular approaches still suffer from limited accuracy, real-time performance, and adaptability. To address these issues, we propose MLFormer, a Transformer-based architecture that integrates a linear attention mechanism with a Multi-scale Feature Enhancement Module (MFEM). This design significantly reduces computational complexity while improving both accuracy and inference speed. Evaluated on Human3.6M, MLFormer achieves an MPJPE of 42.1 mm; on MPI-INF-3DHP it attains 94.6% PCK, 67.1% AUC, and 53.8 mm MPJPE, surpassing state-of-the-art methods on all metrics. Extensive experiments demonstrate that MLFormer retains high precision, offers stronger real-time capability, and exhibits superior adaptability to human poses at varying scales, together with robustness and generalizability. Overall, the proposed model delivers an efficient solution for monocular human pose estimation, providing notable improvements in accuracy, real-time performance, and adaptability. [ABSTRACT FROM AUTHOR]
	Copyright of Engineering Letters is the property of International Association of Engineers (IAENG) and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

Description
ISSN:	1816093X