MIFMNet: A Multimodal Interactions and Fusion Mamba for RGBT Tracking with UAV Platforms.
| Title: | MIFMNet: A Multimodal Interactions and Fusion Mamba for RGBT Tracking with UAV Platforms. |
|---|---|
| Authors: | Guo, Runze (AUTHOR), Sun, Xiaoyong (AUTHOR) sunxiaoyong14@nudt.edu.cn, Sun, Bei (AUTHOR), Qian, Hanxiang (AUTHOR), Dang, Zhaoyang (AUTHOR), Zhou, Peida (AUTHOR), Liu, Feiyang (AUTHOR), Su, Shaojing (AUTHOR) |
| Source: | Remote Sensing. Apr2026, Vol. 18 Issue 7, p1026. 23p. |
| Subjects: | Object tracking (Computer vision), Drone aircraft, Computer vision, Multisensor data fusion, Remote sensing |
| Abstract: | Highlights: What are the main findings? A novel multimodal interaction and fusion Mamba network (MIFMNet) is proposed for UAV RGBT tracking. It features two core modules, scale differential enhanced Mamba (SDEM) and flow-guided multilayer interaction Mamba (FMIM), which address the trade-off between interaction capability and computational efficiency in existing CNN/Transformer frameworks. MIFMNet achieves state-of-the-art performance on four mainstream RGBT benchmarks (LasHeR, RGBT210, RGBT234, VTUAV), with an inference speed of 35.3 FPS and superior robustness in UAV-specific challenges (scale variation, rapid motion, occlusion). What are the implications of the main findings? The scale differential enhancement and flow-guided motion-aware interaction mechanisms of MIFMNet provide an efficient solution for multimodal fusion in dynamic, resource-constrained remote sensing observation scenarios. Extending Mamba to RGBT tracking verifies its potential for linear-complexity long-range modeling in multimodal vision tasks, offering a new architectural alternative to CNNs and Transformers for UAV computer vision applications. RGBT tracking holds irreplaceable value in unmanned aerial vehicle (UAV) ground observation missions, effectively supporting scenarios such as nighttime monitoring and low-altitude reconnaissance. However, existing frameworks based on CNNs or Transformers face inherent trade-offs between interaction capability and computational efficiency. Furthermore, current methods perform poorly in challenging scenarios involving target scale variation and rapid motion from UAV perspectives. To address these issues, this paper proposes a novel multimodal interaction and fusion Mamba network (MIFMNet), which differs fundamentally from existing RGBT fusion trackers and recent Mamba-based tracking methods. Unlike existing RGBT trackers, which rely on the local convolutions of CNNs or the quadratic-complexity self-attention of Transformers for cross-modal fusion, MIFMNet builds modality-adaptive interaction mechanisms on Mamba, fully leveraging complementary information while resolving the efficiency-accuracy trade-off. Specifically, this paper designs the scale differential enhanced Mamba (SDEM), which expands the receptive field through multiscale parallel convolutions and amplifies complementary information via differential strategies, strengthening feature responses to scale-varying objects. Furthermore, this paper proposes the flow-guided multilayer interaction Mamba (FMIM), which integrates inter-frame motion information into scanning prediction. This enables the network to adaptively adjust interaction priorities between shallow texture features and high-level semantic features based on motion intensity, mitigating early information forgetting and enhancing robustness in dynamic scenes. Extensive experiments on four major benchmarks demonstrate that MIFMNet achieves state-of-the-art precision and success rate, particularly excelling in UAV scenarios involving occlusion, scale variation, and rapid motion. It also achieves an inference speed of 35.3 FPS, enabling efficient deployment on resource-constrained platforms and providing robust support for UAV applications of RGBT tracking. [ABSTRACT FROM AUTHOR] |
| Copyright of Remote Sensing is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Engineering Source |
| ISSN: | 20724292 |
| DOI: | 10.3390/rs18071026 |
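
The abstract contrasts quadratic-complexity self-attention with the linear-complexity sequence modeling of Mamba. The sketch below is only a generic illustration of that contrast, not the authors' implementation: `ssm_scan` and `attention_pair_count` are hypothetical names, and the fixed coefficients `a`, `b`, `c` are an assumption made for simplicity (Mamba itself uses input-dependent, selective parameters).

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear state-space recurrence (illustrative assumption, not MIFMNet):
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    """
    h = 0.0
    ys: List[float] = []
    for x in xs:  # one pass over the sequence: cost grows linearly with its length
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pair_count(n: int) -> int:
    """Number of query-key pairs scored by full self-attention over n tokens."""
    return n * n  # quadratic in sequence length


if __name__ == "__main__":
    # An impulse at the first step decays geometrically through the hidden state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
    # Pairwise scores attention would compute for a 256-token sequence.
    print(attention_pair_count(256))
```

The scan touches each token once while carrying context forward in the state `h`, which is the property the abstract refers to as linear-complexity long-range modeling.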