Siamese-ViT: A Local–Global Feature Fusion Method for Real-Time Visual Navigation of UAVs in Real-World Environments.
| Title: | Siamese-ViT: A Local–Global Feature Fusion Method for Real-Time Visual Navigation of UAVs in Real-World Environments. |
|---|---|
| Authors: | Cheng, Yu (AUTHOR), Liu, Xixiang (AUTHOR) 101010902@seu.edu.cn, Chen, Shuai (AUTHOR), Xu, Chuan (AUTHOR) |
| Source: | Remote Sensing. May 2026, Vol. 18 Issue 10, p1556. 24p. |
| Subjects: | K-means clustering, Dimensional reduction algorithms, Feature extraction, Wireless geolocation systems, Artificial neural networks, Search algorithms |
| Abstract: | Highlights: What are the main findings? We propose a Siamese-ViT-based architecture that maps aerial images to satellite images for real-time visual navigation. The backbone network uses a ViT as the global feature extractor and shares weights between its two branches to improve cross-view geolocation. We employ a K-means-based local feature aggregation method to reduce the cost and matching complexity that high-dimensional features cause in visual scene matching. This method aggregates local features into a fixed number of cluster centers, which compresses the features and improves retrieval efficiency. To improve real-time matching performance, we use IPCA to reduce the dimensionality of the feature space. Furthermore, we build a KD-tree from the feature vector library of the satellite images, recursively partitioning the feature space into hyperrectangular regions for efficient search. What are the implications of the main findings? For navigation over complex terrain when satellite navigation signals are denied, our method integrates local fine-grained information with global semantic information and uses IPCA to reduce the dimensionality of the feature space, providing a feasible solution for autonomous UAV flight. For real-time visual navigation in real-world scenarios, we collected aerial photography data from a variety of complex real-world environments to validate our algorithm, laying a foundation for applying deep learning to intelligent positioning. Visual scene matching navigation (VSMN) for unmanned aerial vehicles (UAVs) offers high precision, high reliability, and autonomy. Its biggest challenges are the tension between local fine-grained information and global semantics and the limited generalization ability in real-world environments. While existing Transformer-based cross-view geolocation methods improve global context modeling, they still generally suffer from high demands on training data and computational resources, insufficient fusion of local fine-grained information with global semantics, and limited real-time performance in complex real-world environments. To address these problems, we propose a scene matching and localization algorithm based on Siamese-ViT. For feature extraction, we use the ViT model to extract global features and K-means clustering to aggregate local features; combining the two yields a robust local–global feature representation vector. For feature matching, incremental principal component analysis (IPCA) reduces the dimensionality of the feature space, and a KD-tree is built for fast feature retrieval to improve matching efficiency. We validated our algorithm on the University-1652 dataset and on a dataset of real-world satellite-drone image pairs. The results show that our Siamese-ViT outperforms the compared models in both Recall and AP. We also conducted flight experiments in real-world environments, capturing drone images of complex scenes including farmland, urban buildings, and waterways. The results show that, at a flight altitude of 350 m, our algorithm achieves mean absolute errors of 6.2063 m in latitude and 6.7552 m in longitude, and a mean horizontal error of 10.1922 m. These results indicate that Siamese-ViT achieves high overall positioning accuracy in the tested scenes. [ABSTRACT FROM AUTHOR] |
| Copyright of Remote Sensing is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Engineering Source |
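
The retrieval pipeline summarized in the abstract (K-means aggregation of local features, fusion with a global ViT feature, IPCA reduction, KD-tree search) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the feature sizes, the default parameters, the rule of sorting cluster centers by L2 norm, and the use of scikit-learn are all hypothetical choices, and the ViT itself is replaced by pre-extracted feature arrays.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import IncrementalPCA
from sklearn.neighbors import KDTree


def fuse_descriptor(patch_features, global_feature, n_clusters=4, seed=0):
    """Concatenate K-means centers of the local features with the global feature.

    patch_features: (n_patches, dim) array of local features (assumed input).
    global_feature: (dim,) array standing in for the ViT global feature.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    centers = km.fit(patch_features).cluster_centers_
    # Assumption: K-means cluster labels are arbitrary, so the centers are
    # sorted by L2 norm to give the concatenated vector a repeatable layout.
    centers = centers[np.argsort(np.linalg.norm(centers, axis=1))]
    return np.concatenate([centers.ravel(), global_feature])


def build_index(gallery, n_components=8, batch_size=10):
    """Fit IPCA on the satellite descriptors and index the reduced vectors."""
    ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    reduced = ipca.fit_transform(gallery)  # fitted batch by batch internally
    return ipca, KDTree(reduced)


def locate(ipca, tree, query_descriptor):
    """Return the index of the gallery entry nearest to one drone descriptor."""
    reduced = ipca.transform(query_descriptor[None, :])
    _, idx = tree.query(reduced, k=1)
    return int(idx[0, 0])
```

In this sketch the satellite descriptors are reduced and indexed once, offline, so each drone frame costs only one K-means fit, one linear projection, and one tree query. How the paper orders the cluster centers, normalizes the fused vector, or chooses the reduced dimension is not stated in the abstract and would need to be taken from the full text.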