Siamese-ViT: A Local–Global Feature Fusion Method for Real-Time Visual Navigation of UAVs in Real-World Environments.

Bibliographic Details
Title: Siamese-ViT: A Local–Global Feature Fusion Method for Real-Time Visual Navigation of UAVs in Real-World Environments.
Authors: Cheng, Yu [1]; Liu, Xixiang [1,2] (101010902@seu.edu.cn); Chen, Shuai [1,2]; Xu, Chuan [2]
Source: Remote Sensing. May 2026, Vol. 18, Issue 10, p. 1556. 24 pp.
Subjects: K-means clustering, Dimensional reduction algorithms, Feature extraction, Wireless geolocation systems, Artificial neural networks, Search algorithms
Abstract: Highlights: What are the main findings? We propose a Siamese-ViT-based architecture that maps aerial images to satellite images for real-time visual navigation. The backbone network uses a ViT as the global feature extractor and shares weights between its two branches to improve the model's handling of cross-view geolocation. We employ a K-means-based local feature aggregation method to address the efficiency and matching-complexity problems that high-dimensional features cause in visual scene matching. The method aggregates local features into a fixed number of cluster centers, compressing the high-dimensional features and improving retrieval efficiency. To improve the real-time performance of matching, we use IPCA to reduce the dimensionality of the feature space. Furthermore, we build a KD-tree from the feature vector library of the satellite images, recursively partitioning the feature space into hyperrectangular regions for efficient searching.
What are the implications of the main findings? To address navigation in complex terrain when satellite navigation is denied, our method integrates local fine-grained information with global semantic information and uses IPCA to reduce the dimensionality of the feature space, providing a feasible solution for autonomous UAV flight. To address real-time visual navigation in real-world scenarios, we collected aerial photography data from a variety of complex real-world environments to validate our algorithms, laying a foundation for applying deep learning to intelligent positioning.
Visual scene matching navigation (VSMN) for unmanned aerial vehicles (UAVs) offers high precision, high reliability, and autonomy. Its main challenges are the tension between local fine-grained information and global semantics, and limited generalization in real-world environments. Existing Transformer-based cross-view geolocation methods improve global context modeling, but they still generally suffer from high demands on training data and computational resources, insufficient fusion of local fine-grained information with global semantics, and limited real-time performance in complex real-world environments. To address these problems, we propose a scene matching and localization algorithm based on Siamese-ViT. For feature extraction, the ViT model extracts global features and K-means clustering aggregates local features; the aggregated local features are combined with the ViT global features to produce a robust local–global feature representation vector. For feature matching, incremental principal component analysis (IPCA) reduces the dimensionality of the high-dimensional feature space, and a KD-tree is built for fast feature retrieval to improve matching efficiency. We validated the algorithm on the University-1652 dataset and on a dataset of real-world satellite–drone image pairs. The results show that Siamese-ViT outperforms the compared models in both Recall and AP. We also conducted flight experiments in real-world environments, capturing drone images of complex scenes that include farmland, urban buildings, and waterways. At a flight altitude of 350 m, the algorithm achieves mean absolute errors of 6.2063 m in latitude and 6.7552 m in longitude, and a mean horizontal error of 10.1922 m. These results indicate that Siamese-ViT achieves good overall positioning accuracy. [ABSTRACT FROM AUTHOR]
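The retrieval pipeline described in the abstract (K-means aggregation of local features, fusion with the global ViT feature, IPCA reduction, KD-tree lookup) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ViT outputs are replaced by random arrays, and the function names (fuse, build_index, query_index), the cluster count, the ordering of cluster centers by norm, the L2 normalization, and the number of retained components are all hypothetical choices that the abstract does not specify. The sketch relies on scikit-learn's KMeans and IncrementalPCA and on SciPy's cKDTree.

```python
from typing import NamedTuple, Sequence, Tuple

import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans
from sklearn.decomposition import IncrementalPCA

# One image = (global feature vector, matrix of local patch tokens).
Features = Tuple[np.ndarray, np.ndarray]


class Index(NamedTuple):
    ipca: IncrementalPCA  # fitted dimensionality reducer
    tree: cKDTree         # KD-tree over the reduced gallery descriptors


def fuse(global_vec, patch_tokens, n_clusters=4, seed=0):
    """Build a local-global descriptor: global vector + K-means centers."""
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=seed)
    centers = km.fit(patch_tokens).cluster_centers_
    # Assumption: order centers by norm so the layout of the descriptor
    # does not depend on arbitrary cluster labels.
    order = np.argsort(np.linalg.norm(centers, axis=1))
    fused = np.concatenate([global_vec, centers[order].ravel()])
    return fused / np.linalg.norm(fused)  # assumption: L2-normalize


def build_index(gallery: Sequence[Features], n_components=16, batch_size=32):
    """Fuse every satellite image, fit IPCA in batches, build the KD-tree."""
    fused = np.stack([fuse(g, p) for g, p in gallery])
    ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    reduced = ipca.fit(fused).transform(fused)
    return Index(ipca, cKDTree(reduced))


def query_index(index: Index, global_vec, patch_tokens, top_k=1):
    """Return (distances, gallery indices) of the top_k nearest descriptors."""
    q = index.ipca.transform(fuse(global_vec, patch_tokens)[None, :])[0]
    dist, idx = index.tree.query(q, k=top_k)
    return np.atleast_1d(dist), np.atleast_1d(idx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for ViT outputs: 64 satellite tiles, each with a
    # 32-d global vector and 64 patch tokens of dimension 32.
    gallery = [(rng.normal(size=32), rng.normal(size=(64, 32)))
               for _ in range(64)]
    index = build_index(gallery)
    g, p = gallery[10]
    dist, idx = query_index(index, g, p, top_k=3)
    print("nearest gallery tiles:", idx, "distances:", dist)
```

In a real system the matched gallery index would be mapped back to the geographic coordinates of the corresponding satellite tile; that step, like the feature extractor itself, is outside this sketch.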
Copyright of Remote Sensing is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
Header DbId: egs
DbLabel: Engineering Source
An: 194141081
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=194141081
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.3390/rs18101556
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 24
        StartPage: 1556
    Subjects:
      – SubjectFull: K-means clustering
        Type: general
      – SubjectFull: Dimensional reduction algorithms
        Type: general
      – SubjectFull: Feature extraction
        Type: general
      – SubjectFull: Wireless geolocation systems
        Type: general
      – SubjectFull: Artificial neural networks
        Type: general
      – SubjectFull: Search algorithms
        Type: general
    Titles:
      – TitleFull: Siamese-ViT: A Local–Global Feature Fusion Method for Real-Time Visual Navigation of UAVs in Real-World Environments.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Cheng, Yu
      – PersonEntity:
          Name:
            NameFull: Liu, Xixiang
      – PersonEntity:
          Name:
            NameFull: Chen, Shuai
      – PersonEntity:
          Name:
            NameFull: Xu, Chuan
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 15
              M: 05
              Text: May2026
              Type: published
              Y: 2026
          Identifiers:
            – Type: issn-print
              Value: 20724292
          Numbering:
            – Type: volume
              Value: 18
            – Type: issue
              Value: 10
          Titles:
            – TitleFull: Remote Sensing
              Type: main