Unconventional application of k-means for distributed approximate similarity search.

Saved in:
Bibliographic Details
Title: Unconventional application of k-means for distributed approximate similarity search.
Authors: Ortega, Felipe1 (AUTHOR) felipe.ortega@urjc.es, Algar, Maria Jesus1 (AUTHOR) mariajesus.algar@urjc.es, de Diego, Isaac Martín1 (AUTHOR) isaac.martin@urjc.es, Moguerza, Javier M.1 (AUTHOR) javier.moguerza@urjc.es
Source: Information Sciences. Jan2023, Vol. 620, p208-234. 27p.
Subjects: Indexing, Metric spaces, Computing platforms, Distributed computing, Function spaces, Machine learning
Abstract: • This paper presents MASK, a multilevel algorithm for approximate similarity search • MASK can distribute the index over as many computing nodes as we can afford. • Experimental results show the applicability of this novel indexing method • MASK achieves superior performance with high-dimensional and high-sparsity datasets. Similarity search based on a distance function in metric spaces is a fundamental problem for many applications. Queries for similar objects lead to the well-known machine learning task of nearest-neighbours identification. Many data indexing strategies, collectively known as Metric Access Methods (MAM), have been proposed to speed up these queries. Moreover, since exact approaches to solving similarity queries can be complex and time-consuming, alternative options have emerged to reduce query execution time, such as returning approximate results or resorting to distributed computing platforms. In this paper, we introduce MASK (Multilevel Approximate Similarity search with k -means), an unconventional application of the k -means algorithm as the foundation of a multilevel index structure for approximate similarity search suitable for metric spaces. We show that this method leverages inherent properties of k -means for this purpose, like representing high-density data areas with fewer prototypes. An implementation of this new indexing procedure is evaluated using a synthetic dataset and two real-world datasets in high-dimensional and high-sparsity spaces. Experimental tests show that MASK performs better than alternative algorithms for approximate similarity search. Results are promising and underpin the applicability of this novel indexing method in multiple domains. [ABSTRACT FROM AUTHOR]
Copyright of Information Sciences is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
FullText Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 160632029
AccessLevel: 6
PubType: Periodical
PubTypeId: serialPeriodical
PreciseRelevancyScore: 0
IllustrationInfo
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Unconventional application of k-means for distributed approximate similarity search.
– Name: Author
  Label: Authors
  Group: Au
  Data: <searchLink fieldCode="AR" term="%22Ortega%2C+Felipe%22">Ortega, Felipe</searchLink><relatesTo>1</relatesTo> (AUTHOR)<i> felipe.ortega@urjc.es</i><br /><searchLink fieldCode="AR" term="%22Algar%2C+Maria+Jesus%22">Algar, Maria Jesus</searchLink><relatesTo>1</relatesTo> (AUTHOR)<i> mariajesus.algar@urjc.es</i><br /><searchLink fieldCode="AR" term="%22de+Diego%2C+Isaac+Martín%22">de Diego, Isaac Martín</searchLink><relatesTo>1</relatesTo> (AUTHOR)<i> isaac.martin@urjc.es</i><br /><searchLink fieldCode="AR" term="%22Moguerza%2C+Javier+M%2E%22">Moguerza, Javier M.</searchLink><relatesTo>1</relatesTo> (AUTHOR)<i> javier.moguerza@urjc.es</i>
– Name: TitleSource
  Label: Source
  Group: Src
  Data: <searchLink fieldCode="JN" term="%22Information+Sciences%22">Information Sciences</searchLink>. Jan2023, Vol. 620, p208-234. 27p.
– Name: Subject
  Label: Subjects
  Group: Su
  Data: <searchLink fieldCode="DE" term="%22Indexing%22">Indexing</searchLink><br /><searchLink fieldCode="DE" term="%22Metric+spaces%22">Metric spaces</searchLink><br /><searchLink fieldCode="DE" term="%22Computing+platforms%22">Computing platforms</searchLink><br /><searchLink fieldCode="DE" term="%22Distributed+computing%22">Distributed computing</searchLink><br /><searchLink fieldCode="DE" term="%22Function+spaces%22">Function spaces</searchLink><br /><searchLink fieldCode="DE" term="%22Machine+learning%22">Machine learning</searchLink>
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: • This paper presents MASK, a multilevel algorithm for approximate similarity search • MASK can distribute the index over as many computing nodes as we can afford. • Experimental results show the applicability of this novel indexing method • MASK achieves superior performance with high-dimensional and high-sparsity datasets. Similarity search based on a distance function in metric spaces is a fundamental problem for many applications. Queries for similar objects lead to the well-known machine learning task of nearest-neighbours identification. Many data indexing strategies, collectively known as Metric Access Methods (MAM), have been proposed to speed up these queries. Moreover, since exact approaches to solving similarity queries can be complex and time-consuming, alternative options have emerged to reduce query execution time, such as returning approximate results or resorting to distributed computing platforms. In this paper, we introduce MASK (Multilevel Approximate Similarity search with k -means), an unconventional application of the k -means algorithm as the foundation of a multilevel index structure for approximate similarity search suitable for metric spaces. We show that this method leverages inherent properties of k -means for this purpose, like representing high-density data areas with fewer prototypes. An implementation of this new indexing procedure is evaluated using a synthetic dataset and two real-world datasets in high-dimensional and high-sparsity spaces. Experimental tests show that MASK performs better than alternative algorithms for approximate similarity search. Results are promising and underpin the applicability of this novel indexing method in multiple domains. [ABSTRACT FROM AUTHOR]
– Name: AbstractSuppliedCopyright
  Label:
  Group: Ab
  Data: <i>Copyright of Information Sciences is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract.</i> (Copyright applies to all Abstracts.)
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=160632029
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1016/j.ins.2022.11.024
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 27
        StartPage: 208
    Subjects:
      – SubjectFull: Indexing
        Type: general
      – SubjectFull: Metric spaces
        Type: general
      – SubjectFull: Computing platforms
        Type: general
      – SubjectFull: Distributed computing
        Type: general
      – SubjectFull: Function spaces
        Type: general
      – SubjectFull: Machine learning
        Type: general
    Titles:
      – TitleFull: Unconventional application of k-means for distributed approximate similarity search.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Ortega, Felipe
      – PersonEntity:
          Name:
            NameFull: Algar, Maria Jesus
      – PersonEntity:
          Name:
            NameFull: de Diego, Isaac Martín
      – PersonEntity:
          Name:
            NameFull: Moguerza, Javier M.
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 05
              M: 01
              Text: Jan2023
              Type: published
              Y: 2023
          Identifiers:
            – Type: issn-print
              Value: 00200255
          Numbering:
            – Type: volume
              Value: 620
          Titles:
            – TitleFull: Information Sciences
              Type: main
ResultId 1