Language-guided invariance probing of vision–language models.

Bibliographic Details
Title: Language-guided invariance probing of vision–language models.
Authors: Joong Lee, Jae (AUTHOR) lee2161@purdue.edu
Source: Pattern Recognition Letters. Apr2026, Vol. 202, p108-113. 6p.
Subjects: Paraphrase
Abstract:
• Introduce LGIP, a benchmark for probing linguistic robustness of VLMs.
• Generate paraphrases and semantic flips on MS COCO captions automatically.
• Show EVA02-CLIP and OpenCLIP achieve a favorable invariance–sensitivity trade-off.
• Reveal SigLIP models often prefer flipped captions above human ground truth.

Vision–language models (VLMs) achieve strong zero-shot performance, yet their robustness to controlled linguistic perturbations remains poorly characterized. We propose Language-Guided Invariance Probing (LGIP), a benchmark that quantifies (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image–text matching. On 40k MS COCO images (five captions each), we generate paraphrases and rule-based flips that modify object category, color, or count, and evaluate frozen encoders with an invariance error, a sensitivity gap, and a positive-rate metric. Experiments on nine VLMs show that EVA02-CLIP and large OpenCLIP variants achieve a favorable invariance–sensitivity trade-off, whereas SigLIP and SigLIP2 exhibit substantially higher invariance error and can score flipped captions above human descriptions, particularly for object and color edits. These behaviors are largely obscured by standard retrieval metrics, highlighting LGIP as a lightweight, model-agnostic diagnostic of linguistic robustness beyond conventional accuracy. [ABSTRACT FROM AUTHOR]
Copyright of Pattern Recognition Letters is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
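Note on the metrics: the abstract names an invariance error, a sensitivity gap, and a positive-rate metric, but this record does not give their definitions. The sketch below shows one plausible reading, offered only as an illustration. It assumes each metric is computed from per-example image–text similarity scores for the original caption, its paraphrase, and its flipped version. The function name `lgip_metrics`, its arguments, and the three formulas are assumptions and are not taken from the paper.

```python
from statistics import mean


def lgip_metrics(sim_orig, sim_para, sim_flip):
    """Hypothetical LGIP-style metrics (assumed definitions, not the paper's).

    Each argument is a sequence of image-text similarity scores, one per
    example, for the original caption, its paraphrase, and its semantic flip.
    """
    if not sim_orig or not (len(sim_orig) == len(sim_para) == len(sim_flip)):
        raise ValueError("score lists must be non-empty and of equal length")

    # Assumed: invariance error = mean absolute score change under paraphrase.
    invariance_error = mean(abs(o - p) for o, p in zip(sim_orig, sim_para))

    # Assumed: sensitivity gap = mean score drop from original to flipped caption.
    sensitivity_gap = mean(o - f for o, f in zip(sim_orig, sim_flip))

    # Assumed: positive rate = share of examples where the original caption
    # scores strictly higher than its flipped version.
    positive_rate = mean(1.0 if o > f else 0.0 for o, f in zip(sim_orig, sim_flip))

    return {
        "invariance_error": invariance_error,
        "sensitivity_gap": sensitivity_gap,
        "positive_rate": positive_rate,
    }


if __name__ == "__main__":
    # Toy scores for four image-caption pairs (made-up numbers for illustration).
    scores = lgip_metrics(
        sim_orig=[0.5, 0.25, 0.75, 0.5],
        sim_para=[0.5, 0.5, 0.75, 0.25],
        sim_flip=[0.25, 0.5, 0.25, 0.5],
    )
    print(scores)
```

Under this reading, a robust model would show a low invariance error together with a high sensitivity gap and positive rate. A positive rate below 0.5 would correspond to the behavior the abstract reports for some models, where flipped captions score above the human descriptions.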
FullText Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 192227091
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PreciseRelevancyScore: 0
IllustrationInfo
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Language-guided invariance probing of vision–language models.
– Name: Author
  Label: Authors
  Group: Au
  Data: <searchLink fieldCode="AR" term="%22Joong+Lee%2C+Jae%22">Joong Lee, Jae</searchLink><relatesTo>1</relatesTo> (AUTHOR)<i> lee2161@purdue.edu</i>
– Name: TitleSource
  Label: Source
  Group: Src
  Data: <searchLink fieldCode="JN" term="%22Pattern+Recognition+Letters%22">Pattern Recognition Letters</searchLink>. Apr2026, Vol. 202, p108-113. 6p.
– Name: Subject
  Label: Subjects
  Group: Su
  Data: <searchLink fieldCode="DE" term="%22Paraphrase%22">Paraphrase</searchLink>
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: • Introduce LGIP, a benchmark for probing linguistic robustness of VLMs. • Generate paraphrases and semantic flips on MS COCO captions automatically. • Show EVA02-CLIP and OpenCLIP achieve a favorable invariance–sensitivity trade-off. • Reveal SigLIP models often prefer flipped captions above human ground truth. Vision–language models (VLMs) achieve strong zero-shot performance, yet their robustness to controlled linguistic perturbations remains poorly characterized. We propose Language-Guided Invariance Probing (LGIP), a benchmark that quantifies (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image–text matching. On 40k MS COCO images (five captions each), we generate paraphrases and rule-based flips that modify object category, color, or count, and evaluate frozen encoders with an invariance error, a sensitivity gap, and a positive-rate metric. Experiments on nine VLMs show that EVA02-CLIP and large OpenCLIP variants achieve a favorable invariance–sensitivity trade-off, whereas SigLIP and SigLIP2 exhibit substantially higher invariance error and can score flipped captions above human descriptions, particularly for object and color edits. These behaviors are largely obscured by standard retrieval metrics, highlighting LGIP as a lightweight, model-agnostic diagnostic of linguistic robustness beyond conventional accuracy. [ABSTRACT FROM AUTHOR]
– Name: AbstractSuppliedCopyright
  Label:
  Group: Ab
  Data: <i>Copyright of Pattern Recognition Letters is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract.</i> (Copyright applies to all Abstracts.)
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=192227091
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1016/j.patrec.2026.02.012
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 6
        StartPage: 108
    Subjects:
      – SubjectFull: Paraphrase
        Type: general
    Titles:
      – TitleFull: Language-guided invariance probing of vision–language models.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Joong Lee, Jae
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 04
              Text: Apr2026
              Type: published
              Y: 2026
          Identifiers:
            – Type: issn-print
              Value: 01678655
          Numbering:
            – Type: volume
              Value: 202
          Titles:
            – TitleFull: Pattern Recognition Letters
              Type: main
ResultId 1