Language-guided invariance probing of vision–language models.

Bibliographic Details
Title:	Language-guided invariance probing of vision–language models.
Authors:	Joong Lee, Jae (AUTHOR) lee2161@purdue.edu
Source:	Pattern Recognition Letters. Apr. 2026, Vol. 202, pp. 108-113 (6 pages).
Subjects:	Paraphrase
Highlights:	• Introduce LGIP, a benchmark for probing linguistic robustness of VLMs.
	• Generate paraphrases and semantic flips on MS COCO captions automatically.
	• Show EVA02-CLIP and OpenCLIP achieve a favorable invariance–sensitivity trade-off.
	• Reveal SigLIP models often prefer flipped captions above human ground truth.
Abstract:	Vision–language models (VLMs) achieve strong zero-shot performance, yet their robustness to controlled linguistic perturbations remains poorly characterized. We propose Language-Guided Invariance Probing (LGIP), a benchmark that quantifies (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image–text matching. On 40k MS COCO images (five captions each), we generate paraphrases and rule-based flips that modify object category, color, or count, and evaluate frozen encoders with an invariance error, a sensitivity gap, and a positive-rate metric. Experiments on nine VLMs show that EVA02-CLIP and large OpenCLIP variants achieve a favorable invariance–sensitivity trade-off, whereas SigLIP and SigLIP2 exhibit substantially higher invariance error and can score flipped captions above human descriptions, particularly for object and color edits. These behaviors are largely obscured by standard retrieval metrics, highlighting LGIP as a lightweight, model-agnostic diagnostic of linguistic robustness beyond conventional accuracy. [ABSTRACT FROM AUTHOR]
	Copyright of Pattern Recognition Letters is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source
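
Illustrative sketch (not part of the catalog record): the abstract names three quantities (an invariance error, a sensitivity gap, and a positive rate) and a rule-based color flip, but gives no formulas or rules. The Python below shows one plausible reading, assuming cosine similarity between precomputed image and caption embeddings from a frozen encoder. The function names, the `COLOR_SWAPS` table, and the exact metric definitions are all hypothetical and may differ from the published ones.

```python
from typing import Dict, Optional

import numpy as np

# Hypothetical swap table; the paper's actual flip rules are not in the abstract.
COLOR_SWAPS = {"red": "blue", "blue": "red", "black": "white", "white": "black"}


def flip_color(caption: str) -> Optional[str]:
    """Swap the first known color word; return None if the caption has none.

    Splits on whitespace only, so a color glued to punctuation ("red.") is missed.
    """
    words = caption.split()
    for i, word in enumerate(words):
        if word.lower() in COLOR_SWAPS:
            words[i] = COLOR_SWAPS[word.lower()]
            return " ".join(words)
    return None


def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def probe_metrics(
    img: np.ndarray, orig: np.ndarray, para: np.ndarray, flip: np.ndarray
) -> Dict[str, float]:
    """Assumed definitions over n (image, caption) pairs, all arrays shaped (n, d).

    invariance_error: mean |s(img, orig) - s(img, para)|   (lower is better)
    sensitivity_gap:  mean  s(img, orig) - s(img, flip)    (higher is better)
    positive_rate:    share of pairs with s(img, orig) > s(img, flip)
    """
    s_orig = cosine(img, orig)
    s_para = cosine(img, para)
    s_flip = cosine(img, flip)
    return {
        "invariance_error": float(np.mean(np.abs(s_orig - s_para))),
        "sensitivity_gap": float(np.mean(s_orig - s_flip)),
        "positive_rate": float(np.mean(s_orig > s_flip)),
    }
```

Under these assumed definitions, a model that "scores flipped captions above human descriptions" would show a negative sensitivity gap and a positive rate below 0.5 on the affected edit type.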

Publication Type:	Academic Journal
Accession Number:	192227091
Permalink:	https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=192227091
DOI:	10.1016/j.patrec.2026.02.012
ISSN:	0167-8655 (print)
Language:	English