Language-guided invariance probing of vision–language models.

Bibliographic Details
Title:	Language-guided invariance probing of vision–language models.
Authors:	Joong Lee, Jae¹ (AUTHOR) lee2161@purdue.edu
Source:	Pattern Recognition Letters. Apr2026, Vol. 202, p108-113. 6p.
Subjects:	Paraphrase
Abstract:	• Introduce LGIP, a benchmark for probing linguistic robustness of VLMs. • Generate paraphrases and semantic flips on MS COCO captions automatically. • Show EVA02-CLIP and OpenCLIP achieve a favorable invariance–sensitivity trade-off. • Reveal SigLIP models often score flipped captions above human ground truth. Vision–language models (VLMs) achieve strong zero-shot performance, yet their robustness to controlled linguistic perturbations remains poorly characterized. We propose Language-Guided Invariance Probing (LGIP), a benchmark that quantifies (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image–text matching. On 40k MS COCO images (five captions each), we generate paraphrases and rule-based flips that modify object category, color, or count, and evaluate frozen encoders with an invariance error, a sensitivity gap, and a positive-rate metric. Experiments on nine VLMs show that EVA02-CLIP and large OpenCLIP variants achieve a favorable invariance–sensitivity trade-off, whereas SigLIP and SigLIP2 exhibit substantially higher invariance error and can score flipped captions above human descriptions, particularly for object and color edits. These behaviors are largely obscured by standard retrieval metrics, highlighting LGIP as a lightweight, model-agnostic diagnostic of linguistic robustness beyond conventional accuracy. [ABSTRACT FROM AUTHOR]
	Copyright of Pattern Recognition Letters is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

ISSN:	01678655
DOI:	10.1016/j.patrec.2026.02.012