Knowledge-injected prompt tuning with semantic regularization for fine-grained image recognition.

Bibliographic Details
Title: Knowledge-injected prompt tuning with semantic regularization for fine-grained image recognition.
Authors: Zhang, Qinyu1 (AUTHOR), Liu, Xinda1 (AUTHOR) liuxinda@nwu.edu.cn, Liu, Qi1 (AUTHOR), Zhou, Pengbo2 (AUTHOR), Geng, Guohua1 (AUTHOR)
Source: Knowledge-Based Systems. Jul2026, Vol. 346, pN.PAG-N.PAG. 1p.
Subjects: Prompt engineering, Image recognition (Computer vision), Knowledge transfer, Machine learning, Language models
Abstract: Fine-grained image recognition requires models to precisely represent subtle but discriminative local semantic attributes that distinguish visually similar categories. Recent advances in vision-language models demonstrate strong semantic representation capabilities, yet their performance heavily relies on the effectiveness of prompt tuning strategies. However, current prompt tuning methods typically represent each category using a single, globally uniform semantic prompt, which inevitably overlooks the diverse and fine-grained attributes essential for accurately distinguishing visually similar categories. To this end, we propose Knowledge-Injected Prompt Tuning with Semantic Regularization (KIP), a method that injects fine-grained knowledge into prompts to explicitly encode and preserve diverse semantic attributes within the prompt embedding space. Specifically, KIP employs an independent encoding strategy to distinctly represent multiple fine-grained semantic descriptions derived from Large Language Models (LLMs). Moreover, to address the semantic inconsistency caused by hallucinations in LLMs, we introduce a novel semantic regularization mechanism based on Sinkhorn divergence. Extensive experiments across 11 widely used fine-grained image benchmarks demonstrate that KIP significantly enhances performance in cross-dataset transfer, base-to-novel generalization, and few-shot learning tasks. Notably, it achieves substantial improvements in the harmonic mean metric, highlighting its effectiveness and versatility when integrated with existing prompt tuning frameworks. The source code will be released publicly upon acceptance.
• Proposed KIP framework enhances fine-grained image recognition with LLM knowledge.
• KIP optimizes fine-grained knowledge via independent encoding and stepwise optimization.
• Extensive experiments show KIP outperforms state-of-the-art in multiple tasks.
• KIP reduces hallucination impact and improves generalization across categories.
• KIP enables plug-and-play integration into existing text-based prompt tuning frameworks.
[ABSTRACT FROM AUTHOR]
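Note: The abstract names Sinkhorn divergence as the basis of the semantic regularizer but does not give a formulation. As a generic illustration only, and not the authors' implementation, the following pure-Python sketch computes the standard debiased Sinkhorn divergence between two small sets of embedding vectors. The function names, the squared Euclidean cost, the uniform weights, and the values of `eps` and `n_iters` are all assumptions made for this example.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _sq_dist(p, q):
    """Squared Euclidean distance (the cost function is an assumption)."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def entropic_ot(xs, ys, eps=0.5, n_iters=200):
    """Entropy-regularized OT value between two uniformly weighted point sets.

    Runs log-domain Sinkhorn updates on the dual potentials f and g and
    returns the dual objective <a, f> + <b, g>.
    """
    n, m = len(xs), len(ys)
    log_a, log_b = -math.log(n), -math.log(m)  # uniform weights (assumption)
    cost = [[_sq_dist(x, y) for y in ys] for x in xs]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(n_iters):
        # Soft-min over columns, then over rows.
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps
                                for j in range(m)])
             for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps
                                for i in range(n)])
             for j in range(m)]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(xs, ys, eps=0.5, n_iters=200):
    """Debiased divergence: OT(x, y) - (OT(x, x) + OT(y, y)) / 2."""
    return (entropic_ot(xs, ys, eps, n_iters)
            - 0.5 * (entropic_ot(xs, xs, eps, n_iters)
                     + entropic_ot(ys, ys, eps, n_iters)))


# Toy 2-D "embeddings" (hypothetical data, not taken from the paper).
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near = [[0.9, 0.1], [0.1, 0.9], [0.8, 1.1]]
far = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

print(sinkhorn_divergence(reference, near))
print(sinkhorn_divergence(reference, far))
```

In this toy setup the divergence is zero when a set is compared with itself and grows as the two sets move apart, which is the property that makes it usable as a penalty on embeddings that drift from a reference set. How KIP actually applies the divergence to LLM-derived descriptions is described in the full paper, not in this record.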
Copyright of Knowledge-Based Systems is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
Description
ISSN:09507051
DOI:10.1016/j.knosys.2026.116115