Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection.

Bibliographic Details
Title:	Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection.
Authors:	Aggarwal, Sajal¹ (AUTHOR) sajal_24phdit01@dtu.ac.in, Pandey, Ananya¹ (AUTHOR) ananyapandey_2k21phdit08@dtu.ac.in, Vishwakarma, Dinesh Kumar¹ (AUTHOR) dinesh@dtu.ac.in
Source:	Applied Intelligence. Aug2025, Vol. 55 Issue 12, p1-22. 22p.
Abstract:	An inherent incongruity between the literal interpretation and the intended connotation characterizes sarcasm. Though several studies have targeted text-based unimodal sarcasm detection, multimodal sarcasm detection is still an evolving discipline. Emerging forms of multimodal data, such as memes, demonstrate that studying textual data alone may be insufficient for sarcasm detection. Additional contextual information conveyed by images can completely alter the perceived connotation of the text, thus indicating a sarcastic sentiment only when the text and image are studied in combination. This study presents a novel framework for multimodal sarcasm detection that can process input triplets, i.e., the input text and its associated image, as provided in the datasets, and a supplementary modality introduced in the form of image captions. The visual semantic representation provided by image captions offers an additional opportunity to capture the textual and visual content discrepancies. The key components of the proposed model are: (1) a textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets. [ABSTRACT FROM AUTHOR]
	Copyright of Applied Intelligence is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

Publication Type:	Academic Journal
Language:	English
DOI:	10.1007/s10489-025-06717-6
ISSN:	0924-669X
Accession Number:	186768243
Permalink:	https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=186768243
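
Illustrative note: components (4) and (5) of the abstract describe attention modules that relate the text to two levels of image representation, followed by feature fusion. The sketch below shows one generic way such a step could look, using plain scaled dot-product cross-attention and concatenation. It is not the authors' architecture. The function names (`cross_attention`, `mean_pool`, `fuse`), the toy 2-dimensional features, and the choice of mean pooling and concatenation are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    queries: list of d-dim vectors (here, hypothetical text token features)
    keys, values: lists of d-dim vectors (hypothetical image or caption features)
    Returns (attended vectors, attention weights), one row per query.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, computed one coordinate at a time.
        attended.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return attended, weights


def mean_pool(vectors):
    """Average a list of equal-length vectors coordinate-wise."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fuse(text, image, caption):
    """Assumed fusion: concatenate pooled text, text-to-image, and text-to-caption features."""
    text_to_image, _ = cross_attention(text, image, image)
    text_to_caption, _ = cross_attention(text, caption, caption)
    return mean_pool(text) + mean_pool(text_to_image) + mean_pool(text_to_caption)


# Toy inputs: 2 text tokens, 3 image regions, 1 caption token (all made up).
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption = [[0.5, 0.5]]

fused = fuse(text, image, caption)
_, image_weights = cross_attention(text, image, image)
```

In this toy setup `fused` has six entries (three 2-dimensional blocks), and each row of `image_weights` sums to 1. A trained model would learn projections for queries, keys, and values and pass the fused vector to a classifier. Both steps are omitted here.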