Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection.
| Title: | Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection. |
|---|---|
| Authors: | Aggarwal, Sajal (sajal_24phdit01@dtu.ac.in); Pandey, Ananya (ananyapandey_2k21phdit08@dtu.ac.in); Vishwakarma, Dinesh Kumar (dinesh@dtu.ac.in) |
| Source: | Applied Intelligence. Aug 2025, Vol. 55, Issue 12, pp. 1-22 (22 pages). |
| Abstract: | An inherent incongruity between the literal interpretation and the intended connotation characterizes sarcasm. Though several studies have targeted text-based unimodal sarcasm detection, multimodal sarcasm detection is still an evolving discipline. Emerging forms of multimodal data, such as memes, demonstrate that studying textual data alone may be insufficient for sarcasm detection. Additional contextual information conveyed by images can completely alter the perceived connotation of the text, thus indicating a sarcastic sentiment only when the text and image are studied in combination. This study presents a novel framework for multimodal sarcasm detection that can process input triplets, i.e., the input text and its associated image, as provided in the datasets, and a supplementary modality introduced in the form of image captions. The visual semantic representation provided by image captions offers an additional opportunity to capture the textual and visual content discrepancies. The key components of the proposed model are: (1) a textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets. [ABSTRACT FROM AUTHOR] |
| Copyright: | Copyright of Applied Intelligence is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: | Engineering Source |
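
Components (4) and (5) of the abstract, attention between the text and two levels of image representation followed by feature fusion, can be illustrated with a minimal sketch. The abstract does not specify the attention formulation or the fusion operator, so this sketch assumes plain scaled dot-product attention and fusion by concatenating mean-pooled vectors. All function names and the toy feature values are hypothetical, and it should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Assumed scaled dot-product attention: each query row attends over
    the rows of keys_values, which serve as both keys and values."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def fuse(text, image_regions, caption):
    """Assumed fusion: concatenate the pooled text features with the
    text-to-image and text-to-caption attended features."""
    text_image = mean_pool(cross_attention(text, image_regions))
    text_caption = mean_pool(cross_attention(text, caption))
    return mean_pool(text) + text_image + text_caption


# Toy features (hypothetical): 2 text tokens, 3 image regions, 1 caption token.
text = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
caption = [[0.2, 0.1]]
fused = fuse(text, image_regions, caption)
```

In this sketch the text tokens act as queries against both the image-region features and the caption features, so the fused vector carries one text-image view and one text-caption view alongside the text itself. A classifier placed on top of such a vector is one plausible way to score incongruity.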