Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection.

Bibliographic Details
Title: Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection.
Authors: Aggarwal, Sajal (AUTHOR) sajal_24phdit01@dtu.ac.in; Pandey, Ananya (AUTHOR) ananyapandey_2k21phdit08@dtu.ac.in; Vishwakarma, Dinesh Kumar (AUTHOR) dinesh@dtu.ac.in
Source: Applied Intelligence. Aug 2025, Vol. 55, Issue 12, pp. 1-22 (22 pages).
Abstract: An inherent incongruity between the literal interpretation and the intended connotation characterizes sarcasm. Though several studies have targeted text-based unimodal sarcasm detection, multimodal sarcasm detection is still an evolving discipline. Emerging forms of multimodal data, such as memes, demonstrate that studying textual data alone may be insufficient for sarcasm detection. Additional contextual information conveyed by images can completely alter the perceived connotation of the text, thus indicating a sarcastic sentiment only when the text and image are studied in combination. This study presents a novel framework for multimodal sarcasm detection that can process input triplets, i.e., the input text and its associated image, as provided in the datasets, and a supplementary modality introduced in the form of image captions. The visual semantic representation provided by image captions offers an additional opportunity to capture the textual and visual content discrepancies. The key components of the proposed model are: (1) a textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets. [ABSTRACT FROM AUTHOR]
Copyright of Applied Intelligence is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
FullText Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 186768243
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PreciseRelevancyScore: 0
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=186768243
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1007/s10489-025-06717-6
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 22
        StartPage: 1
    Titles:
      – TitleFull: Extracting cross-modal semantic incongruity with attention for multimodal sarcasm detection.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Aggarwal, Sajal
      – PersonEntity:
          Name:
            NameFull: Pandey, Ananya
      – PersonEntity:
          Name:
            NameFull: Vishwakarma, Dinesh Kumar
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 08
              Text: Aug2025
              Type: published
              Y: 2025
          Identifiers:
            – Type: issn-print
              Value: 0924669X
          Numbering:
            – Type: volume
              Value: 55
            – Type: issue
              Value: 12
          Titles:
            – TitleFull: Applied Intelligence
              Type: main