Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data

Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

TL;DR

This work addresses generating sentiment-controlled feedback from multimodal inputs (text and images) by introducing the CMFeed dataset and a dual-branch feedback synthesis system with a controllability layer. The textual encoder uses a Transformer while the visual encoder relies on Faster R-CNN, and a KAAP-based interpretability framework reveals how features drive sentiment in generated feedback. Empirical results show a sentiment classification accuracy of $77.23\%$ and improved semantic relevance and ranking (MRR $=0.3789$) over baselines, with human evaluation validating sentiment alignment and relevance. The dataset and code are publicly available, enabling research in empathetic, context-aware feedback for education, healthcare, marketing, and customer service, along with transparent control signals to foster user trust.
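The MRR figure above is a mean reciprocal rank. As a minimal sketch of how such a score is conventionally computed (not the authors' evaluation code), the snippet below averages the reciprocal of the 1-based rank at which the first relevant item appears for each query. The function name `mean_reciprocal_rank` and the convention that a query with no relevant item contributes zero are assumptions made for illustration.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based rank of the first relevant item for query i,
    or None when no relevant item was retrieved (assumed to contribute 0).
    """
    if not ranks:
        raise ValueError("ranks must contain at least one query")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # no relevant item found: reciprocal rank taken as 0
        if rank < 1:
            raise ValueError("ranks are 1-based and must be >= 1")
        total += 1.0 / rank
    return total / len(ranks)


if __name__ == "__main__":
    # Three hypothetical queries whose first relevant items sit at ranks 1, 2 and 4.
    print(mean_reciprocal_rank([1, 2, 4]))
```

Under this definition a score near 1 means the relevant item is usually ranked first, while a score of 0.5 corresponds to it typically appearing in second place.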

Abstract

The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising text and images addresses a critical gap in human-computer interaction. This capability allows systems to provide empathetic, accurate, and engaging responses, with useful applications in education, healthcare, marketing, and customer service. To this end, we have constructed a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and proposed a controllable feedback synthesis system. The system features an encoder, decoder, and controllability block for textual and visual inputs. It extracts features using a transformer and a Faster R-CNN network, combining them to generate feedback. The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments. These reactions train the model to produce feedback with specified sentiments, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy without controllability. The CMFeed dataset and the system's code are available at https://github.com/MIntelligence-Group/CMFeed.

Paper Structure

This paper contains 42 sections, 9 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Proposed system's architecture with encoder, decoder and controllability blocks for textual and visual data. Decoder, convolution-gated unit, control layer, and similarity modules appear as subblocks; on/off neurons as black/white circles.
  • Figure 2: Depiction of the proposed interpretability technique. Here $k_i$ and $k_t$ are the numbers of partitions for the image and the text, $w_i$ is the image's width, and $L_t$ is the length of the text feature vector.
  • Figure 3: Sample feedback generated by the proposed system from input text and images (one of multiple images shown) under sentiment control. Fig. S1 in the supplementary material depicts feature heatmaps, salient words, and color denotations.
  • Figure S1: Feature heatmaps, salient words, and color denotations for the sample feedback in Figure 3.
  • Figure S2: Sample results along with interpretability plots. They depict the feedback generated by the proposed system using the news headline, text, and images (two out of multiple images shown) under the given sentiment-controllability constraint.
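The Figure 2 caption describes splitting an image of width $w_i$ into $k_i$ partitions and a text feature vector of length $L_t$ into $k_t$ partitions. The sketch below shows one plausible way to obtain such partitions. It assumes contiguous, near-equal segments, which the caption does not state, and the helper name `partition_bounds` is hypothetical rather than taken from the authors' code.

```python
from typing import List, Tuple


def partition_bounds(length: int, k: int) -> List[Tuple[int, int]]:
    """Split the index range [0, length) into k contiguous, near-equal parts.

    Returns half-open (start, end) pairs. When length is not divisible by k,
    the first (length % k) parts are one element longer than the rest.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if length < k:
        raise ValueError("length must be at least k")
    base, extra = divmod(length, k)
    bounds = []
    start = 0
    for j in range(k):
        end = start + base + (1 if j < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # Hypothetical sizes: an image 10 pixels wide split into k_i = 3 strips,
    # and a text feature vector of length 8 split into k_t = 4 segments.
    print(partition_bounds(10, 3))
    print(partition_bounds(8, 4))
```

Each returned pair could then index a vertical strip of the image or a slice of the text features, so that the contribution of one partition at a time can be examined.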