Relationship Analysis of Image-Text Pair in SNS Posts

Takuto Nabeoka, Yijun Duan, Qiang Ma

TL;DR

This work classifies image-text pairs in SNS posts as either Similar or Complementary, focusing on the harder problem of detecting Complementary information. It introduces a graph-based pipeline that encodes each image-text pair with CLIP, clusters the embeddings to form an ITRC-Graph, converts that graph to an ITRC-Line Graph, and uses GCNII to learn edge representations, which are fused with the original embeddings and fed to an MLP classifier. On the DisRel dataset, the approach achieves higher Macro-F1 and Complementary F1 than prior methods, with a reported Complementary F1 of around 0.67 and overall accuracy near 0.70. The results point to the value of modeling relations between pairs through clustered graph structures, combined with multimodal fusion, for understanding image-text relationships in SNS content.

Abstract

Social networking services (SNS) contain vast amounts of image-text posts, necessitating effective analysis of their relationships for improved information retrieval. This study addresses the classification of image-text pairs in SNS, overcoming prior limitations in distinguishing relationships beyond similarity. We propose a graph-based method to classify image-text pairs into similar and complementary relationships. Our approach first embeds images and text using CLIP, followed by clustering. Next, we construct an Image-Text Relationship Clustering Line Graph (ITRC-Line Graph), where clusters serve as nodes. Finally, edges and nodes are swapped in a pseudo-graph representation. A Graph Convolutional Network (GCN) then learns node and edge representations, which are fused with the original embeddings for final classification. Experimental results on a publicly available dataset demonstrate the effectiveness of our method.
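The graph construction described above can be illustrated with a minimal sketch. It assumes, since the abstract does not specify this, that image clusters and text clusters form separate node sets, that each image-text pair is an edge joining its image cluster to its text cluster, and that the line graph links two pairs whenever they share a cluster. The function name `build_line_graph` and the toy cluster ids are hypothetical; CLIP encoding, the clustering itself, and the GCN are omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_line_graph(image_clusters, text_clusters):
    """Build a line-graph adjacency over image-text pairs.

    Assumption (not confirmed by the source): pair i is an edge between
    the node ("img", image_clusters[i]) and the node ("txt", text_clusters[i]).
    In the line graph, each pair becomes a node, and two pairs are
    adjacent when their edges share an endpoint (a common cluster).
    """
    if len(image_clusters) != len(text_clusters):
        raise ValueError("each pair needs one image and one text cluster id")

    # Map every cluster node to the pairs (edges) incident to it.
    incident = defaultdict(list)
    for pair_id, (img_c, txt_c) in enumerate(zip(image_clusters, text_clusters)):
        incident[("img", img_c)].append(pair_id)
        incident[("txt", txt_c)].append(pair_id)

    # Swap roles: pairs become nodes, shared clusters become links.
    adjacency = {pair_id: set() for pair_id in range(len(image_clusters))}
    for pair_ids in incident.values():
        for a, b in combinations(pair_ids, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Toy cluster assignments for four pairs (hypothetical values).
image_clusters = [0, 0, 1, 2]
text_clusters = [0, 1, 1, 2]
line_graph = build_line_graph(image_clusters, text_clusters)
print(line_graph)
```

In this toy input, pairs 0 and 1 share an image cluster and pairs 1 and 2 share a text cluster, so they become neighbors in the line graph, while pair 3 stays isolated. A graph network run on this structure would treat each pair as a node, which is one way to read the "edges and nodes are swapped" step.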

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example posts from X and the image-text pair classes defined in DisRel (Sosea et al., 2021)
  • Figure 2: Overview of Proposed Method
  • Figure 3: The process of Clustered Edge Embedding
  • Figure 4: Learning Model of ITRC-Line Graph
  • Figure 5: Distribution of image and text embeddings from the CLIP encoder after dimensionality reduction. The "Complementary" and "Similar" classes in DisRel are shown in blue and orange, respectively.