Relationship Analysis of Image-Text Pair in SNS Posts
Takuto Nabeoka, Yijun Duan, Qiang Ma
TL;DR
This work classifies image-text pairs in SNS posts into Similar and Complementary relationships, with a focus on the harder problem of detecting Complementary information. It introduces a graph-based pipeline that encodes image-text pairs with CLIP, clusters the embeddings to form an ITRC-Graph, converts that graph to an ITRC-Line Graph, and uses a GCNII to learn edge representations, which are fused with the original embeddings and fed to an MLP classifier. On the DisRel dataset, the approach achieves higher Macro-F1 and Complementary F1 than prior methods, with a reported Complementary F1 of around 0.67 and overall accuracy near 0.70. The results point to the value of modeling relations between pairs through clustered graph structures and multimodal fusion when analyzing image-text relationships in SNS content.
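The two headline metrics, Complementary F1 and Macro-F1, can be illustrated with a small sketch. The labels and predictions below are invented toy values, not data from DisRel, and the helper names `f1_for_label` and `macro_f1` are hypothetical; the point is only to show that Macro-F1 is the unweighted mean of the per-class F1 scores, so a weak Complementary F1 pulls it down regardless of class frequency.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy example (assumed values, for illustration only).
y_true = ["Similar", "Similar", "Complementary", "Complementary"]
y_pred = ["Similar", "Complementary", "Complementary", "Complementary"]

complementary_f1 = f1_for_label(y_true, y_pred, "Complementary")
similar_f1 = f1_for_label(y_true, y_pred, "Similar")
overall_macro_f1 = macro_f1(y_true, y_pred, ["Similar", "Complementary"])
```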
Abstract
Social networking services (SNS) contain vast amounts of image-text posts, and analyzing the relationship between the image and the text of each post is needed for better information retrieval. This study addresses the classification of image-text pairs in SNS posts, where prior work has had difficulty distinguishing relationships beyond similarity. We propose a graph-based method that classifies image-text pairs into similar and complementary relationships. Our approach first embeds images and text using CLIP and clusters the embeddings. Next, we construct an Image-Text Relationship Clustering Line Graph (ITRC-Line Graph), where clusters serve as nodes, and then swap edges and nodes to obtain a pseudo-graph representation. A Graph Convolutional Network (GCN) learns node and edge representations, which are fused with the original embeddings for final classification. Experimental results on a publicly available dataset demonstrate the effectiveness of our method.
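Swapping edges and nodes corresponds to the standard line-graph construction: every edge of the original graph becomes a node, and two such nodes are connected when the original edges share an endpoint. The sketch below shows this generic construction on a toy graph. The cluster names (`c0`, `c1`, `c2`) and the function `line_graph` are hypothetical, the input is assumed to be a simple undirected graph, and the code is not meant to reproduce the authors' exact ITRC-Line Graph.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of a simple undirected graph.

    `edges` is an iterable of 2-tuples of node names. The result maps each
    original edge (as a sorted tuple) to the set of original edges that
    share an endpoint with it.
    """
    # Normalise so that (a, b) and (b, a) denote the same edge.
    nodes = sorted({tuple(sorted(edge)) for edge in edges})
    adjacency = {node: set() for node in nodes}
    for e1, e2 in combinations(nodes, 2):
        if set(e1) & set(e2):  # the two edges share at least one endpoint
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


# Toy graph with three cluster nodes forming a triangle (assumed example).
toy_edges = [("c0", "c1"), ("c1", "c2"), ("c0", "c2")]
toy_line_graph = line_graph(toy_edges)

# A path graph: only the two edges meeting at "c1" become neighbours.
path_line_graph = line_graph([("c0", "c1"), ("c2", "c1")])
```

After the conversion, a model that learns node representations on the line graph is in effect learning representations for the edges of the original graph, which is why the construction suits classifying relationships rather than individual items.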
