MultiSiam: A Multiple Input Siamese Network For Social Media Text Classification And Duplicate Text Detection

Sudhanshu Bhoi, Swapnil Markhedkar, Shruti Phadke, Prashant Agrawal

TL;DR

The paper tackles the challenge of abundant duplicate posts across social media by proposing MultiSiam, a condensed Siamese network capable of handling more than two inputs, and SMCD, a combined social media classification and duplication model. MultiSiam learns group-aware embeddings through a shared sub-network and a generalized triplet loss, while SMCD uses those embeddings for both category prediction and duplicate grouping. Empirical results on a Quora dataset show accuracy competitive with a standard Siamese baseline, and SMCD demonstrates the feasibility of joint categorization and deduplication on a custom 13-category dataset, illustrating its potential for cross-platform feed optimization. The approach aims to improve information access and user experience by delivering categorized, non-redundant social media content. Future avenues include transfer learning, new datasets, and alternative loss functions.

Abstract

Social media accounts post increasingly similar content, creating a chaotic experience across platforms that makes accessing desired information difficult. These posts can be organized by categorizing them and grouping duplicates across social handles and accounts. A post can have more than one duplicate; however, a conventional Siamese neural network considers only a pair of inputs for duplicate text detection. In this paper, we first propose a multiple-input Siamese network, MultiSiam. This condensed network is then used to propose another model, SMCD (Social Media Classification and Duplication Model), which performs both duplicate text grouping and categorization. The MultiSiam network, just like the Siamese network, can be used in multiple applications by changing the sub-network appropriately.
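
To make the multiple-input idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a toy bag-of-words encoder (the hypothetical `make_encoder`) standing in for the learned shared sub-network, Euclidean distance as the distance measure, and an illustrative threshold of 0.6. None of these names or values come from the paper. The same encoder embeds every input, and posts whose embeddings fall within the threshold are merged into one group, so a post can end up with more than one duplicate.

```python
import math
from itertools import combinations


def make_encoder(corpus):
    """Build a toy shared encoder (assumption: stands in for the learned sub-network)."""
    tokens = sorted({tok for text in corpus for tok in text.lower().split()})
    vocab = {tok: i for i, tok in enumerate(tokens)}

    def encode(text):
        # Bag-of-words counts, L2-normalized so distances are comparable.
        vec = [0.0] * len(vocab)
        for tok in text.lower().split():
            if tok in vocab:
                vec[vocab[tok]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    return encode


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_duplicates(texts, threshold=0.6):
    """Group any number of texts; threshold is an illustrative value, not from the paper."""
    encode = make_encoder(texts)
    embeddings = [encode(t) for t in texts]  # one shared encoder for all inputs
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of inputs whose embeddings are closer than the threshold.
    for i, j in combinations(range(len(texts)), 2):
        if euclidean(embeddings[i], embeddings[j]) < threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


posts = [
    "Breaking news today",
    "breaking NEWS today",
    "today breaking news",
    "cute cat pictures",
]
print(group_duplicates(posts))  # [[0, 1, 2], [3]]
```

The first three posts share the same bag of words, so they land in a single group of three, which a strictly pairwise comparison would only report as separate pairs. In the paper's setting, the encoder would be a trained sub-network rather than word counts.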

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: The MultiSiam network architecture with example sub-networks.
  • Figure 2: MultiSiam, during inference, with distance measure and example groups of given inputs based on the embeddings produced.
  • Figure 3: SMCD model for text categorization and duplicate text detection with respective losses.