TagGPT: Large Language Models are Zero-shot Multimodal Taggers

Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan

TL;DR

The paper tackles scalable tagging of multimodal social-media content by introducing TagGPT, a modular zero-shot tagging framework that builds a tagging system from textual clues extracted via OCR/ASR and other unimodal models and then uses LLM-driven generation or context-aware inference to assign tags. It presents a three-part tagging-system construction pipeline (textual clue conversion, LLM-based tag generation, post-processing) and two zero-shot taggers (generative and selective) built on a reusable framework with GPT-3.5 and SimCSE. Through experiments on Kuaishou and Food.com, TagGPT shows improved tag coverage, reduced redundancy, and competitive or superior tagging precision compared to baselines, with qualitative case studies highlighting the advantages of generative tagging. The work offers a practical, low-cost approach to cross-modal tagging with strong generalization potential and discusses limitations around LLM dependence, input length, and privacy, outlining avenues for future improvement.
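
As a rough illustration of the post-processing stage mentioned above, the sketch below filters LLM-generated candidate tags by frequency and then drops near-duplicates by semantic similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`embed`, `cosine`, `build_tag_set`), the thresholds, and the character-trigram embedding are all hypothetical. The trigram embedding is only a dependency-free stand-in for a real sentence embedding model such as SimCSE.

```python
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: character trigram counts.

    ASSUMPTION: a real system would call a sentence embedding model here.
    """
    padded = f"  {text.lower().strip()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_tag_set(candidates, min_count=2, sim_threshold=0.6):
    """Keep frequent candidate tags, skipping near-duplicates of tags already kept.

    min_count and sim_threshold are illustrative values, not taken from the paper.
    """
    counts = Counter(tag.lower().strip() for tag in candidates)
    kept = []
    # most_common() yields tags from most to least frequent.
    for tag, count in counts.most_common():
        if count < min_count:
            break  # every remaining tag is rarer still
        if all(cosine(embed(tag), embed(k)) < sim_threshold for k in kept):
            kept.append(tag)
    return kept


if __name__ == "__main__":
    raw = ["baking", "baking", "Baking ", "vegan", "vegan",
           "bakings", "bakings", "rare"]
    print(build_tag_set(raw))
```

Processing tags in descending frequency order means that, when two candidates are near-duplicates, the more frequent one is kept as the canonical tag.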

Abstract

Tags are pivotal in facilitating the effective distribution of multimedia content in various applications in the contemporary Internet era, such as search engines and recommendation systems. Recently, large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. In this work, we propose TagGPT, a fully automated system capable of tag extraction and multimodal tagging in a completely zero-shot fashion. Our core insight is that, through elaborate prompt engineering, LLMs are able to extract and reason about proper tags given textual clues of multimodal data, e.g., OCR, ASR, title, etc. Specifically, to automatically build a high-quality tag set that reflects user intent and interests for a specific application, TagGPT predicts large-scale candidate tags from a series of raw data by prompting LLMs, and the candidates are then filtered by frequency and semantics. Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching against the tag set, and a selective method with early matching in prompts. Notably, TagGPT provides a system-level solution based on a modular framework equipped with a pre-trained LLM (GPT-3.5 used here) and a sentence embedding model (SimCSE used here), each of which can be seamlessly replaced with a more advanced alternative. TagGPT is applicable to various modalities of data in modern social media and showcases strong generalization ability to a wide range of applications. We evaluate TagGPT on publicly available datasets, i.e., Kuaishou and Food.com, and demonstrate the effectiveness of TagGPT compared to existing hashtags and off-the-shelf taggers. Project page: https://github.com/TencentARC/TagGPT.
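
The two zero-shot tagging options described in the abstract can be sketched as follows. This is a hedged illustration rather than the project's actual code: `match_generated_tags` and `build_selective_prompt` are hypothetical names, the prompt wording and thresholds are invented for the example, no LLM is actually called, and the character-trigram `embed` function is a dependency-free stand-in for a sentence embedding model such as SimCSE.

```python
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding (ASSUMPTION: trigram counts)."""
    padded = f"  {text.lower().strip()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_generated_tags(generated, tag_set, threshold=0.5):
    """Generative option, late matching.

    `generated` stands for free-form tags an LLM has already produced;
    each one is mapped to its nearest tag in the tag set, or dropped
    if nothing is similar enough.
    """
    if not tag_set:
        return []
    matched = []
    for tag in generated:
        best = max(tag_set, key=lambda t: cosine(embed(tag), embed(t)))
        if cosine(embed(tag), embed(best)) >= threshold and best not in matched:
            matched.append(best)
    return matched


def build_selective_prompt(clues, tag_set, top_k=3):
    """Selective option, early matching.

    Candidate tags are shortlisted by similarity to the textual clues
    and placed in the prompt, so the LLM only has to choose among them.
    """
    query = embed(clues)
    shortlist = sorted(tag_set, key=lambda t: cosine(query, embed(t)), reverse=True)[:top_k]
    return (
        "Select the tags that best describe the content.\n"
        f"Content: {clues}\n"
        f"Candidate tags: {', '.join(shortlist)}"
    )


if __name__ == "__main__":
    tags = ["chocolate cake", "vegan", "grilling"]
    print(match_generated_tags(["chocolate cakes", "skateboarding"], tags))
    print(build_selective_prompt("How to bake a rich chocolate cake at home", tags, top_k=1))
```

In this sketch the generative path lets the model propose anything and constrains the output afterwards, while the selective path constrains the model up front by limiting what appears in the prompt.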

Paper Structure (22 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Given multimodal content from social media (e.g., Twitter, Weibo, etc.), a tagger aims to produce several phrases that can properly describe the content and reflect the user's interests.
  • Figure 2: Given a series of raw data from a specific application, TagGPT is capable of building a high-quality tagging system in an entirely zero-shot manner without extra knowledge or human annotation. Such a paradigm enables instant tagging of new applications with zero labor cost.
  • Figure 3: Given the tagging system established in Figure 2, TagGPT enables zero-shot tagging of new data in two alternative paradigms.
  • Figure 4: Statistical results of the metric "popularity" in the datasets. The horizontal axis denotes the number of times a tag is assigned to the data, and the vertical axis denotes the number of tags.
  • Figure 5: Statistical results of "least effort" in the metric "practicality" in the dataset. The horizontal axis denotes the number of tags assigned to a single data sample, and the vertical axis is the number of samples.
  • ...and 1 more figures