Can Large Language Models Understand Internet Buzzwords Through User-Generated Content

Chen Huang, Junkai Luo, Xinzuo Wang, Wenqiang Lei, Jiancheng Lv

TL;DR

The paper addresses the challenge of understanding rapidly evolving Chinese internet buzzwords by generating context-aware definitions from user-generated content (UGC). It introduces the CHEER dataset and the Ress prompting framework, which decomposes buzzword comprehension into six aspects inspired by how children learn language and then ensembles the resulting aspect-specific definitions. In a benchmark across multiple LLM backbones, Ress generally improves semantic accuracy and completeness over strong baselines, but overall performance remains far from optimal, especially for unseen buzzwords, which highlights the importance of UGC quality and volume. The work bridges linguistics and NLP by providing a public dataset and a prompting method, and it underscores the need for models that better infer meaning from contextual usage, with implications for dictionary construction and socio-psycholinguistic research.

Abstract

The massive user-generated content (UGC) available on Chinese social media makes it possible to study internet buzzwords. In this paper, we study whether large language models (LLMs) can generate accurate definitions for these buzzwords when given UGC as examples. Our work makes three contributions. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehension process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing crucial shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC to facilitate comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code are available at https://github.com/SCUNLP/Buzzword.
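
The TL;DR describes the method as producing one definition per aspect and then ensembling them. A minimal sketch of that two-stage flow is given below. It is an illustration under stated assumptions, not the authors' implementation: the six aspects are not named on this page, so placeholder labels are used; the prompt wording is invented; `define_buzzword` and the `llm` callable are hypothetical; and merging the drafts with a final LLM call is assumed, since the ensembling step is not specified here.

```python
from typing import Callable, Sequence

# Placeholder labels: the summary mentions six aspects but does not name them.
ASPECTS = [f"aspect_{i}" for i in range(1, 7)]


def define_buzzword(
    buzzword: str,
    ugc: Sequence[str],
    llm: Callable[[str], str],
    aspects: Sequence[str] = ASPECTS,
) -> str:
    """Draft one definition per aspect, then merge the drafts (assumed design)."""
    examples = "\n".join(f"- {post}" for post in ugc)

    # Stage 1: one aspect-specific definition per aspect, grounded in the UGC.
    drafts = []
    for aspect in aspects:
        prompt = (
            f"Buzzword: {buzzword}\n"
            f"User posts:\n{examples}\n"
            f"Focusing on {aspect}, define the buzzword."
        )
        drafts.append(llm(prompt))

    # Stage 2: ensemble the drafts into one definition.
    # Assumption: the merge is done by another LLM call.
    draft_list = "\n".join(f"- {d}" for d in drafts)
    merge_prompt = (
        f"Buzzword: {buzzword}\n"
        f"Draft definitions:\n{draft_list}\n"
        "Merge these drafts into one definition."
    )
    return llm(merge_prompt)
```

Any function that maps a prompt string to a completion string can be passed as `llm`, so the sketch can be exercised with a stub before wiring in a real model client.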

Paper Structure

This paper contains 31 sections, 7 figures, 17 tables.

Figures (7)

  • Figure 1: Task Illustration: generating definitions for Chinese buzzwords using UGC.
  • Figure 2: Illustration of Ress.
  • Figure 3: Human evaluation across different methods and LLM backbones via win rate. Ress produces better buzzword definitions from a user-centric perspective.
  • Figure 4: Semantic diversity analysis of aspect-specific definitions, measured by 1.0-Bscore. These aspects offer a multifaceted approach to understanding buzzwords.
  • Figure 5: Ress ablation on the number of aspects. We evaluate the performance of various aspect combinations of fixed sizes (i.e., 1, 3, 5) and report their mean and standard deviation. Employing an ensemble of aspects frequently demonstrates advantages.
  • ...and 2 more figures
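
The ablation in Figure 5 (aspect combinations of sizes 1, 3, and 5, summarized by mean and standard deviation) can be sketched as follows. This is a hedged illustration only: `ablate` and `score_fn` are hypothetical names, the sketch enumerates every combination of each size (the caption says only "various" combinations), and it uses the population standard deviation, which the caption does not specify.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence, Tuple


def ablate(
    aspects: Sequence[str],
    score_fn: Callable[[Tuple[str, ...]], float],
    sizes: Sequence[int] = (1, 3, 5),
) -> Dict[int, Tuple[float, float]]:
    """Return {subset size: (mean score, std of scores)} over aspect subsets."""
    summary = {}
    for k in sizes:
        # Assumption: score every k-sized subset of the aspects.
        scores = [score_fn(combo) for combo in combinations(aspects, k)]
        summary[k] = (mean(scores), pstdev(scores))
    return summary
```

With six aspects this evaluates 6, 20, and 6 subsets for sizes 1, 3, and 5. In practice `score_fn` would run the definition pipeline with the given aspects and return an evaluation metric.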