Improving Multilingual Social Media Insights: Aspect-based Comment Analysis
Longyin Zhang, Bowei Zou, Ai Ti Aw
TL;DR
This work tackles the noise and multilinguality of social media comments by introducing Comment Aspect Term Generation (CAT-G), a method that extracts the central opinion targets of a comment to guide model attention. It combines supervised fine-tuning of multilingual large language models with Direct Preference Optimization (DPO) to align the generated aspect terms with human preferences, and introduces the first multilingual CAT-G dataset, covering English, Chinese, Malay, and Indonesian. The CAT-G signals are integrated into monolingual and cross-lingual comment clustering (ComC), yielding measurable gains such as a +2.54 NMI improvement in cross-lingual clustering. The work also provides a benchmark framework, discusses limitations in distribution alignment and sentiment coupling, and addresses potential biases and risks associated with data-driven CAT generation.
Abstract
Social media posts are characterized by free-form language and a disjointed array of opinions and topics, which poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose identifying and generating aspect terms at the granular level of individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), and further align the model's predictions with human expectations through Direct Preference Optimization (DPO). We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set, covering English, Chinese, Malay, and Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.
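To make the preference-alignment step concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the authors' training code. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values in the usage note are illustrative assumptions. The inputs are the summed token log-probabilities of a preferred and a dispreferred aspect-term output under the model being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full output sequence.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # Implicit rewards: how far the policy has moved each output's
    # log-probability relative to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), computed in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss drops below that value as the policy assigns relatively more probability to the preferred output, for example `dpo_loss(-4.0, -8.0, -5.0, -7.0)` versus `dpo_loss(-5.0, -7.0, -5.0, -7.0)`.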
