A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis
Changzhi Zhou, Dandan Song, Yuhang Tian, Zhijing Wu, Hao Wang, Xinyu Zhang, Jun Yang, Ziyi Yang, Shuhao Zhang
TL;DR
This work conducts a comprehensive, unified evaluation of Large Language Models on Aspect-Based Sentiment Analysis across 13 datasets and 8 subtasks using 6 LLMs. It compares full fine-tuning of SLMs, efficient fine-tuning of LLMs with LoRA, and zero-/few-shot API-based LLMs with carefully designed demonstration strategies, revealing that LLMs generally outperform SLMs in both fine-tuning-dependent and fine-tuning-free paradigms. The authors introduce a unified ABSA task formulation and three demonstration strategies (Random, BM25, SimCSE), showing that demonstration selection significantly affects ICL performance and that a hybrid strategy often yields the best results. Key findings include state-of-the-art results for LoRA-tuned LLMs on ABSA subtasks and strong zero-shot/few-shot performance by API LLMs, with notable variability depending on subtask, model, and demonstration approach, highlighting both the promise and limits of LLMs for ABSA in low-resource settings.
Abstract
Recently, Large Language Models (LLMs) have garnered increasing attention in natural language processing, revolutionizing numerous downstream tasks with their powerful reasoning and generation abilities. For example, In-Context Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box LLMs to perform downstream tasks by learning from analogy, without any fine-tuning. In the fine-tuning-dependent paradigm, where substantial training data exists, Parameter-Efficient Fine-Tuning (PEFT) methods offer a cost-effective way for LLMs to achieve performance comparable to full fine-tuning. However, these techniques have not been fully exploited in Aspect-Based Sentiment Analysis (ABSA). Previous works probe LLMs on ABSA using only randomly selected input-output pairs as ICL demonstrations, resulting in an incomplete and superficial evaluation. In this paper, we conduct a comprehensive evaluation of LLMs on ABSA, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms." For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using instruction-based multi-task learning. For the fine-tuning-free paradigm, we propose 3 demonstration selection strategies to stimulate the few-shot abilities of LLMs. Our extensive experiments demonstrate that LLMs achieve new state-of-the-art performance compared to fine-tuned Small Language Models (SLMs) in the fine-tuning-dependent paradigm. More importantly, in the fine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still show impressive potential and even compete with fine-tuned SLMs on some ABSA subtasks.
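To make the idea of demonstration selection for ICL concrete, the following is a minimal sketch that contrasts random selection with lexical retrieval using Okapi BM25, one of the strategies named in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenizer, the `k1` and `b` defaults, the toy demonstration pool, the label format, and the prompt template are all hypothetical.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; punctuation stays attached).
    return text.lower().split()


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every sentence in `corpus` against `query` with Okapi BM25."""
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def select_demonstrations(query, pool, k=2, strategy="bm25", seed=0):
    """Pick k (sentence, label) pairs from `pool` to serve as ICL demonstrations."""
    if strategy == "random":
        return random.Random(seed).sample(pool, k)
    scores = bm25_scores(query, [sent for sent, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


def build_prompt(query, demos):
    # Hypothetical prompt template; the paper's unified formulation may differ.
    parts = ["Extract (aspect, sentiment) pairs from the sentence."]
    for sent, label in demos:
        parts.append(f"Sentence: {sent}\nOutput: {label}")
    parts.append(f"Sentence: {query}\nOutput:")
    return "\n\n".join(parts)


# Toy labeled pool (invented for illustration only).
pool = [
    ("The pizza was delicious but the service was slow.",
     "[(pizza, positive), (service, negative)]"),
    ("Battery life is great.", "[(battery life, positive)]"),
    ("The waiter was rude.", "[(waiter, negative)]"),
    ("The screen is too dim.", "[(screen, negative)]"),
]
query = "The pasta was delicious and the waiter was friendly."
demos = select_demonstrations(query, pool, k=2, strategy="bm25")
prompt = build_prompt(query, demos)
print(prompt)
```

With this toy pool, BM25 retrieves the two restaurant-domain sentences that share words with the query, whereas the `"random"` strategy may return demonstrations from an unrelated domain. A dense retriever such as SimCSE would replace `bm25_scores` with cosine similarity over sentence embeddings while leaving the rest of the pipeline unchanged.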
