A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis

Changzhi Zhou, Dandan Song, Yuhang Tian, Zhijing Wu, Hao Wang, Xinyu Zhang, Jun Yang, Ziyi Yang, Shuhao Zhang

TL;DR

This work conducts a comprehensive, unified evaluation of Large Language Models (LLMs) on Aspect-Based Sentiment Analysis (ABSA), covering 13 datasets, 8 subtasks, and 6 LLMs. It compares full fine-tuning of Small Language Models (SLMs), efficient fine-tuning of LLMs with LoRA, and zero-/few-shot API-based LLMs with carefully designed demonstration strategies, and finds that LLMs generally outperform SLMs in both the fine-tuning-dependent and fine-tuning-free paradigms. The authors introduce a unified ABSA task formulation and three demonstration selection strategies (Random, BM25, SimCSE), showing that demonstration selection significantly affects In-Context Learning (ICL) performance and that a hybrid strategy often yields the best results. Key findings include state-of-the-art results for LoRA-tuned LLMs on ABSA subtasks and strong zero-shot/few-shot performance by API LLMs. Results vary notably with subtask, model, and demonstration approach, which highlights both the promise and the limits of LLMs for ABSA in low-resource settings.

Abstract

Recently, Large Language Models (LLMs) have garnered increasing attention in natural language processing, revolutionizing numerous downstream tasks with powerful reasoning and generation abilities. For example, In-Context Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box LLMs to perform downstream tasks by learning from analogy, without any fine-tuning. In the fine-tuning-dependent paradigm, where substantial training data exists, Parameter-Efficient Fine-Tuning (PEFT) methods offer a cost-effective way for LLMs to achieve performance comparable to full fine-tuning. However, these techniques have not been fully exploited in Aspect-Based Sentiment Analysis (ABSA). Previous works probe LLMs in ABSA merely by using randomly selected input-output pairs as ICL demonstrations, resulting in an incomplete and superficial evaluation. In this paper, we conduct a comprehensive evaluation of LLMs in ABSA, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms." For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using instruction-based multi-task learning. For the fine-tuning-free paradigm, we propose 3 demonstration selection strategies to stimulate the few-shot abilities of LLMs. Our extensive experiments demonstrate that, in the fine-tuning-dependent paradigm, LLMs achieve new state-of-the-art performance, surpassing fine-tuned Small Language Models (SLMs). More importantly, in the fine-tuning-free paradigm, where SLMs are ineffective, LLMs with ICL still show impressive potential and even compete with fine-tuned SLMs on some ABSA subtasks.
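
The abstract does not name the three demonstration selection strategies; the TL;DR lists them as Random, BM25, and SimCSE. As a rough illustration of the BM25 variant only, the sketch below scores a small pool of labeled sentences against a test sentence with a standard Okapi BM25 formula and places the top-ranked ones in a few-shot prompt. The tokenizer, the `k1` and `b` values, the instruction wording, the triplet label format, and the example pool are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for illustration).
    return text.lower().split()

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every sentence in a non-empty corpus against the query (Okapi BM25)."""
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

def select_demonstrations(query, pool, k=2):
    """Return the k (sentence, label) pairs whose sentences score highest."""
    scores = bm25_scores(query, [sentence for sentence, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

def build_prompt(query, demos):
    # Hypothetical instruction text and input/output layout.
    parts = ["Extract (aspect, opinion, sentiment) triplets from the sentence."]
    for sentence, label in demos:
        parts.append(f"Input: {sentence}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Hypothetical labeled pool standing in for a training set.
pool = [
    ("The pizza was delicious but the service was slow",
     "[(pizza, delicious, positive), (service, slow, negative)]"),
    ("Battery life is great", "[(battery life, great, positive)]"),
    ("The screen is too dim", "[(screen, dim, negative)]"),
    ("The pasta was delicious", "[(pasta, delicious, positive)]"),
]
query = "The pasta was bland and the service was rude"
prompt = build_prompt(query, select_demonstrations(query, pool, k=2))
```

Swapping `bm25_scores` for a random draw, or for cosine similarity over sentence embeddings, would give analogues of the other two strategies under the same interface.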

Paper Structure

This paper contains 22 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: An illustration of different ABSA subtasks. The symbols $a$, $c$, $o$, and $s$ denote the aspect term, aspect category, opinion term, and sentiment polarity, respectively.
  • Figure 2: An example of the ASTE task formulation.
  • Figure 3: An example of the BM25-based and SimCSE-based selection strategies for the ASTE subtask.
  • Figure 4: The F1 scores of ChatGPT on the ASTE and ASQP subtasks. "Hybrid" denotes combining three demonstrations selected by BM25 with three demonstrations selected by SimCSE, arranged in random order.
  • Figure 5: The F1 scores of LLaMA3-8B+Random and ChatGPT+Random on the OE subtask.
  • ...and 2 more figures
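
The "Hybrid" setting described in the Figure 4 caption (three BM25-selected demonstrations plus three SimCSE-selected ones, in random order) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `hybrid_select`, the idea of passing precomputed rankings as lists of pool indices, the `seed` parameter, and the handling of demonstrations chosen by both retrievers (kept once here) are all assumptions, since the caption does not specify them.

```python
import random

def hybrid_select(bm25_ranked, simcse_ranked, k_each=3, seed=None):
    """Merge the top-k_each indices of two rankings and shuffle them.

    bm25_ranked / simcse_ranked: pool indices sorted from most to least
    similar to the test sentence (assumed to be computed elsewhere).
    """
    top = bm25_ranked[:k_each] + simcse_ranked[:k_each]
    # Assumption: an index picked by both retrievers is kept only once,
    # so the result may contain fewer than 2 * k_each demonstrations.
    picked = list(dict.fromkeys(top))
    # A seeded generator makes the random ordering reproducible.
    random.Random(seed).shuffle(picked)
    return picked

# Hypothetical rankings over a demonstration pool.
order = hybrid_select([4, 1, 7, 2], [9, 8, 3, 5], seed=0)
```

Here `order` holds six pool indices (4, 1, 7 from the first ranking and 9, 8, 3 from the second) in a shuffled order that depends on the seed.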