Stability Analysis of ChatGPT-based Sentiment Analysis in AI Quality Assurance

Tinghui Ouyang; AprilPyone MaungMaung; Koichi Konishi; Yoshiki Seo; Isao Echizen

TL;DR

This study in AI quality management analyzes the stability of a ChatGPT-based sentiment-analysis product by separately examining operational uncertainty and model robustness. On the uncertainty side, it investigates architectural nondeterminism from sparse MoE routing, differences between web and API deployment, timing-driven drift, and the impact of prompt engineering. Robustness is assessed on the Amazon and SST datasets against four perturbations (typo, synonym, homoglyph, homophone) using $Acc$ and $ASR$, revealing synonym perturbations as the strongest attack while overall performance remains comparatively robust. The work highlights the practical need to monitor model versions, prompts, and timing when deploying LLM-based sentiment tools in QA pipelines. These insights offer a concrete stability evaluation framework for AI products that build on foundation models in real-world QA contexts.
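To make the two robustness metrics concrete, here is a minimal sketch of how $Acc$ and $ASR$ could be computed from predicted labels. The summary does not reproduce the paper's formulas, so the $ASR$ definition used here (the share of originally correct predictions that flip to incorrect after perturbation) is an assumption based on common usage. The function names and the toy labels are hypothetical, not taken from the paper.

```python
from typing import Sequence


def accuracy(labels: Sequence[str], preds: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def attack_success_rate(
    labels: Sequence[str],
    clean_preds: Sequence[str],
    perturbed_preds: Sequence[str],
) -> float:
    """Assumed ASR definition: among samples classified correctly on the
    clean text, the fraction misclassified on the perturbed text."""
    if not (len(labels) == len(clean_preds) == len(perturbed_preds)):
        raise ValueError("all inputs must have the same length")
    correct_before = 0
    flipped = 0
    for y, clean, perturbed in zip(labels, clean_preds, perturbed_preds):
        if clean == y:
            correct_before += 1
            if perturbed != y:
                flipped += 1
    # With no correct clean predictions there is nothing to attack.
    return flipped / correct_before if correct_before else 0.0


# Toy data (hypothetical): gold labels, predictions on clean text,
# and predictions on perturbed text (e.g. after synonym substitution).
labels = ["pos", "neg", "pos", "neg"]
clean = ["pos", "neg", "pos", "pos"]
perturbed = ["pos", "pos", "neg", "pos"]

clean_acc = accuracy(labels, clean)
perturbed_acc = accuracy(labels, perturbed)
asr = attack_success_rate(labels, clean, perturbed)
```

Under this definition, a perturbation type with a higher $ASR$ is a stronger attack, and comparing `clean_acc` with `perturbed_acc` shows the corresponding drop in accuracy.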

Abstract

In the era of large AI models, complex architectures and vast parameter counts, as in large language models (LLMs), present substantial challenges for effective AI quality management (AIQM). This paper investigates the quality assurance of a specific LLM-based AI product: a ChatGPT-based sentiment analysis system. The study examines stability issues related to both the operation and the robustness of the large AI model on which ChatGPT is based. Experimental analysis is conducted using benchmark datasets for sentiment analysis. The results reveal that the constructed ChatGPT-based sentiment analysis system exhibits uncertainty, which is attributed to various operational factors. They also demonstrate that the system has stability issues with respect to robustness when handling conventional small text attacks.

Paper Structure (15 sections, 2 equations, 8 figures, 5 tables)

Introduction
Overview
ChatGPT-based sentiment analysis system
Stability of AI
Uncertainty analysis
Model architecture design
Difference on using ChatGPT and ChatGPT API
Variance due to timing
Prompt engineering
Robustness testing
Data preparation
Evaluation metrics
Perturbation and robustness analysis
Conclusion
Acknowledgement

Figures (8)

Figure 1: Diagram of using ChatGPT for sentiment analysis
Figure 2: ChatGPT’s responses on two devices at same time
Figure 3: ChatGPT for sentiment analysis at different time
Figure 4: Designs of zero-, one-, and few-shot prompts for sentiment analysis
Figure 5: Sentiment analysis results for different prompt settings
...and 3 more figures
