Stability Analysis of ChatGPT-based Sentiment Analysis in AI Quality Assurance
Tinghui Ouyang, AprilPyone MaungMaung, Koichi Konishi, Yoshiki Seo, Isao Echizen
TL;DR
This study in AI quality management analyzes the stability of a ChatGPT-based sentiment-analysis product by separately examining operation uncertainty and model robustness. It investigates architectural nondeterminism from sparse MoE routing, web versus API deployment differences, and timing-driven drift, alongside the impact of prompt engineering. Robustness is assessed on Amazon and SST datasets against four perturbations (typo, synonym, homoglyph, homophone) using $Acc$ and $ASR$, revealing synonym perturbations as the strongest attack while overall performance remains comparatively robust. The work highlights the practical need to monitor model versions, prompts, and timing when deploying LLM-based sentiment tools in QA pipelines. These insights offer a concrete stability evaluation framework for AI products leveraging foundation models in real-world QA contexts.
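To make the two robustness metrics concrete, the following is a minimal sketch of how $Acc$ and $ASR$ might be computed from predicted labels. It rests on an assumption, since the summary above does not define them: $ASR$ is taken here in its common form, the fraction of samples classified correctly on clean text that become misclassified after perturbation. The function names and the `"pos"`/`"neg"` labels are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(
    clean_preds: Sequence[str],
    perturbed_preds: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Assumed ASR definition: among samples predicted correctly on clean
    text, the fraction predicted incorrectly on the perturbed text."""
    if not (len(clean_preds) == len(perturbed_preds) == len(labels)):
        raise ValueError("all inputs must have the same length")
    # Only samples the model got right before the attack can be "broken".
    candidates = [
        (adv, y)
        for clean, adv, y in zip(clean_preds, perturbed_preds, labels)
        if clean == y
    ]
    if not candidates:
        return 0.0
    return sum(adv != y for adv, y in candidates) / len(candidates)


# Toy data (hypothetical): four reviews, before and after a perturbation.
labels = ["pos", "neg", "pos", "neg"]
clean_preds = ["pos", "neg", "pos", "pos"]
perturbed_preds = ["pos", "pos", "neg", "pos"]

clean_acc = accuracy(clean_preds, labels)
perturbed_acc = accuracy(perturbed_preds, labels)
asr = attack_success_rate(clean_preds, perturbed_preds, labels)
print(clean_acc, perturbed_acc, asr)
```

Under this definition, a perturbation type (typo, synonym, homoglyph, or homophone) is stronger when it yields a lower perturbed accuracy and a higher $ASR$ on the same samples.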
Abstract
In the era of large AI models such as large language models (LLMs), complex architectures and vast parameter counts pose substantial challenges for effective AI quality management (AIQM). This paper investigates the quality assurance of a specific LLM-based AI product: a ChatGPT-based sentiment analysis system. The study examines stability issues related to both the operation and the robustness of the large AI model on which ChatGPT is based. Experimental analysis is conducted using benchmark datasets for sentiment analysis. The results reveal that the constructed ChatGPT-based sentiment analysis system exhibits uncertainty, which is attributed to various operational factors. They also demonstrate that the system exhibits stability issues, in terms of robustness, when handling conventional small text attacks.
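One way to quantify the operational uncertainty described in the abstract is to submit the same inputs several times and measure how often the returned labels disagree. The sketch below is illustrative only: the two statistics (fraction of unstable samples and mean majority agreement) are assumptions chosen for clarity, not metrics reported by the authors, and the labels are hard-coded rather than collected from any real model call.

```python
from collections import Counter
from typing import Sequence


def _check(runs: Sequence[Sequence[str]]) -> None:
    """Validate that every run labels the same non-empty set of samples."""
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one sample")
    if any(len(run) != len(runs[0]) for run in runs):
        raise ValueError("all runs must cover the same number of samples")


def unstable_fraction(runs: Sequence[Sequence[str]]) -> float:
    """Fraction of samples whose label differs in at least one run."""
    _check(runs)
    n_samples = len(runs[0])
    unstable = sum(
        1 for i in range(n_samples) if len({run[i] for run in runs}) > 1
    )
    return unstable / n_samples


def mean_majority_agreement(runs: Sequence[Sequence[str]]) -> float:
    """Average, over samples, of the share of runs that returned the
    most frequent label for that sample (1.0 means fully consistent)."""
    _check(runs)
    n_samples = len(runs[0])
    total = 0.0
    for i in range(n_samples):
        counts = Counter(run[i] for run in runs)
        total += counts.most_common(1)[0][1] / len(runs)
    return total / n_samples


# Hypothetical labels from three repeated runs over the same three reviews.
runs = [
    ["pos", "neg", "pos"],
    ["pos", "neg", "neg"],
    ["pos", "neg", "pos"],
]

unstable = unstable_fraction(runs)
agreement = mean_majority_agreement(runs)
print(unstable, agreement)
```

Running the same comparison across different times, prompts, or deployment channels would give one column of `runs` per condition, which keeps the operational factors separable.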
