AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

Jiale Cheng, Yida Lu, Xiaotao Gu, Pei Ke, Xiao Liu, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

TL;DR

AutoDetect introduces a unified framework of three LLM-powered agents that automatically identifies weaknesses in large language models across instruction-following, mathematics, and coding tasks. By constructing a dynamic taxonomy, generating questions adapted to the target model, and evaluating responses with a strong LLM judge, it achieves identification success rates exceeding 30% on prominent models such as ChatGPT and Claude. The discovered weaknesses then guide targeted fine-tuning, which proves more effective than untargeted data augmentation such as Self-Instruct and improves popular LLMs, including the Llama series and Mistral-7b, by over 10% on several benchmarks. The work has practical implications for robust LLM alignment and offers a scalable way to systematically strengthen model reliability.
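
The detection cycle described above (build a taxonomy, generate questions, judge the replies) can be pictured as a simple nested loop. The sketch below is a minimal illustration, not the authors' code: it assumes each agent (named Examiner, Questioner, and Assessor in the paper) is a plain Python callable, and the names `detect_weaknesses` and `Finding`, the score threshold of 5.0, and the toy stubs are all hypothetical. In a real system each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures; in a real system each callable would wrap an LLM call.
ExaminerFn = Callable[[str], List[str]]        # task -> taxonomy of test points
QuestionerFn = Callable[[str], List[str]]      # test point -> probe questions
TargetFn = Callable[[str], str]                # question -> target model's reply
AssessorFn = Callable[[str, str], float]       # (question, reply) -> score, higher is better


@dataclass
class Finding:
    point: str
    question: str
    response: str
    score: float


def detect_weaknesses(task: str, examiner: ExaminerFn, questioner: QuestionerFn,
                      target: TargetFn, assessor: AssessorFn,
                      threshold: float = 5.0) -> List[Finding]:
    """One pass of the examine -> question -> assess cycle.

    A reply counts as exposing a weakness when its score is below `threshold`
    (an illustrative cut-off, not a value taken from the paper).
    """
    findings: List[Finding] = []
    for point in examiner(task):
        for question in questioner(point):
            response = target(question)
            score = assessor(question, response)
            if score < threshold:
                findings.append(Finding(point, question, response, score))
    return findings


# Toy stubs: the "model" handles string questions but fails recursion questions.
def toy_examiner(task: str) -> List[str]:
    return {"coding": ["string reversal", "recursion depth"]}[task]


def toy_questioner(point: str) -> List[str]:
    return [f"Q1 about {point}", f"Q2 about {point}"]


def toy_target(question: str) -> str:
    return "correct" if "string" in question else "wrong"


def toy_assessor(question: str, response: str) -> float:
    return 9.0 if response == "correct" else 2.0


found = detect_weaknesses("coding", toy_examiner, toy_questioner, toy_target, toy_assessor)
print([f.question for f in found])  # ['Q1 about recursion depth', 'Q2 about recursion depth']
```

Keeping each agent behind a plain callable makes it easy to swap the toy stubs for real model calls without changing the loop itself.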

Abstract

Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
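
One way to read the identification success rate is as the share of generated test questions on which the target model's reply is judged a failure. The function below is a hedged sketch under that reading. The name `identification_success_rate`, the score scale, and the failure threshold of 5.0 are assumptions made for illustration, and the paper's exact definition may differ.

```python
from typing import Sequence


def identification_success_rate(scores: Sequence[float], threshold: float = 5.0) -> float:
    """Share of probe questions whose assessor score falls below `threshold`.

    Both the score scale and the threshold are illustrative assumptions.
    """
    if not scores:
        raise ValueError("at least one score is required")
    failures = sum(1 for s in scores if s < threshold)
    return failures / len(scores)


# Toy scores for four probe questions; two fall below the threshold.
rate = identification_success_rate([8.0, 3.0, 9.0, 4.5])
print(rate)  # 0.5
```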

Paper Structure

This paper contains 42 sections, 6 equations, 14 figures, and 7 tables.

Introduction
Related Work
Evaluation Benchmarks
Red Teaming
Method
Problem Definition
AutoDetect Framework
Iterative Search
Model Enhancement
Experiments
Weakness Detection
Evaluation Metrics
Human Evaluation
Results
Model Enhancement
...and 27 more sections

Figures (14)

Figure 1: Effective weakness discovery can guide model enhancement. AutoDetect achieves high identification success rates in instruction-following, mathematics, and coding tasks (A), and leveraging the resulting data further improves LLMs (B).
Figure 2: Our framework comprises two cycles. The first, formed by the Examiner, Questioner, and Assessor, provides comprehensive and tailored testing; the second, an iterative search, adjusts question difficulty to the target model and thereby effectively identifies weaknesses.
Figure 3: The change in the average score during the iterative search process for the three tasks.
Figure 4: Improvement of Llama2-7b-Chat when training with identification data from GPT-3.5-turbo and itself.
Figure 5: Some weaknesses within LLMs revealed by AutoDetect. We flag the wrong parts of the responses in red, and some responses are omitted due to space restrictions.
...and 9 more figures
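
Figures 2 and 3 refer to an iterative search that adjusts question difficulty to the target model. The sketch below shows one simple way such a search could work and is not the authors' procedure: difficulty is modeled as an integer that rises whenever the model copes and stays put once it fails. The names `iterative_search`, `make_question`, and `answer_score`, the threshold of 5.0, and the toy stand-ins are all hypothetical.

```python
from typing import Callable, List, Tuple


def iterative_search(make_question: Callable[[int, int], str],
                     answer_score: Callable[[str], float],
                     start_difficulty: int = 1,
                     max_rounds: int = 10,
                     threshold: float = 5.0) -> Tuple[List[Tuple[int, float]], List[str]]:
    """Adapt question difficulty to the target model (hypothetical sketch).

    make_question(difficulty, round_idx) stands in for the Questioner;
    answer_score(question) stands in for querying the target model and
    having the Assessor score the reply. Returns the (difficulty, score)
    trace per round and the questions that exposed a weakness.
    """
    trace: List[Tuple[int, float]] = []
    failures: List[str] = []
    difficulty = start_difficulty
    for round_idx in range(max_rounds):
        question = make_question(difficulty, round_idx)
        score = answer_score(question)
        trace.append((difficulty, score))
        if score < threshold:
            # The model failed: record the question and keep probing at this level.
            failures.append(question)
        else:
            # The model coped: ask something harder next round.
            difficulty += 1
    return trace, failures


# Toy stand-ins: the "model" handles difficulty up to 3 and fails beyond it.
def toy_question(difficulty: int, round_idx: int) -> str:
    return f"level-{difficulty} question #{round_idx}"


def toy_score(question: str) -> float:
    level = int(question.split("-")[1].split()[0])
    return 9.0 if level <= 3 else 2.0


trace, failures = iterative_search(toy_question, toy_score, max_rounds=6)
print(trace)     # [(1, 9.0), (2, 9.0), (3, 9.0), (4, 2.0), (4, 2.0), (4, 2.0)]
print(failures)  # ['level-4 question #3', 'level-4 question #4', 'level-4 question #5']
```

Raising difficulty only after a success concentrates later rounds on the level where the model breaks down. The per-round score trace is the kind of quantity Figure 3 follows across the search.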
