StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

TL;DR

This work introduces StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. It also proposes several simple yet effective baselines for improving LLMs on StreamBench and provides a comprehensive analysis to identify the components critical to successful streaming strategies.

Abstract

Recent work has shown that large language model (LLM) agents can improve themselves from experience, an important ability for continuous enhancement after deployment. However, existing benchmarks primarily evaluate agents' innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment in which LLMs receive a continuous stream of feedback and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify the components critical to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios. Source code: https://github.com/stream-bench/stream-bench. Benchmark website: https://stream-bench.github.io.
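
The online setting described in the abstract can be illustrated with a small, self-contained sketch. It is not the StreamBench API: the class `EliminationAgent`, the function `run_stream`, and the toy data are hypothetical names introduced here, and the feedback is assumed to be a binary correctness signal. The agent sees one input at a time, predicts, receives feedback, and updates its own state before the next input arrives.

```python
from typing import Dict, List, Sequence, Set, Tuple


class EliminationAgent:
    """Hypothetical toy agent that learns only from binary feedback.

    It remembers confirmed answers and rules out answers marked wrong,
    so repeated inputs are answered more accurately over time.
    """

    def __init__(self, labels: Sequence[str]) -> None:
        self.labels = list(labels)
        self.known: Dict[str, str] = {}          # input -> confirmed label
        self.rejected: Dict[str, Set[str]] = {}  # input -> labels known to be wrong

    def predict(self, x: str) -> str:
        if x in self.known:
            return self.known[x]
        rejected = self.rejected.get(x, set())
        for label in self.labels:
            if label not in rejected:
                return label
        return self.labels[0]  # fallback if every label was rejected

    def update(self, x: str, y_hat: str, correct: bool) -> None:
        if correct:
            self.known[x] = y_hat
        else:
            self.rejected.setdefault(x, set()).add(y_hat)


def run_stream(agent: EliminationAgent, stream: Sequence[Tuple[str, str]]) -> List[float]:
    """Run the agent over (input, gold label) pairs in order.

    The gold label is used only to compute the binary feedback signal
    (an assumption of this sketch). Returns the running accuracy after
    each time step.
    """
    running_accuracy: List[float] = []
    n_correct = 0
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = agent.predict(x)         # act on the current input
        correct = y_hat == y             # environment produces feedback
        agent.update(x, y_hat, correct)  # agent updates before the next input
        n_correct += int(correct)
        running_accuracy.append(n_correct / t)
    return running_accuracy


if __name__ == "__main__":
    toy_stream = [("a", "B"), ("a", "B"), ("a", "B"), ("c", "A")]
    agent = EliminationAgent(labels=["A", "B"])
    print(run_stream(agent, toy_stream))
```

In this toy run the agent's first answer is wrong, and the feedback lets it answer the repeated input correctly afterwards, so the running accuracy rises over the sequence. An LLM agent would update a prompt, retriever, memory, or parameters instead of a lookup table.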

Paper Structure (85 sections, 3 equations, 24 figures, 14 tables, 3 algorithms)

Figures (24)

  • Figure 1: (Left) A schematic diagram showing the online evaluation setting of StreamBench, where agents update their components ($p$, $r$, $\mathcal{M}$, or $\theta$) from an input-feedback sequence to achieve the highest final accuracy (refer to Section \ref{sec:setup} for details). (Right) Performance curve on the DDXPlus dataset on StreamBench. Agents are able to gradually improve with our proposed streaming baselines.
  • Figure 2: Correctness ablations. The y-axis denotes performance difference from zero-shot. The results are the average of three LLM endpoints. Please refer to Appendix \ref{app:ablations} for results of each LLM.
  • Figure 3: Confusion matrices of the diagnoses subset of upper respiratory tract diseases in DDXPlus.
  • Figure 4: Averaged performance and standard errors of each method on five shuffled sequences.
  • Figure 5: Averaged performance and standard errors of gpt-3.5-turbo on five shuffled sequences.
  • ...and 19 more figures