Athena: Safe Autonomous Agents with Verbal Contrastive Learning

Tanmana Sadhu; Ali Pesaranghader; Yanan Chen; Dong Hoon Yi

TL;DR

Athena improves the safety of autonomous, language-based agents by combining a Critic-driven feedback loop with verbal contrastive learning, which reuses past safe and unsafe trajectories as in-context examples to steer the agent's intermediate reasoning. Built on ToolEmu, Athena guides an Actor with step-level Critic feedback and prompts augmented with these contrastive examples. The authors also curate a safety benchmark of 80 toolkits and 180 scenarios across 8 categories. Experiments with both closed- and open-source LLMs show that verbal contrastive learning and interaction-level critiquing improve safety rates, with tradeoffs in helpfulness and cost, and with open-source models narrowing the safety gap with closed models. The work offers a practical, benchmarked path toward safer autonomous agents and suggests combining these ideas with other reasoning paradigms to improve both safety and usefulness.

Abstract

Due to emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their environments, and execute complex tasks using a selection of tools available to them. As the capabilities of the agents expand, ensuring their safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework, which leverages the concept of verbal contrastive learning, where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety while fulfilling a given task. The framework also incorporates a critiquing mechanism that guides the agent away from risky actions at every step. Furthermore, due to the lack of existing benchmarks on the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across 8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source LLMs, indicates that verbal contrastive learning and interaction-level critiquing improve the safety rate significantly.
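The step-level critiquing described in the abstract can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Step`, `Episode`, `run_episode`, the `actor` and `critic` callables, and the `max_revisions` limit are all hypothetical names introduced here. It also assumes that the Critic returns `None` when a step looks safe, and that an episode is halted if the Actor's revised step is still flagged.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    thought: str
    action: str


@dataclass
class Episode:
    instruction: str
    steps: List[Step] = field(default_factory=list)
    stopped_by_critic: bool = False


# Hypothetical signatures: an actor proposes the next step (or None when done),
# a critic returns a short feedback string (or None when the step looks safe).
Actor = Callable[[str, List[Step], Optional[str]], Optional[Step]]
Critic = Callable[[str, List[Step], Step], Optional[str]]


def run_episode(instruction: str, actor: Actor, critic: Critic,
                max_steps: int = 5, max_revisions: int = 1) -> Episode:
    episode = Episode(instruction)
    for _ in range(max_steps):
        step = actor(instruction, episode.steps, None)
        if step is None:  # the actor considers the task finished
            break
        feedback = critic(instruction, episode.steps, step)
        revisions = 0
        # Let the actor revise a flagged step a limited number of times.
        while feedback is not None and revisions < max_revisions:
            step = actor(instruction, episode.steps, feedback)
            revisions += 1
            if step is None:
                break
            feedback = critic(instruction, episode.steps, step)
        if step is None:
            break
        if feedback is not None:
            # Assumption: a step that is still flagged is never executed.
            episode.stopped_by_critic = True
            break
        episode.steps.append(step)
    return episode
```

In a real system both callables would wrap LLM calls; here they are plain functions so that the control flow can be exercised with stubs.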

Paper Structure (15 sections, 4 figures, 4 tables)

Introduction
Framework: Athena
Actor-Critic Interaction
Verbal Contrastive Learning
Curated Safety Benchmark
Experimental Evaluation
Settings
Experimental Results
Human Evaluation
Discussion
Related Works
Conclusion
Trajectory History Vector DB
Safety and Helpfulness Scores
Critic-Agent Interaction Example

Figures (4)

Figure 1: The Athena framework. We implemented the Actor and the Critic agents as well as our verbal contrastive learning paradigm alongside the emulator and evaluator components from ToolEmu.
Figure 2: An example of the Critic prompt template, given the user instruction to analyze and categorize a collection of sensitive images. The Critic is instructed to produce a brief feedback statement after considering the safety of the actions of the Actor. The previous actions in the intermediate steps are also provided to the Critic along with the thought and action at the current step, to enable better discernment.
Figure 3: An example of a Verbal Contrastive Learning prompt template. The safe and unsafe past trajectories are provided as (contrastive) examples to the Actor for the given query: "Please help me maintain a safe distance from the vehicle ahead while driving on the highway."
Figure 4: Our curated benchmark consists of 8 broad categories: AI PC, AR/VR, Tourism and Travel, Agriculture, Smart Vehicles, Wearable Devices, Medical Devices, and Smart Home and Appliances.
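
To make the prompt construction in Figure 3 concrete, the sketch below assembles a contrastive prompt from one safe and one unsafe past trajectory. It is an illustration under assumptions rather than the paper's code: `Trajectory`, `most_similar`, and `build_contrastive_prompt` are hypothetical names, the prompt wording is invented, and retrieval uses a simple bag-of-words cosine similarity as a stand-in for whatever embedding-based lookup the trajectory history store uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    query: str   # user instruction the trajectory was generated for
    text: str    # verbalized thoughts and actions
    safe: bool   # safety label assigned by an evaluator


def _bow(text: str) -> Counter:
    # Lowercased bag of words; a stand-in for a learned embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(query: str, pool: List[Trajectory], safe: bool) -> Optional[Trajectory]:
    # Pick the past trajectory with the requested label whose query is closest.
    candidates = [t for t in pool if t.safe == safe]
    if not candidates:
        return None
    q = _bow(query)
    return max(candidates, key=lambda t: _cosine(q, _bow(t.query)))


def build_contrastive_prompt(query: str, pool: List[Trajectory]) -> str:
    safe_ex = most_similar(query, pool, safe=True)
    unsafe_ex = most_similar(query, pool, safe=False)
    parts = []
    if safe_ex is not None:
        parts.append("Example of a SAFE past trajectory:\n" + safe_ex.text)
    if unsafe_ex is not None:
        parts.append("Example of an UNSAFE past trajectory:\n" + unsafe_ex.text)
    parts.append("User instruction: " + query)
    return "\n\n".join(parts)
```

Retrieving one example per label keeps the prompt short while still showing the Actor both what to imitate and what to avoid; the number of examples per label is a design choice this sketch fixes at one.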
