CompAgent: An Agentic Framework for Visual Compliance Verification

Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Sungmin Hong, Chun-Hao Liu

TL;DR

CompAgent tackles visual compliance verification with an agentic framework that coordinates a Planning Agent, a modular Tool Suite, and a CVAgent to reason over images under evolving compliance policies. The Planning Agent selects the tools relevant to a policy and uses them to gather evidence; the CVAgent then performs structured multimodal reasoning to output a rating, a violation category, and a rationale, all without labeled data or fine-tuning. On LlavaGuard and UnsafeBench, CompAgent achieves state-of-the-art performance, with an Unsafe F1 of $0.93$ on LlavaGuard and $0.76$ on UnsafeBench, surpassing both prompt-based and fine-tuned baselines. The approach demonstrates strong generalization, interpretability, and training-free adaptability, offering a scalable path for automated visual compliance in dynamic policy environments and potential extensions to video content.

Abstract

Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising, where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent Multimodal Large Language Models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools (such as object detectors, face analyzers, NSFW detectors, and captioning models) and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A compliance verification agent then integrates the image, tool outputs, and policy context to perform multimodal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving an F1 score of up to 76% and a 10% improvement over the state of the art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and robust tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.
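The plan, gather, verify flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`plan_tools`, `gather_evidence`, `run_pipeline`, `Verdict`, the stub tools and their outputs) is hypothetical, and the keyword-matching planner and rule-based reasoner are simple stand-ins for the MLLM-driven planning and verification agents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    rating: str      # e.g. "Safe" or "Unsafe"
    category: str    # violated policy category, or "None"
    rationale: str   # short explanation of the decision


# Hypothetical interfaces: a tool maps image bytes to a metadata dict,
# and a reasoner maps (image, policy, evidence) to a Verdict.
Tool = Callable[[bytes], dict]
Reasoner = Callable[[bytes, str, Dict[str, dict]], Verdict]


def plan_tools(policy: str, tool_keywords: Dict[str, List[str]]) -> List[str]:
    """Stand-in planner: select tools whose keywords occur in the policy text.

    Assumption: the real planning agent is an MLLM, not a keyword match.
    """
    text = policy.lower()
    return [name for name, words in tool_keywords.items()
            if any(word in text for word in words)]


def gather_evidence(image: bytes, selected: List[str],
                    tools: Dict[str, Tool]) -> Dict[str, dict]:
    """Run only the selected tools and collect their outputs by tool name."""
    return {name: tools[name](image) for name in selected}


def run_pipeline(image: bytes, policy: str, tools: Dict[str, Tool],
                 tool_keywords: Dict[str, List[str]],
                 reasoner: Reasoner) -> Verdict:
    """Plan -> gather evidence -> verify, in that order."""
    selected = plan_tools(policy, tool_keywords)
    evidence = gather_evidence(image, selected, tools)
    return reasoner(image, policy, evidence)


# Stub tools returning fixed, made-up metadata (no real models are called).
TOOLS: Dict[str, Tool] = {
    "object_detector": lambda img: {"objects": ["knife"]},
    "face_analyzer": lambda img: {"faces": 0},
    "nsfw_detector": lambda img: {"nsfw_score": 0.02},
}
TOOL_KEYWORDS = {
    "object_detector": ["weapon", "drug"],
    "face_analyzer": ["minor", "celebrity"],
    "nsfw_detector": ["nudity", "sexual"],
}


def stub_reasoner(image: bytes, policy: str,
                  evidence: Dict[str, dict]) -> Verdict:
    """Stand-in for the verification agent: a single hard-coded rule."""
    objects = evidence.get("object_detector", {}).get("objects", [])
    if "knife" in objects:
        return Verdict("Unsafe", "Weapons", "Object detector reported a knife.")
    return Verdict("Safe", "None", "No tool reported a violation.")


POLICY = "Images must not depict weapons or nudity."
verdict = run_pipeline(b"", POLICY, TOOLS, TOOL_KEYWORDS, stub_reasoner)
print(verdict)
```

The point of the sketch is the separation of concerns: the policy text alone determines which tools run, so changing the policy changes the evidence gathered without retraining anything.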

Paper Structure

This paper contains 29 sections, 17 figures, 3 tables, 1 algorithm.

Figures (17)

  • Figure 1: Illustration of the proposed CompAgent framework for visual compliance. Also presented for comparison are results from actual human judgment and from the state-of-the-art model, LlavaGuard (Helff et al., 2025). CompAgent leverages metadata extraction tools and reasons over metadata and image content in accordance with compliance policies to yield comprehensive decisions.
  • Figure 2: The architecture of CompAgent. Given compliance policies and an input image, it leverages provided tools to extract metadata, reasons over the policy, and iteratively refines its process until a final decision is made.
  • Figure 3: Representative examples showing CompAgent's compliance verification decisions, compared to LlavaGuard and the ground truth.
  • Figure 4: Proposed routing algorithm: the routing node directs inputs by content category, metadata is extracted through specialized tools and fused at a metadata fusion node, and the final decision is made by an MLLM using the metadata, input image, and compliance policy.
  • Figure :
  • ...and 12 more figures