CompAgent: An Agentic Framework for Visual Compliance Verification
Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Sungmin Hong, Chun-Hao Liu
TL;DR
CompAgent tackles visual compliance verification by introducing an agentic framework that coordinates a Planning Agent, a modular Tool Suite, and a CVAgent to reason over images under evolving compliance policies. The Planning Agent selects relevant tools to gather evidence, while the CVAgent performs structured multimodal reasoning to output a rating, violation category, and rationale, all without requiring labeled data or fine-tuning. On LlavaGuard and UnsafeBench, CompAgent achieves state-of-the-art performance, with an Unsafe F1 of $0.93$ on LlavaGuard and $0.76$ on UnsafeBench, significantly surpassing prompt-based and fine-tuned baselines. The approach demonstrates strong generalization, interpretability, and training-free adaptability, offering a scalable path for automated visual compliance in dynamic policy environments and potential extensions to video content.
Abstract
Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent Multimodal Large Language Models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools, such as object detectors, face analyzers, NSFW detectors, and captioning models, and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A compliance verification agent then integrates image, tool outputs, and policy context to perform multimodal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving an F1 score of up to 76% and a 10% improvement over the state of the art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and robust tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.
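
The three-stage flow described in the abstract (policy-driven tool selection, evidence gathering, then a verdict with rating, category, and rationale) can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: the names `TOOLS`, `plan_tools`, `gather_evidence`, `verify`, and `run_compliance_check` are hypothetical, the tools are stubs returning canned text, and keyword matching and a hand-written rule stand in for the MLLM-based planning and verification agents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Verdict:
    rating: str     # assumed labels: "Safe" or "Unsafe"
    category: str   # violated policy category, or "None"
    rationale: str  # short explanation of the decision


# Hypothetical tool suite: name -> (trigger keywords, tool function).
# Real tools would analyze pixels; these stubs return canned text.
TOOLS: Dict[str, Tuple[List[str], Callable[[str], str]]] = {
    "object_detector": (["weapon", "alcohol"], lambda image: "objects: knife, table"),
    "nsfw_detector": (["nudity", "sexual"], lambda image: "nsfw_score: 0.02"),
    "captioner": (["scene", "context"], lambda image: "a knife on a kitchen table"),
}


def plan_tools(policy: str) -> List[str]:
    """Stand-in for the planning agent: pick tools whose keywords occur in the policy."""
    policy_lower = policy.lower()
    return [
        name
        for name, (keywords, _) in TOOLS.items()
        if any(keyword in policy_lower for keyword in keywords)
    ]


def gather_evidence(image: str, tool_names: List[str]) -> Dict[str, str]:
    """Run each selected tool on the image and collect its textual output."""
    return {name: TOOLS[name][1](image) for name in tool_names}


def verify(policy: str, evidence: Dict[str, str]) -> Verdict:
    """Stand-in for the verification agent: one hand-written rule instead of an MLLM."""
    objects = evidence.get("object_detector", "")
    if "weapon" in policy.lower() and "knife" in objects:
        return Verdict("Unsafe", "Weapons", f"Object detector reported: {objects}")
    return Verdict("Safe", "None", "No tool evidence matched the policy.")


def run_compliance_check(image: str, policy: str) -> Verdict:
    """Plan, gather evidence, then verify."""
    selected = plan_tools(policy)
    evidence = gather_evidence(image, selected)
    return verify(policy, evidence)
```

Calling `run_compliance_check("photo.jpg", "No weapons or nudity allowed.")` in this toy setup selects the object and NSFW stubs and returns an `Unsafe` verdict in the `Weapons` category. In the framework itself, both the selection and the final judgment are made by model-based agents that also see the image, which is what lets the approach adapt to new policies without retraining.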
