FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak
TL;DR
FENCE addresses the lack of finance-specific jailbreak resources for multimodal models by introducing a bilingual Korean–English dataset that emphasizes image-grounded threats. The authors construct FENCE through a three-step pipeline that transforms benign queries into harmful ones, collects query-relevant finance images, and fuses text and imagery to generate multimodal samples for training. They show that FENCE exposes vulnerabilities across commercial and open-source VLMs, with GPT-4o showing non-negligible attack success rates, and that a baseline detector trained on FENCE achieves 99% in-distribution accuracy and strong cross-benchmark performance. Furthermore, fine-tuning small guardrail models on FENCE substantially improves defense rates, demonstrating the dataset's practical value for building safer finance AI systems. Overall, FENCE offers a robust resource for domain-specific safety evaluation and guardrail training in high-stakes financial applications.
Abstract
Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, which broadens the attack surface. However, resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean–English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99% in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.
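The two quantities reported above, attack success rate and detector accuracy, can be illustrated with a minimal sketch. It assumes the common definitions (the fraction of jailbreak attempts that elicit a harmful response, and the fraction of samples the detector labels correctly); the abstract does not give the paper's exact formulas. The `Sample` record and its field names are hypothetical and are not part of FENCE.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical evaluation record; field names are illustrative only."""
    is_jailbreak: bool      # ground-truth label of the query
    model_complied: bool    # assumed judge verdict: did the VLM produce harmful output?
    detector_flagged: bool  # detector prediction for this query


def attack_success_rate(samples: List[Sample]) -> float:
    """Fraction of jailbreak attempts on which the target model complied."""
    attacks = [s for s in samples if s.is_jailbreak]
    if not attacks:
        return 0.0
    return sum(s.model_complied for s in attacks) / len(attacks)


def detection_accuracy(samples: List[Sample]) -> float:
    """Fraction of all samples whose detector prediction matches the label."""
    if not samples:
        return 0.0
    correct = sum(s.detector_flagged == s.is_jailbreak for s in samples)
    return correct / len(samples)


# Toy data: two jailbreak attempts (one succeeds) and two benign queries
# (one wrongly flagged by the detector).
toy = [
    Sample(is_jailbreak=True, model_complied=True, detector_flagged=True),
    Sample(is_jailbreak=True, model_complied=False, detector_flagged=True),
    Sample(is_jailbreak=False, model_complied=False, detector_flagged=False),
    Sample(is_jailbreak=False, model_complied=False, detector_flagged=True),
]
print(attack_success_rate(toy))  # 0.5
print(detection_accuracy(toy))   # 0.75
```

In this sketch the attack success rate is computed over jailbreak attempts only, whereas accuracy is computed over both benign and harmful samples, so a detector that flags everything would not score well on accuracy.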
