FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim; Seonghun Jeong; Youngjun Kwak

TL;DR

FENCE addresses the lack of finance-specific jailbreak resources for multimodal models by introducing a bilingual Korean–English dataset that emphasizes image-grounded threats. The authors construct FENCE through a three-step pipeline that transforms benign queries into harmful ones, collects query-relevant finance images, and fuses text and imagery to generate multimodal samples for training. They show that FENCE exposes vulnerabilities across commercial and open-source VLMs, with GPT-4o showing non-negligible attack success rates, and that a baseline detector trained on FENCE achieves 99% in-distribution accuracy and strong cross-benchmark performance. Furthermore, fine-tuning small guardrail models on FENCE substantially improves defense rates, demonstrating the dataset's practical value for building safer finance AI systems. Overall, FENCE offers a robust resource for domain-specific safety evaluation and guardrail training in high-stakes financial applications.

Abstract

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

Paper Structure (29 sections, 4 figures, 7 tables)

Introduction
Related Work
Text-based attacks
Word substitution
Prefix and suffix manipulation
Role-playing
Image-based attacks
Query-related images
Typo & FigStep
Datasets
Open Datasets
JailBreakV-28K
FigStep
MM-SafetyBench
FENCE
...and 14 more sections

Figures (4)

Figure 1: Jailbreaking datasets are classified into Text-based Attacks (TA) and Image-based Attacks (IA) based on the location of the harmful content. Harmful content is marked with red text and outlined by a red dotted line.
Figure 2: Distribution of FENCE across 15 financial categories representing realistic use cases.
Figure 3: Workflow for constructing FENCE, consisting of three stages: (1) transforming benign queries into harmful ones using a two-step prompting setup with GPT-4o (role-playing and evaluation), (2) collecting query-relevant financial images via keyword search, and (3) fusing text and images to generate multimodal jailbreak samples.
Figure 4: t-SNE visualization of harmful queries from existing datasets and FENCE using embeddings from the text-embedding-3-small model. FENCE’s queries exhibit a higher degree of semantic overlap with benign queries, suggesting that distinguishing harmful from benign inputs is more challenging for jailbreak detection systems.
