SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro

TL;DR

This work addresses the robustness of reasoning segmentation models to semantically equivalent yet adversarial paraphrases. It introduces SPARTA, a black-box method that optimizes paraphrases in the latent space of a pretrained text autoencoder (SONAR) via reinforcement learning to maximize segmentation degradation measured by IoU drops. An automatic evaluation protocol, augmented by LLM-based paraphrase detection and semantic similarity filtering, validates paraphrase quality and attack effectiveness; human studies further align automatic scoring with judgments of validity. Across ReasonSeg and LLMSeg-40k, SPARTA outperforms baselines by up to 2x and reveals that current reasoning segmentation models remain vulnerable to carefully crafted paraphrases under strict grammatical and semantic constraints. The work provides a foundation for evaluating and improving the robustness of multimodal vision-language systems, with implications for safer and more reliable AI deployments.

Abstract

Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases, which are crucial in real-world applications where users express the same intent in varied ways, remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.

Paper Structure

This paper contains 46 sections, 13 equations, 15 figures, 8 tables, 1 algorithm.

Figures (15)

  • Figure 1: Example of an adversarial paraphrase generated by our proposed SPARTA method. SPARTA produces grammatically correct paraphrases that preserve the original semantic content while significantly degrading segmentation performance.
  • Figure 2: Success rate (SR) as a function of IoU-drop threshold for adversarial paraphrases with LLM score 5. Results are shown for the LISA-7B model on the ReasonSeg dataset (left) and LLMSeg-40k dataset (right).
  • Figure 3: Examples of adversarial paraphrases obtained using the proposed SPARTA method. SPARTA produces grammatically correct paraphrases that preserve the original query meaning while substantially degrading segmentation performance.
  • Figure 4: Scatter plot of SONAR embedding dimension 654 versus tokenized text length. A strong negative correlation ($r=-0.956$, $R^2=0.913$) shows that this dimension encodes sequence length, with shorter sentences having higher embedding values. The red line indicates a linear fit.
  • Figure 5: t-SNE projections of sentence embeddings from two encoders. Upper: CLIP encoder; bottom: SONAR encoder. Each grid contains four panels for sentences of length $\le\{20,25,30,35\}$ words. Colours designate paraphrase groups: sentences sharing the same hue are semantically equivalent variants of one another. See Figure \ref{fig:sonar_clip} for quantitative cluster quality. Since DeCap and GVAE exhibit extremely low restoration quality (Table \ref{tab:restore}), their embedding spaces are omitted from visualization.
  • ...and 10 more figures
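
The Figure 2 caption describes success rate (SR) as a function of an IoU-drop threshold, restricted to paraphrases with LLM score 5. A minimal sketch of how such a curve could be computed is given below. It is not the authors' implementation: the function names, the use of an absolute (rather than relative) IoU drop, the choice of all queries as the denominator, and the convention for empty masks are all assumptions made for illustration.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def success_rate(
    iou_orig: Sequence[float],
    iou_adv: Sequence[float],
    llm_scores: Sequence[int],
    threshold: float,
    min_score: int = 5,
) -> float:
    """Fraction of all queries whose paraphrase is judged valid
    (LLM score >= min_score) and lowers IoU by at least `threshold`.

    The absolute drop (iou_orig - iou_adv) and the denominator are
    assumptions; the paper may define either differently.
    """
    if not iou_orig:
        return 0.0
    hits = sum(
        1
        for o, a, s in zip(iou_orig, iou_adv, llm_scores)
        if s >= min_score and (o - a) >= threshold
    )
    return hits / len(iou_orig)


# Toy example: one ground-truth mask, predictions before and after paraphrasing.
gt = [[1, 1], [0, 0]]
pred_orig = [[1, 1], [0, 0]]
pred_adv = [[1, 0], [0, 0]]

# Hypothetical per-query IoU values and LLM validity scores.
ious_before = [iou(pred_orig, gt), 0.8, 0.6]
ious_after = [iou(pred_adv, gt), 0.7, 0.1]
scores = [5, 5, 4]

# Sweeping the threshold yields a curve of the kind Figure 2 describes.
curve = {t: success_rate(ious_before, ious_after, scores, t) for t in (0.05, 0.3)}
```

In this toy data the third query has the largest IoU drop but is excluded because its LLM score is below 5, which illustrates why the validity filter and the degradation criterion have to be applied jointly.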