Failures Are Fated, But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models

Som Sagar, Aditya Taparia, Ransalu Senanayake

TL;DR

The paper tackles the problem of unwanted failures in large-scale vision and language models by introducing a post hoc failure discovery framework that uses deep reinforcement learning to map the failure landscape across actionable concepts $C$ and maximize a discrepancy measure $\Delta$ subject to a threshold $\epsilon$. It presents macroscopic and microscopic exploration strategies, three task-specific environments (image classification, text summarization, image generation), and methods for incorporating limited human feedback to guide restructuring via targeted fine-tuning (final-layer updates, HF-tuned summarizers, and LoRA-based diffusion models). Key contributions include formalizing failure under concept sets, demonstrating scalable RL-based discovery in high-dimensional spaces, and showing that structured fine-tuning can shift or reduce prominent failures while analyzing trade-offs with Wasserstein distance metrics and bias reduction. The work offers a practical, interpretable pipeline for pre-deployment auditing and post hoc remediation across CV, NLP, and VLM tasks, with potential impact on policy and governance workflows by providing actionable failure mappings and mitigation strategies.

Abstract

In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug and legislative bodies to audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model's failure. In this paper, we introduce a post-hoc method that utilizes deep reinforcement learning to explore and construct the landscape of failure modes in pre-trained discriminative and generative models. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically show the effectiveness of the proposed method across common Computer Vision, Natural Language Processing, and Vision-Language tasks.
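
The discovery step can be illustrated with a toy search over a small concept grid. The sketch below is not the paper's method: it swaps deep RL for a tabular epsilon-greedy bandit, and the concept names (`brightness`, `rotation`), the `discrepancy` function, and the threshold value are all hypothetical stand-ins for the concept set $C$, the discrepancy measure $\Delta$, and the threshold $\epsilon$ mentioned in the TL;DR. It only shows the general idea of steering exploration toward concept combinations where the discrepancy is large.

```python
import itertools
import random

# Hypothetical concept set C: each concept is an actionable knob with discrete levels.
CONCEPTS = {
    "brightness": [0, 1, 2, 3],
    "rotation": [0, 1, 2, 3],
}
EPSILON = 0.5  # assumed failure threshold on the discrepancy


def discrepancy(action):
    """Toy stand-in for Delta, the gap between desired and actual model output.

    A real audit would apply the concept values to inputs and query the model.
    Here the pretend model degrades when both concepts take high values.
    """
    brightness, rotation = action
    return (brightness * rotation) / 9.0


def discover_failures(steps=2000, explore=0.2, seed=0):
    """Epsilon-greedy search that favours concept combinations with large Delta."""
    rng = random.Random(seed)
    actions = list(itertools.product(*CONCEPTS.values()))
    value = {a: 0.0 for a in actions}  # running mean of Delta per combination
    count = {a: 0 for a in actions}    # number of visits per combination
    for _ in range(steps):
        if rng.random() < explore:
            a = rng.choice(actions)                   # explore a random combination
        else:
            a = max(actions, key=lambda x: value[x])  # exploit the current estimate
        reward = discrepancy(a)
        count[a] += 1
        value[a] += (reward - value[a]) / count[a]
    # Failure landscape: visited combinations whose estimated Delta exceeds the threshold.
    return {a: v for a, v in value.items() if count[a] > 0 and v > EPSILON}


failures = discover_failures()
```

In this toy setting the returned dictionary maps concept combinations to their estimated discrepancy, which is the kind of summary a user could inspect before choosing which failures to mitigate. A tabular bandit does not scale to the high-dimensional spaces the paper targets; it is used here only because it fits in a few lines.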

Paper Structure

This paper contains 35 sections, 11 equations, 35 figures, 4 tables, 1 algorithm.

Figures (35)

  • Figure 1: There are three main steps in the proposed failure discovery and mitigation framework. 1. Discover: We propose a deep RL-based method to explore the failure landscape with microscopic and macroscopic exploration strategies. It will discover regions where the model works and fails, with varying levels of confidence. 2. Summarize: Results are qualitatively and quantitatively summarized for the user to indicate preferences. 3. Restructure: Based on the user's preferences from the previous stage, the model can be fine-tuned to mitigate the failure modes or shift them away to unlikely regions. The center image shows images generated by Stable Diffusion v1-4 for the prompt "Create an image of a distinctive <artist> analyzing data on a computer in a <research center>". A user selects the most likely failure in terms of image quality from the summary report. The fine-tuned model, based on user preferences, has generated more naturalistic images.
  • Figure 2: a) A visualization of the failure landscape. b) We can observe sample failures and obtain quantitative distances. We see a shift in the failure mode (yellow) after fine-tuning.
  • Figure 3: Number of failures vs. steps for different classification models. After fine-tuning, the method finds fewer failures. The most accurate model, EfficientNet, shows the smallest difference after fine-tuning.
  • Figure 4: Failure mode shifts in (a) EfficientNet, (b) BART, and (c) Stable Diffusion v1-4 after fine-tuning.
  • Figure 5: Improving gender bias
  • ...and 30 more figures
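
The failure-mode shifts in Figures 2 and 4 are described with quantitative distances, and the TL;DR mentions Wasserstein distance metrics. This summary does not say which distributions the paper compares, so the sketch below is only a one-dimensional illustration under assumed data: `before` and `after` are hypothetical failure locations along a single concept axis, and `wasserstein_1d` is a helper written for this example. For two equal-size samples with uniform weights, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("samples must be non-empty and of equal size")
    # Optimal transport in 1-D matches the i-th smallest point of one sample
    # with the i-th smallest point of the other.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Hypothetical failure locations along one concept axis (illustrative values only).
before = [0.8, 0.9, 1.0, 0.7]
after = [0.3, 0.4, 0.5, 0.2]
shift = wasserstein_1d(before, after)
```

A larger value indicates that the failure mass has moved further from where it was before fine-tuning. For samples of unequal size or for weighted samples, a library routine such as `scipy.stats.wasserstein_distance` would be the more appropriate choice.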

Theorems & Definitions (2)

  • Definition 2.1: Failure
  • Definition 4.1: Reduced Failures