
T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Aditi Naiknaware, Salimeh Sekeh

Abstract

Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally consistent framework for multimodal OOD detection in non-stationary environments.
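
The abstract names Average Thresholded Confidence (ATC) as the basis of the regularizer but does not spell out how it enters the loss. As background only, the following is a minimal sketch of ATC in its usual formulation: a confidence threshold is chosen on labeled validation data so that the fraction of examples below it matches the validation error rate, and accuracy on unlabeled data is then estimated as the fraction of confidences at or above that threshold. The function names and the use of plain Python lists are assumptions made for illustration, not the authors' implementation.

```python
def atc_threshold(val_confidences, val_correct):
    """Choose a threshold so that the share of validation confidences
    strictly below it equals the validation error rate (assumes no ties)."""
    n_errors = sum(1 for ok in val_correct if not ok)
    ranked = sorted(val_confidences)
    if n_errors >= len(ranked):
        # Every validation prediction is wrong: nothing should pass.
        return float("inf")
    # Exactly n_errors values lie strictly below ranked[n_errors].
    return ranked[n_errors]


def atc_estimated_accuracy(target_confidences, threshold):
    """Estimate accuracy on unlabeled data as the share of examples
    whose confidence reaches the threshold."""
    passed = sum(1 for c in target_confidences if c >= threshold)
    return passed / len(target_confidences)


# Hypothetical confidences (e.g. max softmax probability) and correctness flags.
val_conf = [0.9, 0.8, 0.3, 0.6, 0.95]
val_ok = [True, True, False, True, True]
threshold = atc_threshold(val_conf, val_ok)
estimate = atc_estimated_accuracy([0.5, 0.7, 0.65, 0.2], threshold)
```

Because the estimate needs no target labels, it can be tracked from one timestep to the next, which is presumably what makes it usable as a stability signal under drift.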

Paper Structure (24 sections, 8 theorems, 49 equations, 4 figures, 10 tables, 4 algorithms)

Key Result

Theorem 1

(Main Theorem) Let $\mathbb{P}^{t,cov}$ and $\mathbb{P}_{test}^{t,sem}$ be the covariate-shifted OOD and semantic OOD distributions, respectively. Denote by $GErr_{t+1}(f)$ the generalization error at time $t+1$. Let $\mathcal{L}_{reg}$ be the OOD detection loss devised for MSP detectors (Hendrycks et al., 2019). Here $C_{t\rightarrow t+1}=C_{t+1}-C_t+B_t+Z_t$ and $\delta_t$ are constants and $\overline{\delta}_

Figures (4)

  • Figure 1: T-QPM Overview: At each timestep, ID images and their covariate-shifted views are encoded to build timestep-specific visual prototypes alongside a fixed ID Text Bank. At inference, four cross-modal scores between the test image, caption, and ID representations are fused to produce the final OOD decision.
  • Figure 2: ID classification accuracy on clean (left) and Gaussian blur-shifted (right) test sets across all timesteps. T-QPM consistently outperforms DPM under both conditions, with the gap widening substantially under covariate shift. ID dataset: Clear100; OOD dataset: COCO with captions.
  • Figure 3: ID classification accuracy on clean (left) and JPEG-shifted (right) test sets across all timesteps. T-QPM consistently outperforms DPM under both conditions, with the gap widening substantially under covariate shift. ID dataset: Clear100; OOD dataset: COCO with captions.
  • Figure 4: Hyperparameter sweeps (AUROC) for $\beta$ (left), $\eta$ (center), and $\gamma_{\text{cap}}$ (right).
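
The Figure 1 caption describes fusing four cross-modal scores between the test image, its caption, and the ID representations. The sketch below illustrates one way such a fusion could look, under loudly stated assumptions: the four scores are taken to be maximum cosine similarities (image to ID text bank, image to visual prototypes, caption to ID text bank, caption to visual prototypes), and the fusion is a normalized weighted sum with weights supplied by the caller. Neither the exact score definitions nor the fusion rule is given in this excerpt, so every name here is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_similarity(query, bank):
    """Best cosine match between a query embedding and a bank of references."""
    return max(cosine(query, ref) for ref in bank)


def fused_id_score(image_emb, caption_emb, text_bank, visual_prototypes, weights):
    """Weighted average of four assumed cross-modal scores.
    A higher value means the input looks more in-distribution."""
    scores = [
        max_similarity(image_emb, text_bank),            # image vs. ID text
        max_similarity(image_emb, visual_prototypes),    # image vs. ID visual prototypes
        max_similarity(caption_emb, text_bank),          # caption vs. ID text
        max_similarity(caption_emb, visual_prototypes),  # caption vs. ID visual prototypes
    ]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Toy 2-D embeddings; real CLIP embeddings are much higher-dimensional.
text_bank = [[1.0, 0.0], [0.0, 1.0]]
visual_prototypes = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, 1.0, 1.0, 1.0]  # placeholders for learned fusion weights

id_like = fused_id_score([1.0, 0.0], [1.0, 0.0], text_bank, visual_prototypes, weights)
ood_like = fused_id_score([1.0, 1.0], [1.0, 1.0], text_bank, visual_prototypes, weights)
```

In a temporal setting, the visual prototypes and the fusion weights would be refreshed at each timestep while the text bank stays fixed, as the caption indicates; thresholding the fused score then gives the ID/OOD decision.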

Theorems & Definitions (8)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Theorem 2