Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search

Bo Ma; Jinsong Wu; Wei Qi Yan

Abstract

Privacy-preserving learning systems often inject noise into hierarchical visual representations; a central challenge is to \emph{model} how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision--language models (VLMs). We propose \emph{Bodhi VLM}, a \emph{privacy-alignment modeling} framework for \emph{hierarchical neural representations}: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable \emph{budget-alignment signal} by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale $c/\epsilon$). The output is reference-relative and is \emph{not} a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the \emph{visual encoders} of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends, and EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1). Results are reported as mean$\pm$std over multiple seeds, with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than only a post hoc audit. Source code: \href{https://github.com/mabo1215/bodhi-vlm.git}{Bodhi-VLM GitHub repository}
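To make the idea of a reference-relative signal concrete, the following minimal Python sketch compares a fitted Laplace scale against an evaluator-specified reference scale $c/\epsilon$. It is an illustration under stated assumptions, not the authors' EMPA module: the EM step and the BUA/TDA feature grouping are omitted, the noise residuals are assumed to be directly observable, and the function names (sample_laplace, fit_laplace_scale, relative_deviation) are hypothetical.

```python
import random
import statistics


def sample_laplace(scale, n, rng):
    """Draw n zero-mean Laplace(scale) samples.

    The difference of two i.i.d. exponentials with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]


def fit_laplace_scale(residuals):
    """Maximum-likelihood Laplace scale: mean absolute deviation from the median."""
    med = statistics.median(residuals)
    return sum(abs(r - med) for r in residuals) / len(residuals)


def relative_deviation(residuals, c, epsilon):
    """Reference-relative signal |b_hat - b_ref| / b_ref with b_ref = c / epsilon.

    Assumption: this simple ratio stands in for a budget-alignment signal.
    It is not a formal differential-privacy estimate.
    """
    b_ref = c / epsilon
    b_hat = fit_laplace_scale(residuals)
    return abs(b_hat - b_ref) / b_ref


if __name__ == "__main__":
    rng = random.Random(0)
    c, epsilon = 1.0, 0.5  # illustrative values; reference scale is 2.0
    aligned = sample_laplace(c / epsilon, 20000, rng)
    too_small = sample_laplace(0.5 * c / epsilon, 20000, rng)
    print("aligned noise deviation:  ", relative_deviation(aligned, c, epsilon))
    print("half-scale noise deviation:", relative_deviation(too_small, c, epsilon))
```

A value near zero indicates that the observed noise matches the reference scale. Noise injected at half the reference scale gives a deviation near 0.5.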

Paper Structure (50 sections, 10 equations, 10 figures, 8 tables, 2 algorithms)

Introduction
Scope and relevance to TNNLS.
Related Work
Differential Privacy and Privacy Auditing
Hierarchical and Multimodal Representations
Privacy in Vision and Vision-Language Models
External Comparison Methods Used in This Work
Expectation-Maximization and Microaggregation
Method
Formal Problem Definition
Inputs and output (reference-relative).
Preliminary and Notation
Sensitive Concepts and Scores
Model Structure
Bottom-Up Strategy (BUA)
...and 35 more sections

Figures (10)

Figure 1: BUA in a YOLO-style detector: bottom-up aggregation and feedback weights for sensitive regions.
Figure 2: TDA and BUA in VLM: top-down and bottom-up feature search over the vision encoder layers (e.g., ViT blocks); layer-wise groups $\mathcal{G}^{(n)}_i$, $\mathcal{G}^{(s)}_i$ feed EMPA for feature-level budget-alignment assessment.
Figure 3: BUA vs. TDA deviation on YOLO (MOT20).
Figure 4: BUA vs. TDA deviation on PPDPTS (MOT20).
Figure 5: Deviation: TDA+EMPA vs. BUA+EMPA (MOT20, PPDPTS $\epsilon=0.001$).
...and 5 more figures
