FreeSliders: Training-Free, Modality-Agnostic Concept Sliders for Fine-Grained Diffusion Control in Images, Audio, and Video

Rotem Ezra, Hedi Zisling, Nimrod Berman, Ilan Naiman, Alexey Gorkor, Liran Nochumsohn, Eliya Nachmani, Omri Azencot

TL;DR

FreeSliders introduces a training-free, modality-agnostic approach to fine-grained diffusion control by partially estimating the Concept Sliders (CS) update during inference, avoiding per-concept training and architecture-specific fine-tuning. The authors extend the existing CS benchmark to video and audio, and propose three modality-agnostic slider properties (range, smoothness, and preservation) along with new metrics SP, CR, and CSM to better capture perceptual and semantic changes. To address scale selection and non-linear traversal, they propose Automatic Saturation and Traversal Detection (ASTD), a two-stage procedure that detects saturation points and reparameterizes traversal for perceptually uniform edits, improving performance across modalities. Extensive experiments across image, video, and audio demonstrate plug-and-play, training-free concept control that outperforms baselines while maintaining practicality, with ASTD providing notable gains in multi-modal controllable generation. The work emphasizes principled evaluation, cross-modal applicability, and consideration of scalability and ethics in deploying such fine-grained controllable diffusion tools.

Abstract

Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling fine-grained controllable generation, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work we introduce FreeSliders, a simple yet effective approach that is fully training-free and modality-agnostic, achieved by partially estimating the CS formula during inference. To support modality-agnostic evaluation, we extend the CS benchmark to include both video and audio, establishing the first suite for fine-grained concept generation control with multiple modalities. We further propose three evaluation properties along with new metrics to improve evaluation quality. Finally, we identify an open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation. An interactive presentation of our benchmark and method is available at: https://azencot-group.github.io/FreeSliders/
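The two-stage idea in the abstract (detect where the slider saturates, then reparameterize the traversal) can be illustrated with a small sketch. This is not the authors' ASTD procedure, whose details are not given in this section. It assumes a scalar concept response per scale (a stand-in for the r(x, η) curve mentioned in the Figure 4 caption), a hypothetical flatness tolerance `tol`, and hypothetical helper names `detect_saturation` and `reparameterize`. The tanh response in the example is synthetic.

```python
import numpy as np


def detect_saturation(etas, response, tol=0.05):
    """Index of the scale after which no step changes the response notably.

    Assumption: a step is "flat" when it adds less than `tol` of the
    total response range. `tol` is a hypothetical knob, not from the paper.
    """
    response = np.asarray(response, dtype=float)
    total = response.max() - response.min()
    if total == 0.0:
        return 0  # the concept never responds; nothing to traverse
    steps = np.abs(np.diff(response)) / total
    active = np.nonzero(steps >= tol)[0]
    # Step i lies between samples i and i + 1, so keep sample i + 1.
    return int(active[-1] + 1) if active.size else 0


def reparameterize(etas, response, sat_idx, num_steps):
    """Choose scales up to the saturation point with evenly spaced responses."""
    etas = np.asarray(etas, dtype=float)[: sat_idx + 1]
    resp = np.asarray(response, dtype=float)[: sat_idx + 1]
    # Make the curve strictly increasing so it can be inverted by interpolation:
    # a running maximum removes dips, a tiny ramp breaks exact ties.
    resp = np.maximum.accumulate(resp) + 1e-9 * np.arange(resp.size)
    targets = np.linspace(resp[0], resp[-1], num_steps)
    return np.interp(targets, resp, etas)


# Synthetic example: a response that rises quickly and then flattens.
etas = np.linspace(0.0, 5.0, 11)
response = np.tanh(etas)

sat_idx = detect_saturation(etas, response)
new_etas = reparameterize(etas, response, sat_idx, num_steps=5)
print("saturation scale:", etas[sat_idx])
print("reparameterized scales:", new_etas)
```

With this synthetic curve, the reparameterized scales are packed more densely where the response changes quickly and spread out where it changes slowly, which is the intended effect of a traversal that is uniform in response rather than in η.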

Paper Structure

This paper contains 65 sections, 9 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Sliders: (a) image: gradually increasing/decreasing smile intensity; (b) video: progressively increasing ocean waviness in a sailing scene; (c) audio: spectrograms of a cat’s meow with rising energy in dominant frequencies from top to bottom, indicating successful concept control.
  • Figure 2: Examples illustrating limitations of the $\Delta$CLIP metric (black boxes): a high score despite abrupt transitions and flat regions at the slider ends (left), and a lower score despite a clear and smooth concept change across intervals (right). Blue and red boxes denote the CR and CSM metrics, respectively, introduced in Sec. \ref{sec:sliders_metrics}.
  • Figure 3: Default CS scales often yield suboptimal results as shown in the top row, while the bottom row shows ASTD-optimized scales and step sizes. Black boxes mark sampled $\eta$ values.
  • Figure 4: Automatic Saturation and Traversal Detection (ASTD) for concept "smiling". Left: saturation detection via $r(x,\eta)$. Right: traversal reparameterization for smoother progression.
  • Figure 5: Video sliders for the concept "mountain hikers" with different backbones.
  • ...and 9 more figures