FreeSliders: Training-Free, Modality-Agnostic Concept Sliders for Fine-Grained Diffusion Control in Images, Audio, and Video
Rotem Ezra, Hedi Zisling, Nimrod Berman, Ilan Naiman, Alexey Gorkor, Liran Nochumsohn, Eliya Nachmani, Omri Azencot
TL;DR
FreeSliders is a training-free, modality-agnostic approach to fine-grained diffusion control: it estimates the Concept Sliders (CS) update at inference time, avoiding per-concept training and architecture-specific fine-tuning. The authors extend the existing CS benchmark to video and audio and propose three modality-agnostic slider properties (range, smoothness, and preservation), along with new metrics SP, CR, and CSM to better capture perceptual and semantic changes. To address scale selection and non-linear traversal, they propose Automatic Saturation and Traversal Detection (ASTD), a two-stage procedure that detects saturation points and reparameterizes the traversal for perceptually uniform edits. Experiments across image, video, and audio show plug-and-play, training-free concept control that outperforms baselines, with ASTD providing notable gains in multi-modal controllable generation. The work emphasizes principled evaluation, cross-modal applicability, and consideration of scalability and ethics in deploying such fine-grained controllable diffusion tools.
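The two-stage idea behind ASTD can be illustrated with a small sketch. This is not the authors' implementation: the function names, the relative-threshold saturation rule, and the piecewise-linear inversion are all assumptions made for illustration. It assumes a perceptual change has already been measured between outputs at consecutive slider scales, then (1) finds where those changes flatten out and (2) picks new scales whose cumulative perceptual change is evenly spaced.

```python
from bisect import bisect_right


def detect_saturation(step_changes, rel_tol=0.05):
    """Stage 1 (hypothetical rule): index of the scale where edits saturate.

    step_changes[i] is the measured perceptual change between the outputs at
    scales[i] and scales[i + 1]. Starting from the largest change, the first
    step that falls below rel_tol * peak marks saturation.
    """
    peak = max(step_changes)
    start = step_changes.index(peak)
    for i in range(start, len(step_changes)):
        if step_changes[i] < rel_tol * peak:
            return i
    return len(step_changes)  # never saturated: keep the full range


def uniform_scales(scales, step_changes, num_steps, rel_tol=0.05):
    """Stage 2 (hypothetical): scales giving equal perceptual change per step."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    k = detect_saturation(step_changes, rel_tol)
    if k == 0:
        return [float(scales[0])] * num_steps

    # Cumulative perceptual change up to the saturation point.
    cum = [0.0]
    for change in step_changes[:k]:
        cum.append(cum[-1] + change)

    result = []
    for j in range(num_steps):
        target = cum[-1] * j / (num_steps - 1)
        # Locate the segment containing the target and interpolate linearly.
        i = max(min(bisect_right(cum, target) - 1, k - 1), 0)
        seg = cum[i + 1] - cum[i]
        frac = (target - cum[i]) / seg if seg > 0 else 0.0
        result.append(scales[i] + frac * (scales[i + 1] - scales[i]))
    return result
```

For example, with scales `[0, 1, 2, 3, 4]` and step changes `[1.0, 3.0, 1.0, 0.01]`, the last step contributes almost nothing, so the sketch treats scale 3 as the saturation point and spreads the requested steps over the range 0 to 3.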
Abstract
Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling fine-grained controllable generation, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work we introduce FreeSliders, a simple yet effective approach that is fully training-free and modality-agnostic, achieved by partially estimating the CS formula during inference. To support modality-agnostic evaluation, we extend the CS benchmark to include both video and audio, establishing the first suite for fine-grained concept generation control with multiple modalities. We further propose three evaluation properties along with new metrics to improve evaluation quality. Finally, we identify an open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation. An interactive presentation of our benchmark and method is available at: https://azencot-group.github.io/FreeSliders/
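To make the idea of estimating the CS formula at inference concrete, here is a minimal sketch. It assumes the slider update takes the form used in the original Concept Sliders work, where the noise prediction for a target prompt is shifted by a scaled difference between predictions for a positive and a negative contrast prompt. The abstract does not spell out the exact estimator, so `slider_noise`, its parameters, and the toy denoiser below are hypothetical stand-ins rather than the authors' code.

```python
from typing import Callable, List, Sequence

# A denoiser maps (noisy sample, text prompt, timestep) to a noise prediction.
# Flat lists of floats stand in for image, audio, or video latents.
Denoiser = Callable[[Sequence[float], str, int], List[float]]


def slider_noise(
    denoiser: Denoiser,
    x_t: Sequence[float],
    t: int,
    target_prompt: str,
    positive_prompt: str,
    negative_prompt: str,
    scale: float,
) -> List[float]:
    """Assumed update: eps(target) + scale * (eps(positive) - eps(negative)).

    All three predictions come from the same frozen model, so no weights are
    trained or modified; the cost is extra forward passes per sampling step.
    """
    eps_target = denoiser(x_t, target_prompt, t)
    eps_pos = denoiser(x_t, positive_prompt, t)
    eps_neg = denoiser(x_t, negative_prompt, t)
    return [e + scale * (p - n) for e, p, n in zip(eps_target, eps_pos, eps_neg)]


def toy_denoiser(x_t: Sequence[float], prompt: str, t: int) -> List[float]:
    """Toy stand-in for a real diffusion model, used only to exercise the code."""
    offsets = {"person": 0.0, "old person": 1.0, "young person": -1.0}
    return [x + offsets[prompt] for x in x_t]
```

Because the sketch only calls the denoiser as a black box, the same function would apply to any modality whose model exposes a prompt-conditioned noise prediction. A scale of 0 recovers the unedited prediction, and negative scales push toward the negative prompt.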
