NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

Taewon Kang; Ming C. Lin

TL;DR

This work presents a formal treatment of linguistic negation in diffusion-based generative models, modeling negation as a structured feasibility constraint on semantic guidance within the diffusion dynamics. It establishes the first unified formulation of linguistic negation in such models that goes beyond representation-level evaluation.

Abstract

Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, to support further research in this area, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.

Paper Structure (55 sections, 35 equations, 32 figures, 6 tables)

Introduction
Related Work
Negation and Compositionality in Vision-Language Models
Inference-Time Control and Constrained Diffusion
Method
Controlled Diffusion Dynamics
Semantic Decomposition of Linguistic Negation
Negation as Convex Feasibility
Minimal-Energy Projection
Stability and Convergence Properties
Temporal Scheduling
Unified Treatment of Linguistic Cases
Experiments and Results
Implementation Details
Benchmarking Datasets
...and 40 more sections

Figures (32)

Figure 1: Negation beyond object removal. Structural Functional Negation (SFN), Double Negation Sensitivity (DNS), and Scoped Negation Disambiguation (SND). Baseline diffusion models collapse negation into semantic inversion or scope errors, whereas our model enforces negation as a constraint on diffusion dynamics, yielding correct logical interpretation and stable scene composition. Qualitative results and comparisons with more video diffusion models are provided in Section \ref{sec:qualitative}.
Figure 2: Overview of our negation-aware framework via convex feasibility projection. Given a natural-language prompt containing linguistic negation (e.g., "a person holding a phone but not using it"), we first perform semantic decomposition into affirmed concepts $y^{+}$, negated concepts $y^{-}$, and scope structure $\mathcal{S}$. In classifier-free guidance (CFG), the reference guidance increment is defined as $\delta_{\mathrm{ref}} = \gamma(\epsilon_{\mathrm{text}} - \epsilon_{\mathrm{uncond}})$, which attracts the diffusion trajectory toward affirmed semantics but does not constrain negated variables. We construct a negation direction $a_t = \epsilon_{\mathrm{neg}} - \epsilon_{\mathrm{uncond}}$, representing the semantic increment that increases alignment with the negated concept. At each reverse-time step ($t=1 \rightarrow 0$), we enforce a half-space constraint $a_t^\top \delta \le b_t$ in guidance space and project the reference increment onto the feasible region, producing the corrected update $\delta_t^*$. A temporal scheduling strategy progressively tightens the constraint threshold $b_t$, allowing early structural formation while enforcing strict negation at later stages. The resulting denoising trajectory yields negation-compliant video generation without retraining the text-to-video diffusion model.
Figure 3: Representative qualitative comparison (SFN). We compare our model with state-of-the-art diffusion baselines (Mochi, HunyuanVideo, CogVideoX) on Structural Functional Negation (SFN): "A person holding a phone but not using it." While baselines often collapse the negation into unintended interaction (e.g., phone-using gestures) or fail to maintain the intended constraint, our model preserves the object presence and suppresses the prohibited behavior, demonstrating negation control beyond object removal.
Figure 4: Ablation study on negation control. We compare our full model with two ablations (w/o Repulsive Energy and w/o Constraint Scheduling) and a strong diffusion baseline (Mochi) on a double-negation prompt (DNS): "A stage that is not unlit." The full model produces stable, correctly lit stage scenes. Removing repulsive energy weakens negation enforcement and introduces temporal instability (e.g., lighting flicker), while removing constraint scheduling disrupts early-stage structure formation, causing scale drift and unnatural light-source placement.
Figure 5: User study results for negation-aware video generation. 50 participants evaluated four anonymized methods across four criteria: Negation Satisfaction, Constraint Meaning Accuracy, Scene & Action Alignment, and Artifact Avoidance (left). Our method consistently achieved the highest average ratings across all dimensions (4.54–4.84), substantially outperforming Mochi, HunyuanVideo, and CogVideoX. In the overall preference comparison (right), 77.5% of votes favored our method, compared to 13.8% for HunyuanVideo, 8.8% for Mochi, and 0.0% for CogVideoX.
...and 27 more figures
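
The half-space projection described in the Figure 2 caption can be illustrated with a small sketch. It assumes the standard closed-form Euclidean projection onto a single half-space $a_t^\top \delta \le b_t$; the function names, the toy 2-D vectors, and the linear schedule for $b_t$ are hypothetical choices made for illustration and are not taken from the paper, which may use a different schedule or additional terms (such as the repulsive energy mentioned in Figure 4).

```python
from typing import List, Sequence


def project_onto_half_space(delta_ref: Sequence[float],
                            a: Sequence[float],
                            b: float) -> List[float]:
    """Euclidean projection of delta_ref onto {delta : a^T delta <= b}."""
    a_dot_delta = sum(ai * di for ai, di in zip(a, delta_ref))
    a_norm_sq = sum(ai * ai for ai in a)
    # Already feasible, or no usable negation direction: keep the update as is.
    if a_norm_sq == 0.0 or a_dot_delta <= b:
        return list(delta_ref)
    # Otherwise remove just enough of the component along a to reach the boundary.
    step = (a_dot_delta - b) / a_norm_sq
    return [di - step * ai for ai, di in zip(a, delta_ref)]


def linear_threshold(t: float, b_loose: float = 1.0, b_strict: float = 0.0) -> float:
    """Hypothetical schedule: b_t tightens linearly as t runs from 1 to 0."""
    return b_strict + t * (b_loose - b_strict)


if __name__ == "__main__":
    # Toy 2-D stand-ins for the noise predictions (assumed values).
    gamma = 2.0
    eps_text, eps_neg, eps_uncond = [1.0, 0.5], [1.0, 0.0], [0.0, 0.0]

    # delta_ref = gamma * (eps_text - eps_uncond), a_t = eps_neg - eps_uncond
    delta_ref = [gamma * (x - u) for x, u in zip(eps_text, eps_uncond)]
    a_t = [n - u for n, u in zip(eps_neg, eps_uncond)]

    for t in (1.0, 0.0):  # early (loose) step, then final (strict) step
        b_t = linear_threshold(t)
        print(t, b_t, project_onto_half_space(delta_ref, a_t, b_t))
```

In this toy case the reference increment is `[2.0, 1.0]` and the negation direction is `[1.0, 0.0]`. With the loose threshold at `t = 1.0` the projection returns `[1.0, 1.0]`, and with the strict threshold at `t = 0.0` it returns `[0.0, 1.0]`: the component aligned with the negated concept is reduced only as far as the constraint requires, while the orthogonal component is left untouched.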
