Making Video Models Adhere to User Intent with Minor Adjustments

Daniel Ajisafe; Eric Hedlin; Helge Rhodin; Kwang Moo Yi

Abstract

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular means of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are moving the bounding boxes to places the model is familiar with. Surprisingly, we find that even small modifications can change the quality of generations significantly. To enable this optimization, we propose a smooth mask that makes the bounding box position differentiable and an attention-maximization objective that we use to adjust the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.
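To make the idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single 2D attention map, a box parameterized by a normalized center with fixed size, a sigmoid-edged soft mask, and a simple score (fraction of attention inside the box minus a squared-drift penalty). All function names, the `sharpness` and `reg_weight` parameters, and the use of finite differences in place of automatic differentiation are hypothetical choices made for illustration only.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def smooth_box_mask(center, size, height, width, sharpness=20.0):
    """Soft mask in [0, 1] for a box given by normalized center and size.

    Assumption: sigmoid edges stand in for the paper's smooth mask.
    """
    cx, cy = center
    bw, bh = size

    def edge(t, lo, hi):
        # Close to 1 between lo and hi, decaying smoothly outside.
        return _sigmoid(sharpness * (t - lo)) * _sigmoid(sharpness * (hi - t))

    cols = [edge((j + 0.5) / width, cx - bw / 2, cx + bw / 2) for j in range(width)]
    rows = [edge((i + 0.5) / height, cy - bh / 2, cy + bh / 2) for i in range(height)]
    return [[r * c for c in cols] for r in rows]


def score(center, size, attention, user_center, reg_weight=5.0):
    """Hypothetical objective: attention captured by the box, minus drift."""
    h, w = len(attention), len(attention[0])
    mask = smooth_box_mask(center, size, h, w)
    inside = sum(mask[i][j] * attention[i][j] for i in range(h) for j in range(w))
    total = sum(sum(row) for row in attention)
    drift = sum((c - u) ** 2 for c, u in zip(center, user_center))
    return inside / total - reg_weight * drift


def optimize_center(user_center, size, attention, steps=100, lr=0.02, eps=1e-4):
    """Gradient ascent on the box center, starting from the user's box.

    Central finite differences replace autograd to keep the sketch
    dependency-free; the smooth mask is what makes this gradient useful.
    """
    center = list(user_center)
    for _ in range(steps):
        grad = []
        for k in range(2):
            plus, minus = center[:], center[:]
            plus[k] += eps
            minus[k] -= eps
            diff = score(plus, size, attention, user_center) - score(
                minus, size, attention, user_center
            )
            grad.append(diff / (2 * eps))
        center = [c + lr * g for c, g in zip(center, grad)]
    return tuple(center)


def make_blob(height, width, cx, cy, sigma=0.15):
    """Toy attention map: a Gaussian blob centered at (cx, cy)."""
    return [
        [
            math.exp(
                -(((j + 0.5) / width - cx) ** 2 + ((i + 0.5) / height - cy) ** 2)
                / (2 * sigma ** 2)
            )
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy usage: attention peaks slightly to the right of the user's box.
attention = make_blob(16, 16, cx=0.6, cy=0.5)
user_center = (0.4, 0.5)
adjusted = optimize_center(user_center, (0.3, 0.3), attention)
print("user center:", user_center, "adjusted center:", adjusted)
```

In this toy setting the drift penalty plays the role of staying close to user intent: the box moves toward the attention peak but stops short of it. A hard 0/1 mask would give zero gradient almost everywhere, which is why some form of smoothing is needed before the box position can be optimized.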

Paper Structure

This paper contains 45 sections, 11 equations, 16 figures, 6 tables, 1 algorithm.

Introduction
Related Work
Text-to-image generative models.
Text-to-image generative models with box control.
Text-to-video generative models with box control.
Method
Preliminary: Trailblazer [ma2024trailblazer]
Differentiable attention map editing
Smooth masks.
Differentiable editing.
Optimizing bounding boxes to align with attention maps
Preventing attention from completely ignoring outside the box.
Regularizing to remain close to user intent.
Final optimization objective.
Implementation details
...and 30 more sections

Figures (16)

Figure 1: Teaser -- We show example bounding-box-controlled video generations with (left) Trailblazer [ma2024trailblazer] in T2V-Turbo [li2024t2v] and (right) the original Trailblazer [ma2024trailblazer]. On top is the original control signal and below are our adjusted bounding boxes. We show the original bounding boxes in blue and the adjusted boxes in orange. While the modification is subtle, the difference in the quality of generation is large. We modify bounding boxes to adhere better to the cross-attention maps within the video models.
Figure 2: Overview -- We inject bounding box control into video diffusion models by editing the cross-attention maps within the network. However, not all such edits are friendly to video diffusion models, as the models are not trained with them. Thus, when applying these edits, we make sure that the editing process is differentiable (see "Differentiable attention map editing") and adjust the edit parameters so that the network behaves as intended, with attention focused on the desired regions (see "Optimizing bounding boxes to align with attention maps"). We show the original bounding boxes in blue and the adjusted bounding boxes in orange. Though the adjustments are minimal and close to the original user input, they create a drastic difference in video generation quality and adherence to the bounding boxes.
Figure 3: Example of attention map editing -- We show an example of the generated video frame with the desired user control bounding box, and the associated attention map edits. While effective, Trailblazer [ma2024trailblazer] relies on a replacement operation that is not differentiable with respect to the box parameters. Our method, on the other hand, performs a smooth differentiable edit.
Figure 4: User study interface -- We show a sample of the user study interface for the prompt 'the lion is walking'. Users are asked to select their preference over a set of five video generations, provided in a random order. Users are allowed to select multiple choices or no preference. In this example, only the generation results are shown, without the user control, to isolate quality from control. For this question, C and E look preferable, but the control box is located at the bottom left, which is where the lion appears in A, B, and D. C and E completely ignore user control, yet generate a preferred view of the prompt. As we are interested in controlled generation, we consider answers on both quality preference and trajectory faithfulness when evaluating the quality of generations.
Figure 5: Qualitative results comparing Peekaboo [jain2024peekaboo] and our method (full) -- Each row shows two representative frames per method (left to right: Peekaboo, Ours). Our method yields better generation quality and improved alignment with the text prompt.
...and 11 more figures
