Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions

Lan Wang; Vishnu Boddeti; Sernam Lim

TL;DR

This work introduces a novel text-to-pose video editing method that can achieve effective action editing and even imaginary editing from counterfactual questions, and introduces a new evaluation dataset, WhatifVideo-1.0.

Abstract

We introduce a novel text-to-pose video editing method, ReimaginedAct. While existing video editing tasks are limited to changes in attributes, backgrounds, and styles, our method aims to predict open-ended human action changes in video. Moreover, our method accepts not only direct instructional text prompts but also "what if" questions to predict possible action changes. ReimaginedAct comprises video understanding, reasoning, and editing modules. First, an LLM produces a plausible answer to the instruction or question. This answer is then used for (1) prompting Grounded-SAM to produce bounding boxes of the relevant individuals and (2) retrieving pose videos from a set that we have collected for editing human actions. The retrieved pose videos and the detected individuals are then used to alter the poses extracted from the original video. We also employ a timestep blending module to ensure the edited video retains its original content except where modifications are necessary. To facilitate research in text-to-pose video editing, we introduce a new evaluation dataset, WhatifVideo-1.0. This dataset includes videos of different scenarios spanning a range of difficulty levels, along with questions and text prompts. Experimental results demonstrate that existing video editing methods struggle with human action editing, while our approach achieves effective action editing and even imaginary editing from counterfactual questions.
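The abstract does not spell out how the timestep blending module works. As a rough illustration only, the sketch below assumes a mask-based scheme in which, for an initial number of denoising steps, regions outside an edit mask are copied from the original video's latent. The function name `blend_step`, the `blend_steps` threshold, and the binary mask are all hypothetical and are not taken from the paper.

```python
from typing import List

Latent = List[List[float]]  # toy 2D latent; real latents would be tensors


def blend_step(edited: Latent, original: Latent, mask: Latent,
               step: int, blend_steps: int) -> Latent:
    """Blend the latents of one denoising step (hypothetical scheme).

    While step < blend_steps, positions where mask == 0 are taken from
    `original` and positions where mask == 1 are taken from `edited`.
    From step == blend_steps onward the edited latent is returned as-is.
    """
    if step >= blend_steps:
        # Blending window is over: pass the edited latent through.
        return [row[:] for row in edited]
    return [
        [m * e + (1.0 - m) * o for e, o, m in zip(e_row, o_row, m_row)]
        for e_row, o_row, m_row in zip(edited, original, mask)
    ]


# Toy usage: only the top-left position is marked as editable.
original = [[0.0, 0.0], [0.0, 0.0]]
edited = [[1.0, 1.0], [1.0, 1.0]]
mask = [[1.0, 0.0], [0.0, 0.0]]
blended = blend_step(edited, original, mask, step=0, blend_steps=5)
```

In this toy case only the masked position takes the edited value while the rest keeps the original content, which is the behavior the abstract attributes to the module. Which steps are blended and how the mask is derived are assumptions here.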

Paper Structure (12 sections, 6 equations, 8 figures, 2 tables)

Introduction
Related Work
Text-to-pose Video Editing
Problem Definition
WhatifVideo-1.0 Dataset
Method
Diffusion Model
Text-to-pose Video Editing
Experiments
Implementation Detail
Baseline Comparisons
Conclusion

Figures (8)

Figure 1: Conventional video editing and text-to-pose video editing: (a) Conventional video editing directly edits a video [wu2023tune, qi2023fatezero]. Similarly, text-guided video editing uses a text prompt to edit the video's objects, background, style, or other attributes. (b) Text-to-pose video editing: directly manipulating human actions in videos using target prompts. Text-to-pose video editing with a question: a what-if question is asked, and the video must be modified to support the answer to that question. This is much more challenging than conventional video editing because the question is potentially open-ended and calls for video editing capable of accomplishing any required modification.
Figure 2: Overview of ReimaginedAct: Given a video and a question/instruction, ReimaginedAct first uses an LLM to obtain an answer. Using the answer as a query, we conduct pose matching, after which the retrieved pose is first aligned and merged with the original pose before being used to condition our diffusion model. To handle scenarios where there could be one or more individuals, ReimaginedAct also contains a pose editing module. The pose editing module runs a Grounded-SAM model that can disambiguate the individuals needing modification based on the LLM's response.
Figure 3: Overview of WhatifVideo-1.0 Dataset: (a) Different video categories. (b) Different scenarios with a single or multiple people and with human-object interactions. (c) Part of the dataset includes recorded videos with original videos, counterfactual questions, and associated ground truth counterfactual videos.
Figure 4: An example of a failure case. The top row is the input video, while the second is the output video. The LLM's answer contains two actions: "stop walking" and "look". When retrieving from the pose database, only the first action was retrieved, resulting in the final video showing the woman stopping but not looking at the dog. Our current version of ReimaginedAct contains pose videos that combine multiple actions, but not "stop walking and look". Having all possible permutations of poses in the database is also intractable. We will leave this shortcoming of ReimaginedAct to future work.
Figure 5: Example prompting template.
...and 3 more figures
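The Figure 2 caption describes using the LLM's answer as a query for pose matching and then aligning the retrieved pose with the original one. The sketch below illustrates that flow under strong simplifying assumptions: word overlap stands in for whatever matching the authors use, a pose is a single list of 2D keypoints rather than a video, and alignment is a plain scale-and-translate into a bounding box of the kind a detector such as Grounded-SAM would return. The names `retrieve_pose` and `align_to_box` and the toy database are hypothetical.

```python
from typing import Dict, List, Tuple

Keypoints = List[Tuple[float, float]]
Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def retrieve_pose(answer: str, database: Dict[str, Keypoints]) -> str:
    """Return the database label sharing the most words with the answer.

    Word overlap is a stand-in similarity, not the paper's matcher.
    """
    answer_words = set(answer.lower().split())
    return max(
        database,
        key=lambda label: len(answer_words & set(label.lower().split())),
    )


def align_to_box(pose: Keypoints, box: Box) -> Keypoints:
    """Scale and translate keypoints so their extent fills `box`."""
    xs = [x for x, _ in pose]
    ys = [y for _, y in pose]
    src_w = (max(xs) - min(xs)) or 1.0  # avoid division by zero
    src_h = (max(ys) - min(ys)) or 1.0
    x0, y0, x1, y1 = box
    return [
        (x0 + (x - min(xs)) / src_w * (x1 - x0),
         y0 + (y - min(ys)) / src_h * (y1 - y0))
        for x, y in pose
    ]


# Toy database of action labels and two-keypoint poses (made-up values).
database = {
    "stop walking": [(0.0, 0.0), (1.0, 2.0)],
    "wave hand": [(0.0, 0.0), (2.0, 2.0)],
}
label = retrieve_pose("The woman would stop walking and look at the dog", database)
aligned = align_to_box(database[label], (10.0, 20.0, 30.0, 60.0))
```

Because this toy retrieval returns a single best label, a compound answer such as "stop walking and look" yields only "stop walking", which loosely mirrors the failure case described in Figure 4.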
