Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Himangi Mittal; Nakul Agarwal; Shao-Yuan Lo; Kwonjoon Lee

TL;DR

This work tackles action anticipation with an emphasis on plausibility. It introduces PlausiVL, a large video-language model that uses a Q-former-based visual encoder to align video features with an LLM. The model is trained with two objective functions: $\mathcal{L}_{\text{plau}}$, which uses counterfactuals generated from temporal and verb-noun constraints to learn temporally plausible futures, and $\mathcal{L}_{\text{rep}}$, which imposes a long-horizon penalty to reduce repetition and increase diversity. Combining the two losses yields action sequences that are more temporally accurate and more diverse, with improvements over baselines on Ego4D and EPIC-Kitchens-100. The approach aims to make predicted futures more realistic and more useful for real-world decision-making and planning in AI systems.

Abstract

We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the plausibility of an action sequence. To address this limitation, we explore the generative capability of a large video-language model and develop its understanding of plausibility in an action sequence by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We use temporal logical constraints as well as verb-noun action pair logical constraints to create implausible (counterfactual) action sequences and use them to train the model with the plausible action sequence learning loss. This loss helps the model differentiate between plausible and implausible action sequences and also helps it learn implicit temporal cues crucial for action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.
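To make the counterfactual idea concrete, the following is a minimal sketch of how an implausible sequence could be derived from a plausible one. It is an illustration under stated assumptions, not the authors' implementation: the rule sets `PRECEDENCE` and `IMPLAUSIBLE_PAIRS`, the function names, and the example actions are all hypothetical, and the abstract does not specify how the constraints are actually encoded.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical temporal rule: (a, b) means action a must occur before action b.
PRECEDENCE: Set[Tuple[str, str]] = {("crack eggs", "cook omelette")}
# Hypothetical verb-noun pairs treated as implausible.
IMPLAUSIBLE_PAIRS: Set[Tuple[str, str]] = {("cook", "knife"), ("crack", "water")}


def violates_temporal(seq: List[str]) -> bool:
    """Return True if some rule (a, b) has b appearing before a in seq."""
    for a, b in PRECEDENCE:
        if a in seq and b in seq and seq.index(b) < seq.index(a):
            return True
    return False


def violates_verb_noun(seq: List[str]) -> bool:
    """Return True if any action's (verb, noun) pair is in the implausible set."""
    return any(tuple(action.split(" ", 1)) in IMPLAUSIBLE_PAIRS for action in seq)


def temporal_counterfactual(seq: List[str]) -> Optional[List[str]]:
    """Swap the first rule-governed pair so the copy breaks that rule.

    Returns None when no rule applies to the sequence.
    """
    for a, b in PRECEDENCE:
        if a in seq and b in seq:
            i, j = seq.index(a), seq.index(b)
            if i < j:
                out = list(seq)  # copy so the plausible sequence is untouched
                out[i], out[j] = out[j], out[i]
                return out
    return None


plausible = ["crack eggs", "whisk eggs", "cook omelette"]
counterfactual = temporal_counterfactual(plausible)
```

In this sketch the plausible sequence serves as the positive text and the swapped copy as the counterfactual text; a real pipeline would need a far richer rule set than the single precedence pair assumed here.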

Paper Structure (16 sections, 9 equations, 6 figures, 8 tables)

Introduction
Related Works
Method
Model Architecture
Training
Plausible Action Sequence Learning loss
Long-Horizon Action Repetition Loss
Experiments
Implementation Details
Experimental Setup
Discussion of Results
Conclusion
Implementation Details
Metrics
Quantitative Analysis
...and 1 more sections

Figures (6)

Figure 1: We present a large video-language model for learning to anticipate action sequences that are plausible in the real world. We show an example of a kitchen-based environment. We leverage the generative capabilities of a large video-language model to anticipate future actions and further train the model with two objective functions: a plausible action sequence learning loss and a long-horizon action repetition loss. Without the plausible action sequence learning loss, the model has less temporal understanding and generates a temporally implausible action sequence, cook omelette $\not\rightarrow$ crack eggs. Similarly, without the long-horizon action repetition loss, the model generates less diverse actions and repeats the same action, whisk eggs $\rightarrow$ whisk eggs $\rightarrow$ whisk eggs. When the model is trained with the two objective functions combined, our method generates plausible action sequences that are temporally accurate, crack eggs $\rightarrow$ cook omelette, and more diverse with less repetition, whisk eggs $\rightarrow$ whisk eggs $\rightarrow$ cook omelette.
Figure 2: Model diagram. (a) PlausiVL: Given a video, a frozen visual encoder and a Q-former with $k$ query tokens are used to extract frame-level representations, which are further concatenated with a frame position embedding layer to add temporal understanding. Next, the representations are passed through the video Q-former, and a linear layer projects these features into the LLM space. These visual embeddings (visual prompts) are concatenated with text prompts to get the desired output text (see Model Architecture). (b) Augmentation: For plausible action anticipation, we use logical rules to create counterfactual implausible action sequences. Given an input video, we create a positive augmentation of the video and a negative augmentation by using temporal logical and verb-noun action pair constraints (see Plausible Action Sequence Learning loss). (c) Objective Functions and Training: We train our model with two losses: (i) Plausible Action Sequence Learning Loss ($\mathcal{L}_{\text{plau}}$), which aligns the original video-plausible text pair closer to the positive augmentation of the video-plausible text pair and pushes it far apart from the video-counterfactual text pair (see Plausible Action Sequence Learning loss); (ii) Long-Horizon Action Repetition Loss ($\mathcal{L}_{\text{rep}}$), which ensures diverse and less repetitive actions by adding a higher penalty to the later tokens (mix mixture and wipe hands) and a lower penalty to immediate future actions (pour water, pour water). The graph shows the linearly increasing $\gamma$ penalty for the tokens over the long horizon (see Long-Horizon Action Repetition Loss).
Figure 3: Qualitative Results: Given a video, the top blue box shows the prediction from PlausiVL and the green box contains the ground truth action sequence for reference. We can observe that PlausiVL is able to generate action sequences that satisfy the temporal logic constraints and are diverse with fewer repetitions. The predicted action sequence is also closer to the ground truth action sequence.
Figure 4: Analysis of $\tau_a$ vs. verb-noun class-mean Top-5 recall (%) accuracy ($\uparrow$) on EK100.
Figure 5: Analysis of plausibility in the generated action sequence: The black line represents our method and the orange line is the baseline, Video-LLaMA. Comparing the two line plots, we can observe that PlausiVL satisfies more temporal and action constraints over training than Video-LLaMA, indicating that the objective functions $\mathcal{L}_{\text{plau}}$ and $\mathcal{L}_{\text{rep}}$ help the model learn the temporal cues needed to generate plausible action sequences for action anticipation.
...and 1 more figures
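
The two objectives described in the Figure 2 caption can be illustrated with a small sketch. Everything here is an assumption made for illustration: the captions state only that the $\gamma$ penalty increases linearly over the horizon and that the plausible pair is pulled closer while the counterfactual pair is pushed apart. The penalty range, the weighted negative log-likelihood form, and the hinge (margin) loss are stand-ins chosen for simplicity, not the paper's equations, and all function names are hypothetical.

```python
import math
from typing import List


def linear_gammas(num_tokens: int, gamma_min: float = 1.0,
                  gamma_max: float = 2.0) -> List[float]:
    """Penalty weights growing linearly from gamma_min to gamma_max.

    The range [1.0, 2.0] is a hypothetical choice.
    """
    if num_tokens == 1:
        return [gamma_min]
    step = (gamma_max - gamma_min) / (num_tokens - 1)
    return [gamma_min + step * t for t in range(num_tokens)]


def weighted_nll(target_probs: List[float], gammas: List[float]) -> float:
    """Mean negative log-likelihood with a per-token penalty weight.

    target_probs[t] is the probability the model gave the correct token t,
    so later tokens (larger gamma) cost more when they are wrong.
    """
    terms = [-g * math.log(p) for p, g in zip(target_probs, gammas)]
    return sum(terms) / len(terms)


def margin_plausibility_loss(sim_pos: float, sim_neg: float,
                             margin: float = 0.2) -> float:
    """Hinge-style stand-in for the plausibility objective.

    sim_pos: similarity of the original pair to its positive augmentation.
    sim_neg: similarity of the original pair to the counterfactual pair.
    The loss is zero once sim_pos exceeds sim_neg by at least the margin.
    """
    return max(0.0, margin - sim_pos + sim_neg)


gammas = linear_gammas(3)
late_error = weighted_nll([0.9, 0.9, 0.1], gammas)   # mistake at the last token
early_error = weighted_nll([0.1, 0.9, 0.9], gammas)  # mistake at the first token
```

Under these assumed weights, an error on the last token is penalized more than the same error on the first token, which mirrors the caption's description of a higher penalty for later tokens.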
