LLaVAction: evaluating and training multi-modal large language models for action understanding

Haozhe Qi, Shaokai Ye, Alexander Mathis, Mackenzie W. Mathis

TL;DR

This work targets fine-grained action understanding with multimodal large language models by reframing EPIC-KITCHENS-100 into EPIC-KITCHENS-100-MQA, highlighting deficiencies of existing MLLMs in discriminating closely related actions. It introduces LLaVAction, which uses a learnable action token and a two-stage pipeline to produce structured actions, and complements this with a data reformulation regime including hard action recognition, captioning, QA, temporal detection, and prior-action cues. Through extensive experiments, LLaVAction achieves substantial gains on EPIC-KITCHENS-100-MQA (up to 21 points over GPT-4o), SOTA action recognition on several benchmarks, and strong zero-shot/generalization performance across diverse datasets and MLLM benchmarks. The results demonstrate that targeted hard-example mining and action-aware architectural components can unlock robust, generalizable action understanding in MLLMs with practical impact for complex action tasks.

Abstract

Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Emerging multimodal large language models (MLLMs) are promising candidates, but their fine-grained action understanding ability has not been fully examined. In this work, we reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action recognition datasets, into an MLLM benchmark (EPIC-KITCHENS-100-MQA). We demonstrate that when we sample difficult answers based on specialist models as distractors, leading MLLMs struggle to recognize the correct actions. How can we increase the performance of MLLMs? We curated a supervised finetuning dataset that includes "hard" action recognition, temporal detection, captioning, and free-form question answering to improve models' diverse action understanding capabilities. We introduce a new model called LLaVAction that adds an action token to boost models' attention on visual tokens and a two-stage pipeline to obtain structured actions. LLaVAction greatly improves MLLMs' action understanding, achieving strong improvements on both MLLM benchmarks (21 points in accuracy over GPT-4o on EPIC-KITCHENS-100-MQA) and established action recognition benchmarks, suggesting that our methods prepare MLLMs to be a promising path forward for complex action tasks. Code, data, the benchmark, and models are available at https://github.com/AdaptiveMotorControlLab/LLaVAction.
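The idea of using a specialist model's predictions as distractors can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_mqa_options`, the `specialist_scores` dictionary, the number of options, and the rule of taking the top-ranked wrong actions are all assumptions made here for illustration.

```python
import random


def build_mqa_options(ground_truth, specialist_scores, num_options=5, seed=0):
    """Build one multiple-choice question from specialist predictions.

    ground_truth: the annotated action label for the clip.
    specialist_scores: hypothetical dict mapping action label -> confidence
        produced by a specialist action recognition model.
    Returns (options, answer_index).
    """
    # Rank the specialist's candidate actions from most to least confident.
    ranked = sorted(specialist_scores, key=specialist_scores.get, reverse=True)
    # Hard distractors (assumed rule): the highest-ranked actions that are wrong.
    distractors = [a for a in ranked if a != ground_truth][: num_options - 1]
    options = distractors + [ground_truth]
    # Shuffle so the position of the correct answer carries no information.
    random.Random(seed).shuffle(options)
    return options, options.index(ground_truth)


# Toy scores for a single clip (made-up values, not from the dataset).
scores = {
    "take spatula": 0.50,
    "put spatula": 0.30,
    "take spoon": 0.10,
    "open drawer": 0.05,
    "wash pan": 0.03,
    "cut onion": 0.02,
}
options, answer_index = build_mqa_options("take spatula", scores, num_options=4)
```

With random distractors, the wrong options would typically be unrelated actions; here they share the verb or the noun with the ground truth, which is the kind of confusion the harder benchmark is meant to probe.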

Paper Structure

This paper contains 40 sections, 5 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: LLaVAction-7B. Left: Qualitative inspection of distractors. We show an example clip with labels from random choices (which is empirically easy to solve) vs. our proposed harder benchmark with action labels generated by a SOTA specialist (TIM; Chalk et al., 2024). Our hard example mining strategy can automatically explore challenges such as temporal order and similar objects that are curated in other benchmarks. Right: While GPT-4o is strong when identifying correct answers among a few random choices due to the large number of possible actions, it suffers in the harder benchmarking regime, and our method, LLaVAction, outperforms GPT-4o.
  • Figure 2: LLaVAction pipeline. Trained with our action-related MLLM reformulated data, LLaVAction outputs captions, action tokens and open-ended and multi-choice QAs. Our two-stage pipeline further enables LLaVAction to output structured action.
  • Figure 3: Qualitative results. LLaVAction-7B consistently outperforms GPT-4o and LLaVA-Video-7B when tested on hard distractors. Bold option denotes ground truth, and the icons denote the selection of the models. See also the qualitative examples in the Appendix.
  • Figure 4: Qualitative attention for one clip. Anecdotally, LLaVA-Video mainly attends to the wooden spatula that is placed in the drawer, whereas LLaVAction also attends to the arms and, correctly, the plastic spatula that is being taken. We quantify visual-text token correlations in the main text.
  • Figure 5: Action labels vs. narrations. Blue option denotes ground truth and the pink option denotes LLaVAction-7B's prediction. Action labels usually reduce multiple nouns into one noun, resulting in ambiguity that could mislead an MLLM. Note that the narration also contains the particles of phrasal verbs, such as "put down" and "put into", which are crucial for clarifying the meaning.
  • ...and 3 more figures