
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna

Abstract

Videos are unique in their ability to capture actions that transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a renewed focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
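To make the binary few-shot protocol described above concrete, the sketch below illustrates one way such a query could be assembled and scored: the model sees $k$ demonstration clips of a candidate action, then a query clip, and must answer yes/no, with accuracy compared against the 50% chance baseline. This is a minimal illustration, not the authors' released evaluation code; the chat-style message schema and the helper names `build_binary_fewshot_prompt` and `binary_accuracy` are assumptions for exposition.

```python
# Minimal sketch of a k-shot binary action-recognition query (hypothetical
# message schema; not the VideoNet authors' evaluation code).
from typing import Dict, List


def build_binary_fewshot_prompt(action: str,
                                demo_clips: List[str],
                                query_clip: str) -> List[Dict[str, str]]:
    """Assemble a k-shot binary query for one candidate action.

    Each message pairs a video reference with an instruction; the final
    message asks the yes/no question about the query clip.
    """
    messages = []
    for i, clip in enumerate(demo_clips, start=1):
        messages.append({
            "role": "user",
            "video": clip,
            "text": f"Example {i}: this clip shows the action '{action}'.",
        })
    messages.append({
        "role": "user",
        "video": query_clip,
        "text": (f"Does this clip show the action '{action}'? "
                 "Answer 'yes' or 'no'."),
    })
    return messages


def binary_accuracy(predictions: List[str], labels: List[bool]) -> float:
    """Fraction of yes/no answers matching the ground truth (chance = 50%)."""
    correct = sum(pred.strip().lower().startswith("yes") == label
                  for pred, label in zip(predictions, labels))
    return correct / len(labels)
```

With an empty `demo_clips` list the prompt contains only the query clip, which corresponds to the 0-shot binary setting; passing one to three demonstration clips corresponds to the $k\in\{1,2,3\}$ few-shot settings reported above.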

Paper Structure

This paper contains 40 sections, 21 figures, and 22 tables.

Figures (21)

  • Figure 1: Q&A examples from VideoNet. We provide two evaluation settings: multiple-choice and few-shot binary. The former focuses on the core task of domain-specific action recognition; the latter focuses on a model's ability to learn from in-context videos. (The prompts above have been simplified for succinctness.)
  • Figure 2: Video samples from all 7 categories and 37 domains in VideoNet. The benchmark's videos can be explored on the project website (https://tanu.sh/videonet/data).
  • Figure 3: Benchmark data collection pipeline, as described in Section \ref{subsec:benchmark_collection}. Given an action name and definition, humans (1) find clips on the web, (2) remove outliers among these clips, and (3) fix the clip trimmings. This pipeline yields five well-trimmed clips per action.
  • Figure 4: Ablations on video input configurations in the binary 0-shot setting. Open models show limited gains from full-video input, indicating difficulty in effectively leveraging video context. (A notable exception is our model, which benefits significantly.) GPT-5.4 shows only a slight improvement at higher fps, suggesting that test-time scaling via denser video sampling is insufficient for solving domain-specific action recognition.
  • Figure 5: Binary few-shot accuracy of VLMs and humans with $k$ in-context video demonstrations. Humans (dotted lines) benefit significantly more than models (solid lines). Among models, there is great variation in their ability to exploit few-shot examples. For example, Gemini 3.1 Pro (red) loses 4.8 percentage points of accuracy while Qwen3-VL (green) gains 7.0 points from $k=0$ to $k=3$ in-context examples.
  • ...and 16 more figures