Actions and Objects Pathways for Domain Adaptation in Video Question Answering

Safaa Abdullahi Moallim Mohamud, Ho-Young Jung

TL;DR

The paper tackles generalization in VideoQA to unseen domains without fully fine-tuning large pretrained models. It proposes AOPath, a brain-inspired, two-pathway classifier that dissociates pretrained features into action and object streams via AOExtractor, a module with no trainable weights that uses cosine similarity to dictionaries to obtain domain-agnostic representations. On TVQA genre-based splits, AOPath outperforms conventional classifiers by 5% out of domain and 4% in domain, while using orders of magnitude fewer trainable parameters than large baselines. The approach is supported by ablations and qualitative analyses that highlight its efficiency, interpretability, and potential for robust cross-domain VideoQA performance.

Abstract

In this paper, we introduce the Actions and Objects Pathways (AOPath) for out-of-domain generalization in video question answering tasks. AOPath leverages features from a large pretrained model to enhance generalizability without the need for explicit training on the unseen domains. Inspired by the human brain, AOPath dissociates the pretrained features into action and object features, and subsequently processes them through separate reasoning pathways. It utilizes a novel module that converts out-of-domain features into domain-agnostic features without introducing any trainable weights. We validate the proposed approach on the TVQA dataset, which is partitioned into multiple subsets based on genre to facilitate the assessment of generalizability. The proposed approach outperforms conventional classifiers by 5% on out-of-domain datasets and by 4% on in-domain datasets. It also outperforms prior methods that train millions of parameters, whereas the proposed approach trains very few parameters.
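
As a rough illustration of how a weight-free module could map features onto action and object streams, the sketch below scores one feature vector against two small dictionaries using cosine similarity. It is a minimal sketch under stated assumptions, not the authors' AOExtractor: the function names, the three-dimensional vectors, and the dictionary entries are all hypothetical, and this section does not specify how the paper builds its dictionaries or pools the scores.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project_onto_dictionary(feature, dictionary):
    """Score a feature against every dictionary entry; no trainable weights are involved."""
    return [cosine_similarity(feature, entry) for entry in dictionary]


# Hypothetical toy dictionaries (assumption: in practice these would be
# embeddings of action words and object words from a pretrained model).
action_dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
object_dictionary = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

# Hypothetical pretrained feature for one video segment.
feature = [2.0, 0.0, 0.0]

action_scores = project_onto_dictionary(feature, action_dictionary)
object_scores = project_onto_dictionary(feature, object_dictionary)

# Rescaling the feature leaves the scores unchanged, because cosine
# similarity depends only on direction.
rescaled_scores = project_onto_dictionary([4.0, 0.0, 0.0], action_dictionary)
```

Because the scores depend only on the direction of the feature relative to fixed dictionary entries, the resulting representation lives in the same dictionary-indexed space regardless of which domain produced the feature, which is one plausible reading of "domain-agnostic" here.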

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The proposed AOPath task-specific classifier.
  • Figure 2: The AOExtractor module for audio and text features.
  • Figure 3: The pathways classifier of the proposed approach.
  • Figure 4: Comparison of different classifiers with respect to average accuracy, floating-point operations (FLOPs), and the number of trainable weights.
  • Figure 5: Qualitative results of attention weights in the pathways classifier. The green-colored boxes display the pathway weights for audio features, while the orange-colored boxes depict the pathway weights for text features. Boxes outlined in blue represent the weights of the object pathways, and those outlined in red represent the weights of the action pathways.