Video Relationship Detection Using Mixture of Experts

Ala Shaabana, Zahra Gharaee, Paul Fieguth

TL;DR

MoE-VRD employs an ensemble of networks while preserving the complexity and computational cost of the original underlying visual relationship model by applying a sparsely-gated mixture of experts, which allows for conditional computation and a significant gain in neural network capacity.

Abstract

Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of <subject, predicate, object> tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.

Paper Structure (17 sections, 13 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: A Mixture of Experts (MoE) layer as described by Shazeer et al. (2017).
  • Figure 2: Visual relationship detection framework proposed by Shang et al. (2021), which is used as the basis of our expert.
  • Figure 3: An illustration of the MoE-VRD architecture proposed in this article. Raw RGB images are taken as input; for each given image frame the subject and object tracklets are extracted and given to the feature extraction network together with bounding box information in order to generate visual and relative positional features representing all three entities: subject, predicate and object. The visual and positional features are applied as the input to our experts and gating networks. Every expert outputs a score corresponding to each entity, which represents both visual and preferential predictions. The gating network outputs a sparsely gated vector, which evaluates each expert's learning. Selecting the top $K$ experts, the sum-product of the sparsely gated expert scores is calculated and represented as the output of our MoE-VRD architecture.
  • Figure 4: mAP of the MoE-VRD approach with $N=10$ experts, as a function of $K$ during training. Note that performance drops after $K=2$: because the architecture averages the selected experts' scores before the final output, well-performing experts may be drowned out by more poorly performing peers if $K$ is set too large.
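
The top-$K$ gating and sum-product described for Figure 3 can be illustrated with a minimal sketch. It is not the authors' implementation. The names `top_k_gates` and `moe_forward` are hypothetical, the gate logits are passed in directly rather than produced by a learned gating network, and the noise term of the gating scheme in Shazeer et al. is omitted. Experts are modelled as plain callables that return a list of scores.

```python
import math


def top_k_gates(gate_logits, k):
    """Return {expert_index: weight} for the k largest gate logits.

    Weights are a softmax over the selected logits only, so every
    unselected expert implicitly receives a weight of zero.
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    peak = max(gate_logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k):
    """Gate-weighted sum of the scores of the k selected experts.

    Only the selected experts are evaluated, which is the conditional
    computation that keeps the cost independent of the expert count.
    """
    gates = top_k_gates(gate_logits, k)
    out = None
    for i, weight in gates.items():
        scores = experts[i](x)  # hypothetical expert: callable -> list of floats
        if out is None:
            out = [0.0] * len(scores)
        for j, s in enumerate(scores):
            out[j] += weight * s
    return out


if __name__ == "__main__":
    # Three toy experts; with k=2 only the two highest-gated ones run.
    toy_experts = [lambda x, c=c: [c * x, c + x] for c in (1.0, 2.0, 3.0)]
    print(moe_forward(2.0, toy_experts, [0.0, 5.0, 5.0], k=2))
```

Because the selected weights sum to one, raising `k` spreads that weight over more experts. This is consistent with the drop after $K=2$ noted for Figure 4, where weaker experts dilute the stronger ones.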