TIM: A Time Interval Machine for Audio-Visual Action Recognition

Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen

TL;DR

TIM tackles the understanding of long, untrimmed audio-visual video by introducing modality-specific time interval queries that guide a transformer to attend to the queried interval and its surrounding context. A Time Interval MLP encodes the interval queries, and a masked transformer processes the concatenated, interval-encoded features to predict the action within each queried interval, with a temporal distance loss to reinforce temporal relations. The approach achieves state-of-the-art (SOTA) recognition on EPIC-KITCHENS-100 and EPIC-SOUNDS, strong detection results with dense multi-scale queries, and notable gains on AVE and the Perception Test, highlighting the value of explicit temporal interval modelling and cross-modal integration. The framework is end-to-end and scalable, which makes it practical for fine-grained, multi-modal action understanding in long videos.

Abstract

Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM
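
The dense multi-scale interval queries mentioned for detection can be pictured as sliding windows of several durations laid over the input. The sketch below is only an illustration under stated assumptions: the function name `multiscale_queries`, the scale values, and the 50% overlap are hypothetical and are not taken from the paper, and times are normalised to [0, 1] of the input window.

```python
from typing import List, Sequence, Tuple

# A query is (start, end, modality); times are fractions of the input window.
Query = Tuple[float, float, str]


def multiscale_queries(
    scales: Sequence[float] = (0.5, 0.25),          # hypothetical interval durations
    overlap: float = 0.5,                           # hypothetical overlap between neighbours
    modalities: Sequence[str] = ("visual", "audio"),
) -> List[Query]:
    """Build sliding-window interval queries at several scales for each modality."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    queries: List[Query] = []
    for modality in modalities:
        for scale in scales:
            if not 0.0 < scale <= 1.0:
                raise ValueError("each scale must be in (0, 1]")
            stride = scale * (1.0 - overlap)
            # Number of windows of this duration that fit inside [0, 1].
            count = int(round((1.0 - scale) / stride)) + 1
            for i in range(count):
                start = i * stride
                queries.append((start, start + scale, modality))
    return queries
```

With the default arguments this yields 3 windows at scale 0.5 and 7 at scale 0.25 per modality, so 20 queries in total. Each query would then be classified independently, and overlapping predictions merged by whatever post-processing the detection pipeline uses.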

Paper Structure (22 sections, 9 equations, 12 figures, 17 tables)


Figures (12)

  • Figure 1: Time Interval Machine (TIM): Top: Given a visual and auditory stream input, the ongoing action in a particular time interval is determined by a query specifying the start and end time of the interval, along with the modality of interest. Bottom: TIM can query for visual (e.g. 'Rinse Sponge') and auditory (e.g. 'Water') action classes, as well as distinguish between overlapping actions within the same modality ('Glass Collision' and 'Scrub / Scrape').
  • Figure 2: Overview of the Time Interval Machine (TIM). The model ingests a sequence of audio and visual features from a video, with each feature time-stamped by the temporal interval it spans, and encoded with its modality. To infer the action occurring over a temporal interval (a visual or audio event) a query is formed specifying the interval and modality of interest.
  • Figure 3: Illustration of the Time Interval MLP $I(\cdot)$. It takes as input a two-dimensional vector holding the start and end times of an interval and produces a single vector, which can be concatenated along the channel dimension to either input features or [CLS] tokens. The figure shows three time interval inputs and three corresponding outputs. Note that in practice, time intervals are ingested simultaneously.
  • Figure 4: Qualitative results for all datasets. PRED: Prediction by TIM, TIQ: Time Interval Queries, V/AGT: Visual/Audio Ground Truth.
  • Figure 5: t-SNE plot for time encodings $I(\cdot)$ on all datasets. In each plot, we use colour maps to indicate encodings of the time interval's duration (left), start time (middle) and end time (right).
  • ...and 7 more figures
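
The Time Interval MLP described in the Figure 3 caption maps a (start, end) pair to a vector that is concatenated along the channel dimension to a feature or [CLS] token. The NumPy sketch below illustrates that data flow only. The class name `TimeIntervalMLP`, the helper `attach_interval_encoding`, the two-layer ReLU design, the layer widths, and the random weights are all assumptions made for illustration; the paper's actual architecture and trained parameters may differ.

```python
import numpy as np


class TimeIntervalMLP:
    """Hypothetical two-layer MLP standing in for I(.): (start, end) -> vector."""

    def __init__(self, out_dim: int = 8, hidden_dim: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random weights for illustration; a real model would learn these.
        self.w1 = rng.normal(0.0, 0.1, size=(2, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, intervals) -> np.ndarray:
        # intervals has shape (N, 2); all N intervals are encoded in one pass.
        intervals = np.asarray(intervals, dtype=np.float64)
        hidden = np.maximum(intervals @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2                        # shape (N, out_dim)


def attach_interval_encoding(tokens: np.ndarray, intervals, mlp: TimeIntervalMLP) -> np.ndarray:
    """Concatenate each token with its interval encoding along the channel axis."""
    return np.concatenate([tokens, mlp(intervals)], axis=-1)


# Three tokens with 32 channels each, and the interval each one spans.
features = np.ones((3, 32))
spans = [[0.0, 0.2], [0.2, 0.4], [0.4, 0.6]]
encoded = attach_interval_encoding(features, spans, TimeIntervalMLP(out_dim=8))
```

Here `encoded` has shape (3, 40): the original 32 channels followed by the 8-dimensional interval encoding. The same helper could be applied to a [CLS] token paired with the queried interval, which is how the caption describes queries being formed.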