Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty

Victor Croisfelt, João Henrique Inacio de Souza, Shashi Raj Pandey, Beatriz Soret, Petar Popovski

TL;DR

This work proposes a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement.

Abstract

Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. Uncertain communication delays across data streams challenge the temporal flow of the inference process. State-of-the-art (SotA) non-blocking inference methods rely on a reference-modality paradigm, requiring one modality input to be fully received before processing, while depending on costly offline profiling. We propose a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement. Our communication-delay-aware framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff. Experiments on the audio-visual event localization (AVEL) task demonstrate superior adaptability to network dynamics compared to SotA approaches.
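
To make the idea of an adaptive temporal window of integration more concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not specify how the window is adapted, so the rule used here (a nearest-rank quantile over recently observed packet delays) is purely an assumption for illustration. The class name `AdaptiveWindow` and its parameters `history`, `quantile`, and `initial` are hypothetical.

```python
import math
from collections import deque


class AdaptiveWindow:
    """Hypothetical adaptive window for one stream (illustrative only).

    Assumption: the window width follows a quantile of recent packet
    delays. The source does not state the actual adaptation rule.
    """

    def __init__(self, history=50, quantile=0.9, initial=0.1):
        self.delays = deque(maxlen=history)  # keeps only the most recent delays
        self.quantile = quantile
        self.initial = initial  # width (seconds) used before any observation

    def observe(self, delay_s):
        """Record the measured delay (in seconds) of a newly arrived packet."""
        self.delays.append(delay_s)

    def width(self):
        """Return the current window width as a nearest-rank quantile."""
        if not self.delays:
            return self.initial
        ordered = sorted(self.delays)
        rank = max(1, math.ceil(self.quantile * len(ordered)))
        return ordered[rank - 1]
```

A larger `quantile` waits for more late packets (favoring accuracy), while a smaller one closes the window earlier (favoring latency), which is one simple way to expose an accuracy-latency knob.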

Paper Structure

This paper contains 12 sections, 5 equations, and 3 figures.

Figures (3)

  • Figure 1: A distributed system streams unimodal auditory and visual data covering overlapping . At the , a wrapper enables non-blocking inference by aligning delayed multimodal packets with a pre-trained token-based - pipeline. Limitations of reference-modality SotA methods [Li2021SpeculativeInference, Wang2023PATCH, Wu2024AdaFlow, Xu2024MLLMInference] are demonstrated via two adverse scenarios.
  • Figure 2: Snapshot of a wrapped pre-trained - at reception time $\tau$. All wrapper operations are synchronized by a -based clock. At $\tau$, unimodal streaming data sources provide modality-specific packets (each containing a subset of input samples) to the , affected by distinct communication delay uncertainties. The wrapper aligns asynchronous packets and converts them into token-based representations derived from the underlying samples for - processing, employing mechanisms to ensure temporal coherence, including optimization. In the figure, uncolored blocks represent missing data (zero-imputed), while colored blocks indicate partial or complete data. Semantic embedding levels are denoted as L.I, L.II, L.III, and L.IV; AB and VB refer to the auditory and visual buffers; AE and VE to the auditory and visual L.I unimodal encoders; and CU to the control unit. For illustration, $d_a = d_v = d_a' = d_v' = d$.
  • Figure 3: Average accuracy over the test set as a function of average end-to-end inference latency. Our neuro-inspired, non-blocking inference wrapper is shown by its two design variants, PaMo and ToMo, which correspond to different strategies for defining the . We compare them to the average minimum end-to-end latency $\bar{T}_{\rm min} = \mathbb{E}[T_{i,{\rm min}}]$ of SotA methods [Li2021SpeculativeInference, Wang2023PATCH, Wu2024AdaFlow, Xu2024MLLMInference], across varying auditory values $\bar{\gamma}_a$ and fixed visual $\bar{\gamma}_v = 0\,\mathrm{dB}$. We denote $\bar{T}_s = \mathbb{E}[T_{i,s}]$ for $s \in \{a,v\}$ as the average total transmission time per modality. Our methods enable explicit accuracy-latency trade-offs, which SotA approaches cannot provide. The reference lines indicate a 5% accuracy drop margin w.r.t. SotA.
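
The snapshot behavior described in the Figure 2 caption (packets that have arrived by reception time $\tau$ are used, missing data is zero-imputed) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's wrapper: the function name `snapshot`, the `(slot, arrival_time, samples)` packet layout, and the fixed `n_slots`/`dim` buffer shape are all hypothetical, and the encoders, token conversion, and control unit are omitted.

```python
def snapshot(packets, tau, n_slots, dim):
    """Build a zero-imputed buffer for one modality at reception time tau.

    packets: iterable of (slot, arrival_time, samples) tuples, where
             samples is a sequence of length dim (hypothetical layout).
    Returns (buffer, received): the buffer holds the samples of packets
    that arrived by tau and zeros elsewhere; received flags filled slots.
    """
    buffer = [[0.0] * dim for _ in range(n_slots)]  # zero imputation by default
    received = [False] * n_slots
    for slot, arrival_time, samples in packets:
        # Non-blocking: late packets are simply skipped, never waited for.
        if arrival_time <= tau and 0 <= slot < n_slots:
            buffer[slot] = list(samples)
            received[slot] = True
    return buffer, received


# Example: at tau = 0.2 only the first auditory packet has arrived.
audio_packets = [(0, 0.1, [1.0, 2.0]), (1, 0.5, [3.0, 4.0])]
audio_buffer, audio_mask = snapshot(audio_packets, tau=0.2, n_slots=3, dim=2)
```

Returning the `received` mask alongside the buffer lets a downstream model distinguish genuinely silent input from imputed zeros, which is a design choice of this sketch rather than something the caption specifies.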