Out-of-Sight Embodied Agents: Multimodal Tracking, Sensor Fusion, and Trajectory Forecasting

Haichao Zhang, Yi Xu, Yun Fu

Abstract

Trajectory prediction is a fundamental problem in computer vision, vision-language-action models, world models, and autonomous systems, with broad impact on autonomous driving, robotics, and surveillance. However, most existing methods assume complete and clean observations, and therefore do not adequately handle out-of-sight agents or noisy sensing signals caused by limited camera coverage, occlusions, and the absence of ground-truth denoised trajectories. These challenges raise safety concerns and reduce robustness in real-world deployment. In this extended study, we introduce major improvements to Out-of-Sight Trajectory (OST), a task for predicting noise-free visual trajectories of out-of-sight objects from noisy sensor observations. Building on our prior work, we expand Out-of-Sight Trajectory Prediction (OOSTraj) from pedestrians to both pedestrians and vehicles, increasing its relevance to autonomous driving, robotics, and surveillance. Our improved Vision-Positioning Denoising Module exploits camera calibration to establish vision-position correspondence, mitigating the lack of direct visual cues and enabling effective unsupervised denoising of noisy sensor signals. Extensive experiments on the Vi-Fi and JRDB datasets show that our method achieves state-of-the-art results for both trajectory denoising and trajectory prediction, with clear gains over prior baselines. We also compare with classical denoising methods, including Kalman filtering, and adapt recent trajectory prediction models to this setting, establishing a stronger benchmark. To the best of our knowledge, this is the first work to use vision-positioning projection to denoise noisy sensor trajectories of out-of-sight agents, opening new directions for future research.

Paper Structure

This paper contains 39 sections, 15 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 2: An illustrative example of real-world out-of-sight settings, using autonomous driving as a representative scenario. The autonomous vehicle is equipped with a camera (recording accurate visual trajectories, shown as green arrows) and a mobile signal receiver (collecting noisy sensor trajectories from IoT devices such as mobile phones, smartwatches, smart rings, or AirTags on pedestrians, as well as onboard computers or communication devices on vehicles, shown as red arrows) to monitor pedestrians and nearby vehicles. Pedestrians $A_1$ and $A_2$ are visible within the camera field of view, while $A_3$ is fully out of sight and $A_4$ is occluded by other vehicles. In addition, a bus ($A_5$) is visible in the camera view, whereas a truck ($A_6$) is moving into the vehicle's path but remains unseen because it is out of sight. Consequently, $A_3$ and $A_4$ do not have visual trajectory observations, creating substantial collision risks. The black dotted arrows denote hypothesized noise-free ground-truth trajectories that mobile sensors would ideally capture, in contrast to the observed noisy sensor trajectories (red arrows). The gray-shaded region indicates the visibility coverage of the mobile and visual modalities: white denotes no data captured, orange denotes the presence of visual trajectories, and blue denotes mobile trajectory availability.
  • Figure 3: Overview of the Vision-Positioning Denoising and Prediction Model architecture. The figure illustrates the processing pipeline for agent trajectories, where pedestrian agent $A_1$ and autonomous robot agent $A_4$ are outside the camera field of view and can only be detected through sensor signals received by mobile receivers, while pedestrian agent $A_3$ and vehicle agent $A_2$ are observed by both the camera and sensor modality. The Mapping Parameters Estimator Module uses dual-modality trajectories of visible agents (e.g., $A_2$ and $A_3$) to learn a mapping matrix embedding, which is estimated for each frame separately to accommodate ego-system motion. For out-of-sight agents (e.g., $A_1$ and $A_4$), noisy mobile trajectories are refined by the Sensor Denoising Encoder, producing a denoised signal embedding. This embedding is then fused with the mapping matrix embedding in the Visual Positioning Projection Module, allowing projection into camera coordinates. The transformation is optimized with $\mathcal{L}_\text{Denoise}$. Finally, the Out-of-Sight Prediction Decoder takes the denoised visual signals and predicts future trajectories for agents outside the camera view, addressing the out-of-sight trajectory prediction task.
  • Figure 4: Per-frame center-point $\ell_2$ error on representative test sequences. We plot $\|\hat{\mathbf{c}}_t-\mathbf{c}_t\|_2$ over time for ground truth (GT) and different predictors. Left: Ours vs. ViTag. Right: Ours vs. Vanilla Transformer. The inset reports ADE and FDE computed on valid frames after robust filtering.
  • ...and 5 more figures (captions not recovered; two are subfigure panels labeled (a))
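
The per-frame center-point error and the ADE/FDE summary mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes 2D center points and a boolean validity mask standing in for the "robust filtering" step, whose exact procedure is not given in this excerpt. The function names `per_frame_l2` and `ade_fde` are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def per_frame_l2(
    pred: Sequence[Point],
    gt: Sequence[Point],
    valid: Optional[Sequence[bool]] = None,
) -> List[Optional[float]]:
    """Return ||c_hat_t - c_t||_2 for each frame t, or None where the frame is invalid."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    if valid is None:
        # Assumption: with no mask supplied, every frame counts as valid.
        valid = [True] * len(gt)
    if len(valid) != len(gt):
        raise ValueError("valid must have one entry per frame")
    errors: List[Optional[float]] = []
    for (px, py), (gx, gy), ok in zip(pred, gt, valid):
        errors.append(math.hypot(px - gx, py - gy) if ok else None)
    return errors


def ade_fde(
    pred: Sequence[Point],
    gt: Sequence[Point],
    valid: Optional[Sequence[bool]] = None,
) -> Tuple[float, float]:
    """ADE is the mean error over valid frames; FDE is the error at the last valid frame."""
    errs = [e for e in per_frame_l2(pred, gt, valid) if e is not None]
    if not errs:
        raise ValueError("no valid frames to evaluate")
    return sum(errs) / len(errs), errs[-1]


# Toy trajectory: the prediction drifts away from a stationary ground truth.
pred = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]

print(per_frame_l2(pred, gt))               # [0.0, 5.0, 10.0]
print(ade_fde(pred, gt))                    # (5.0, 10.0)
print(ade_fde(pred, gt, [True, True, False]))  # (2.5, 5.0)
```

Taking FDE at the last valid frame, rather than the last frame overall, is one possible convention when frames are filtered out; the paper may define it differently.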