Tracking without Seeing: Geospatial Inference using Encrypted Traffic from Distributed Nodes

Sadik Yagiz Yetim, Gaofeng Dong, Isaac-Neil Zanoria, Ronit Barman, Maggie Wigness, Tarek Abdelzaher, Mani Srivastava, Suhas Diggavi

Abstract

Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video-transmission traffic, such as packet sizes, from cameras whose streams are inaccessible. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object's position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves a tracking error of 2.33 m (Euclidean distance) without raw signal access, within the dimensions of the tracked objects (4.61 m × 1.93 m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
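
To make the two-stage pipeline concrete, below is a minimal, hypothetical sketch of the idea, not the authors' implementation: the function names, the 5 ms inter-arrival gap threshold, and the toy tracker stand-in are all illustrative assumptions.

```python
# Illustrative sketch only; names, thresholds, and the tracker stand-in
# are assumptions, not the paper's code.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Packet:
    timestamp: float  # arrival time in seconds
    size: int         # encrypted payload size in bytes

def group_packets_into_frames(packets: List[Packet],
                              gap_threshold: float = 0.005) -> List[int]:
    """Stage 1 (Packet Grouping, sketched): packets belonging to one video
    frame tend to arrive in a burst, so a longer inter-arrival gap suggests
    a frame boundary. Returns the estimated size of each frame in bytes."""
    frame_sizes: List[int] = []
    current, last_t = 0, None
    for p in sorted(packets, key=lambda p: p.timestamp):
        if last_t is not None and p.timestamp - last_t > gap_threshold:
            frame_sizes.append(current)  # close the previous frame
            current = 0
        current += p.size
        last_t = p.timestamp
    if current:
        frame_sizes.append(current)
    return frame_sizes

def track_from_frame_sizes(frame_sizes: List[int]) -> List[Tuple[float, float]]:
    """Stage 2 (Tracker, sketched): the frame-size sequence is the indirect
    input; per the abstract, the real tracker is a Transformer encoder with
    a recurrent state that may also fuse direct camera features. Here we
    only emit a dummy (x, y) per frame to show the interface."""
    return [(0.0, 0.0) for _ in frame_sizes]

if __name__ == "__main__":
    pkts = [Packet(0.000, 1400), Packet(0.001, 1400), Packet(0.002, 600),
            Packet(0.033, 1400), Packet(0.034, 900)]
    print(group_packets_into_frames(pkts))  # -> [3400, 2300]
```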



Figures (7)

  • Figure 1: Application scenario of GraySense. The scene illustrates a distributed sensing environment consisting of accessible cameras (blue nodes) and inaccessible cameras (gray nodes) whose video streams are encrypted. While blue nodes provide direct visual input, gray nodes contribute only encrypted network traffic, which indirectly reflects scene dynamics through variations in packet sizes.
  • Figure 2: Group of Pictures (GOP) in H.264. Each group starts with an I-frame, which is encoded independently; P- and B-frames are encoded from their differences relative to reference I- or P-frames. Because of this differential encoding, packet-size variations reflect the total amount of change in the scene.
  • Figure 3: System overview of GraySense. Frame-size information extracted by Packet Grouping, along with the extracted image information, is fed to the Tracker module, which estimates the object's visibility to the sensors and its position when it is visible.
  • Figure 4: The Genesis setup used for geometric analysis. Four identical cameras face the sphere. The blue polygon shows the area in which the sphere moves, a region always visible to all cameras. In each experiment, the sphere moves at a random but constant velocity between randomly sampled initial and final points within this area.
  • Figure 5: Geometric problem. The bold lines denote rays emanating from the camera center; those tangent to the sphere intersect the image plane along an ellipse, forming the sphere's silhouette (see the derivation sketched after this list).
  • ...and 2 more figures
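
The silhouette geometry referenced in Figure 5 admits a short standard derivation. The notation below is ours, and the setup (camera center at the origin, image plane z = f) is an assumption consistent with the caption, not taken from the paper.

```latex
% Tangent cone to a sphere and its image conic (sketch; notation ours).
% Camera center at the origin; sphere center c with ||c|| = d, radius r < d.
\[
  \text{the ray } \lambda\mathbf{u} \text{ is tangent to the sphere}
  \iff (\mathbf{c}^{\top}\mathbf{u})^{2} = (d^{2}-r^{2})\,\mathbf{u}^{\top}\mathbf{u}
  \iff \mathbf{u}^{\top}\!\left(\mathbf{c}\mathbf{c}^{\top}-(d^{2}-r^{2})\,I\right)\mathbf{u} = 0 .
\]
% The tangent rays thus form a quadric cone of half-angle
% \alpha = \arcsin(r/d) about the axis c. Substituting u = (x, y, f)
% for the image plane z = f yields a conic in (x, y): an ellipse whenever
% the sphere lies entirely in front of the camera.
```

Under this model, the projected ellipse's size and eccentricity encode the sphere's depth and off-axis angle, which is what makes the silhouette informative for position estimation.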