Timing and Memory Telemetry on GPUs for AI Governance
Saleh K. Monfared, Fatemeh Ganji, Dan Holcomb, Shahin Tajik
TL;DR
The paper addresses the challenge of observing GPU utilization after deployment, in potentially untrusted environments, to support AI governance. It introduces a compute-based telemetry framework comprising four primitives (PoW-style parallel work, Verifiable Delay Functions for sequential load, GEMM-based tensor-core puzzles, and VRAM-residency tests) that rely on architectural signals rather than trusted hardware. Through implementation and experiments on modern GPUs, it demonstrates that timing distributions and memory-residency patterns correlate with real activity and can reveal co-residency, resource contention, and suspicious off-chip outsourcing, while allowing privacy-preserving, low-overhead signaling in certain modes. The work provides a foundation for governance-oriented accountability mechanisms that complement traditional attestation, enabling inference-driven oversight across multi-tenant AI deployments.
Abstract
The rapid expansion of GPU-accelerated computing has enabled major advances in large-scale artificial intelligence (AI), while heightening concerns about how accelerators are observed or governed once deployed. Governance is essential to ensure that large-scale compute infrastructure is not silently repurposed to train models, circumvent usage policies, or operate outside legal oversight. Because current GPUs expose limited trusted telemetry and can be modified or virtualized by adversaries, we explore whether compute-based measurements can provide actionable signals of utilization when both host and device are untrusted. We introduce a measurement framework that leverages architectural characteristics of modern GPUs to generate timing- and memory-based observables that correlate with compute activity. Our design draws on four complementary primitives: (1) a probabilistic, workload-driven mechanism inspired by Proof-of-Work (PoW) to expose parallel effort, (2) sequential, latency-sensitive workloads derived from Verifiable Delay Functions (VDFs) to characterize scalar execution pressure, (3) General Matrix Multiplication (GEMM)-based tensor-core measurements that reflect dense linear-algebra throughput, and (4) a VRAM-residency test that uses bandwidth-dependent hashing to distinguish on-device memory locality from off-chip access. These primitives provide statistical and behavioral indicators of GPU engagement that remain observable even without trusted firmware, enclaves, or vendor-controlled counters. We evaluate their responses to contention, architectural alignment, memory pressure, and power overhead, showing that timing shifts and residency latencies reveal meaningful utilization patterns. Our results illustrate why compute-based telemetry can complement future accountability mechanisms by exposing architectural signals relevant to post-deployment GPU governance.
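To make the first two primitives more concrete, the following is a minimal CPU-only sketch of a PoW-style search probe and a sequential timing probe. It is an illustration under stated assumptions, not the paper's GPU implementation: the function names (`pow_probe`, `sequential_probe`), the use of SHA-256, and the difficulty and step parameters are all hypothetical choices. The sequential probe is a plain iterated hash chain standing in for a VDF; unlike a true VDF, it offers no efficient verification.

```python
import hashlib
import time


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def pow_probe(challenge: bytes, difficulty_bits: int):
    """PoW-style probe (hypothetical): search for a nonce whose hash has
    at least `difficulty_bits` leading zero bits, and time the search.
    The search is probabilistic, so elapsed time varies between challenges."""
    start = time.perf_counter()
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce, time.perf_counter() - start
        nonce += 1


def sequential_probe(seed: bytes, steps: int):
    """Sequential probe (assumption: a hash chain as a stand-in for a VDF).
    Each step depends on the previous one, so the work cannot be parallelized."""
    start = time.perf_counter()
    state = seed
    for _ in range(steps):
        state = hashlib.sha256(state).digest()
    return state, time.perf_counter() - start


if __name__ == "__main__":
    nonce, t_pow = pow_probe(b"challenge-0", 12)
    _, t_seq = sequential_probe(b"seed-0", 100_000)
    print(f"PoW nonce={nonce}, search time={t_pow:.4f}s")
    print(f"sequential chain time={t_seq:.4f}s")
```

Repeating either probe many times yields a timing distribution; under the abstract's premise, shifts in that distribution when other work shares the device are the kind of signal the framework looks for.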
