End-to-End Throughput Benchmarking of Portable Deterministic CNN-Based Signal Processing Pipelines

Christiaan Boerkamp, Akhil John Thomas

TL;DR

The paper tackles hardware fragmentation in real-time DSP workloads by introducing deterministic CNN-based signal-processing pipelines that are training-free and expressible entirely with CNN primitives. It presents an end-to-end benchmarking methodology that measures sustained throughput, frame rate, incremental energy, and memory for complete RF-to-image ultrasound pipelines, running unmodified code across GPUs and TPUs. The study compares three modalities (B-mode, Color Doppler, Power Doppler) and three implementation variants (dynamic indexing, fully CNN-expressed, sparse matrices). Dynamic indexing achieves high GPU throughput but is less portable to TPU, while the fully CNN-expressed formulations achieve strong portability and high TPU throughput (exceeding 500 MB/s and up to ~104 FPS). The results support the viability of portable, certifiable deterministic DSP pipelines on heterogeneous AI accelerators and establish a deployment-oriented benchmarking framework to guide future optimization and extension to new hardware.

Abstract

This paper presents a benchmarking methodology for evaluating end-to-end performance of deterministic signal-processing pipelines expressed using CNN-compatible primitives. The benchmark targets phased-array workloads such as ultrasound imaging and evaluates complete RF-to-image pipelines under realistic execution conditions. Performance is reported using sustained input throughput (MB/s), effective frame rate (FPS), and, where available, incremental energy per run and peak memory usage. Using this methodology, we benchmark a single deterministic, training-free CNN-based signal-processing pipeline executed unmodified across heterogeneous accelerator platforms, including an NVIDIA RTX 5090 GPU and a Google TPU v5e-1. The results demonstrate how different operator formulations (dynamic indexing, fully CNN-expressed, and sparse-matrix-based) impact performance and portability across architectures. This work is motivated by the need for portable, certifiable signal-processing implementations that avoid hardware-specific refactoring while retaining high performance on modern AI accelerators.
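
As a rough illustration of the two headline metrics, the sketch below times repeated end-to-end runs and derives sustained input throughput (MB/s) and effective frame rate (FPS). It is a minimal sketch under stated assumptions, not the authors' benchmark harness: the names `benchmark`, `summarize`, `run_pipeline`, `warmup`, and `repeats` are hypothetical, and 1 MB is assumed to mean 10^6 bytes.

```python
import time


def summarize(n_frames, bytes_per_frame, elapsed_s):
    """Convert a frame count and wall-clock time into throughput and FPS.

    Assumption: 1 MB = 1e6 bytes (the source does not state the convention).
    """
    total_bytes = n_frames * bytes_per_frame
    return {
        "throughput_mb_s": total_bytes / 1e6 / elapsed_s,
        "fps": n_frames / elapsed_s,
    }


def benchmark(run_pipeline, frames, bytes_per_frame, warmup=2, repeats=5):
    """Time a hypothetical RF-to-image callable over several passes.

    `run_pipeline` stands in for the complete pipeline. On accelerators with
    asynchronous dispatch, it must block until the output is ready (in JAX,
    for example, by calling `.block_until_ready()` on the result), otherwise
    the timer only measures dispatch time.
    """
    # Warm-up passes keep one-time costs such as compilation out of the timing.
    for _ in range(warmup):
        for frame in frames:
            run_pipeline(frame)

    start = time.perf_counter()
    for _ in range(repeats):
        for frame in frames:
            run_pipeline(frame)
    elapsed_s = time.perf_counter() - start

    return summarize(len(frames) * repeats, bytes_per_frame, elapsed_s)
```

For example, 100 frames of 5,000,000 bytes each processed in one second correspond to 500 MB/s and 100 FPS under this convention. Energy and peak-memory measurements depend on platform-specific tooling and are not covered by this sketch.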

Paper Structure (20 sections, 3 equations, 3 tables)