TorR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-design

Hyunwoo Oh; SungHeon Jeong; Suyeon Jang; Hanning Chen; Sanggeon Yun; Tamoghno Das; Mohsen Imani

Abstract

Task-oriented object detection (TOOD) atop CLIP offers open-vocabulary, prompt-driven semantics, yet dense per-window computation and heavy memory traffic hinder real-time, power-limited edge deployment. We present \emph{TorR}, a brain-inspired \textbf{algorithm--architecture co-design} that \textbf{replaces CLIP-style dense alignment with a hyperdimensional (HDC) associative reasoner} and turns temporal coherence into reuse. On the \emph{algorithm} side, TorR reformulates alignment as HDC similarity and graph composition, introducing \emph{partial-similarity reuse} via (i) query caching with per-class score accumulation, (ii) exact $\delta$-updates when only a small set of hypervector bits changes, and (iii) similarity/load-gated bypass under high system load. On the \emph{architecture} side, TorR instantiates a lane-scalable, bit-sliced item memory with bank/precision gating and a lightweight controller that schedules bypass/$\delta$/full paths to meet RT-30/RT-60 targets as object counts vary. Synthesized in a TSMC 28\,nm process and exercised with a cycle-accurate simulator, TorR sustains real-time throughput with millijoule-scale energy per window ($\approx$50\,mJ at 60\,FPS; $\approx$113\,mJ at 30\,FPS) and low latency jitter. It delivers competitive AP@0.5 across five task prompts (mean 44.27\%), within a bounded margin of strong VLM baselines, at orders-of-magnitude lower energy. The design exposes deployment-time configurability (effective dimension $D'$, thresholds, precision) to trade accuracy, latency, and energy for edge budgets.
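The exact $\delta$-update named in item (ii) can be illustrated with a minimal Python sketch. It assumes bipolar ($\pm 1$) hypervectors and cached dot products against each item-memory row; the abstract does not specify the encoding, and the names `dot`, `delta_update`, `items`, `q_prev`, and `q_new` are hypothetical, chosen only for illustration. Under that assumption, flipping a query bit changes each dot product by twice the new contribution at that index, so only the flipped indices need to be read.

```python
import random


def dot(a, b):
    """Full dot product of two equal-length bipolar (+1/-1) vectors."""
    return sum(x * y for x, y in zip(a, b))


def delta_update(cached_dots, items, q_prev, q_new):
    """Refresh cached dot products by touching only the flipped indices.

    Assumption: every element is +1 or -1, so a flipped index i satisfies
    q_new[i] == -q_prev[i] and shifts each dot product by 2*q_new[i]*h[i].
    """
    flipped = [i for i, (a, b) in enumerate(zip(q_prev, q_new)) if a != b]
    updated = [
        d + 2 * sum(q_new[i] * h[i] for i in flipped)
        for d, h in zip(cached_dots, items)
    ]
    return updated, flipped


# Illustrative sizes only; these are not values taken from the paper.
rng = random.Random(0)
D, M = 256, 5
items = [[rng.choice((-1, 1)) for _ in range(D)] for _ in range(M)]
q_prev = [rng.choice((-1, 1)) for _ in range(D)]
q_new = list(q_prev)
for i in rng.sample(range(D), 8):  # flip a small set of positions
    q_new[i] = -q_new[i]

cached = [dot(q_prev, h) for h in items]
updated, flipped = delta_update(cached, items, q_prev, q_new)

# The sparse update reproduces the full recomputation exactly.
assert updated == [dot(q_new, h) for h in items]
```

In this sketch the update cost scales with the number of flipped positions rather than with the full dimension, which is the sense in which the update is both sparse and exact.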

Paper Structure (28 sections, 8 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Background and Motivation
Task-Oriented Detection on VLMs
Event-Driven Perception: DVS & SNNs
Hyperdimensional Computing for Alignment/Reasoning
Motivation
Algorithm-Architecture Co-Design
Co-Design Overview
Algorithmic Design
Hardware Architecture
Top-level overview
Shared similarity micro-kernel
Associative cosine aligner
Partial-similarity unit
Reasoner and cache gating
...and 13 more sections

Figures (6)

Figure 1: From CLIP/ViT to TorR. Top: dense token aligner with fixed per-frame cost. Bottom: an event-driven encoder paired with an HDC associative aligner and a lightweight reasoner. Query caching turns temporal coherence into reuse so TorR updates on change.
Figure 2: Bottleneck shift from CLIP/ViT to TorR. Left: TaskCLIP is dominated by the ViT backbone. Right: after ViT$\rightarrow$event encoder, cost shifts to HDC associative search + graph reasoning, which are memory-bound.
Figure 3: TorR overview. Images are used only to transfer CLIP semantics to events. At run time, partial-similarity reuse ($\delta$-updates, caches) and FPS/QoS control align cost with scene dynamics.
Figure 4: Cache-gated HDC reasoner with $\delta$-alignment. A query cache (depth $K$) supplies the nearest prior query $\mathbf{q}^{(t-1)}$. Partial similarity $\rho=\cos(\mathbf{q}^{(t)},\mathbf{q}^{(t-1)})=1-\tfrac{2|\Delta|}{D'}$ selects $\delta$-update (update only flipped indices $\Delta$) or full. Under high load, if $\rho\!\ge\!\tau_{\mathrm{byp}}\!\wedge\!H(N,q)$, cached scores/outputs are reused. Otherwise the aligner computes $s_j=\cos(\mathbf{q},\mathbf{h}_j)$ and the reasoner (for fixed task) applies cached weights $\tilde{w}_j=\cos(\mathbf{g}_P,\mathbf{h}_j)$ with $\mathbf{g}_P=\mathbf{t}\odot r_{\ell_1}\odot\cdots\odot r_{\ell_k}$, producing $\hat{s}_j=s_j\tilde{w}_j$. The FPS/QoS controller gates $D'$ (bank gating).
Figure 5: Similarity-gated top-level accelerator. PSU computes inter-query similarity $\rho$. The FPS/QoS controller selects bypass/$\delta$/full, gates $D'$ (bank enables) and precision, and programs the associative aligner. In $\delta$-mode the aligner uses a $\Delta$-index FIFO for sparse reads from the banked item memory $M\times D$; scores are accumulated, top-$k$ pooled, optionally reasoned (HDC), cached, and returned via host/DMA. Solid arrows = data, dashed = control.
...and 1 more figure
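The gating described in the Figure 4 and Figure 5 captions can be sketched in Python as follows. The identity $\rho = 1 - 2|\Delta|/D'$ comes from the Figure 4 caption; everything else is an assumption made for illustration: bipolar ($\pm 1$) hypervectors, a boolean `high_load` standing in for the load predicate $H(N,q)$, a second threshold `tau_delta` separating $\delta$-update from full recomputation (the captions name only $\tau_{\mathrm{byp}}$), and the numeric threshold defaults.

```python
def partial_similarity(q_prev, q_new):
    """rho = 1 - 2*|Delta|/D', where Delta is the set of flipped indices."""
    n_flipped = sum(a != b for a, b in zip(q_prev, q_new))
    return 1.0 - 2.0 * n_flipped / len(q_new)


def select_path(rho, high_load, tau_byp=0.95, tau_delta=0.5):
    """Pick the execution path for the current query.

    tau_byp, tau_delta, and high_load are hypothetical stand-ins; the
    defaults are placeholders, not values reported in the paper.
    """
    if high_load and rho >= tau_byp:
        return "bypass"  # reuse cached scores/outputs
    if rho >= tau_delta:
        return "delta"   # update only the flipped indices
    return "full"        # recompute every similarity


# Two flipped positions out of 100 give rho = 1 - 4/100.
q_prev = [1] * 100
q_new = [-1, -1] + [1] * 98
rho = partial_similarity(q_prev, q_new)

# For bipolar vectors the identity matches the cosine, i.e. dot / D'.
cosine = sum(a * b for a, b in zip(q_prev, q_new)) / len(q_new)
assert abs(rho - cosine) < 1e-12
```

With these placeholder thresholds, `select_path(rho, high_load=True)` takes the bypass path and `select_path(rho, high_load=False)` takes the $\delta$ path, while a query with low similarity to the cached one falls through to full recomputation.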
