ISCS: Parameter-Guided Feature Pruning for Resource-Constrained Embodied Perception

Jinhao Wang, Nam Ling, Wei Wang, Wei Jiang

TL;DR

This work tackles the latency-accuracy trade-off of perception in resource-constrained embodied agents by introducing the Invariant Salient Channel Space (ISCS), a dataset-agnostic scaffold that identifies structure-critical Salient-Core (SC) channels and their correlated detail-carrying Salient-Auxiliary (SA) channels directly from pretrained encoder weights. It then performs entropy-free static pruning: SC channels are retained and costly entropy models are bypassed, yielding ultra-low latency through a deterministic split-computing protocol. Empirical results on COCO and UCF101 show near-original performance even at only 25% channel retention, with encoding time reduced to ~0.45 s and edge-side decoding latency reduced to ~3.4 ms, supporting a favorable rate-latency-accuracy trade-off. The approach offers a practical pathway to real-time, human-aware embodied perception in edge-robot collaboration, with potential for adaptive extension under varying channel conditions or tasks.

Abstract

Prior studies in embodied AI consistently show that robust perception is critical for human-robot interaction, yet deploying high-fidelity visual models on resource-constrained agents remains challenging due to limited on-device computation power and transmission latency. Exploiting the redundancy in latent representations could improve system efficiency, yet existing approaches often rely on costly dataset-specific ablation tests or heavy entropy models unsuitable for real-time edge-robot collaboration. We propose a generalizable, dataset-agnostic method to identify and selectively transmit structure-critical channels in pretrained encoders. Instead of brute-force empirical evaluations, our approach leverages intrinsic parameter statistics (weight variances and biases) to estimate channel importance. This analysis reveals a consistent organizational structure, termed the Invariant Salient Channel Space (ISCS), where Salient-Core channels capture dominant structures while Salient-Auxiliary channels encode fine visual details. Building on ISCS, we introduce a deterministic static pruning strategy that enables lightweight split-computing. Experiments across different datasets demonstrate that our method achieves a deterministic, ultra-low latency pipeline by bypassing heavy entropy modeling. Our method reduces end-to-end latency, providing a critical speed-accuracy trade-off for resource-constrained human-aware embodied systems.
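
To make the idea of parameter-guided importance concrete, the following is a minimal sketch of how per-channel statistics could be read off a pretrained convolution layer without touching any data. It is not the authors' scoring rule: the function names, the choice of ranking by weight variance alone, and the assumption that the layer is a standard convolution with weights shaped (out_channels, in_channels, kH, kW) are all illustrative assumptions.

```python
import numpy as np

def channel_parameter_stats(weight, bias):
    """Per-output-channel weight variance and absolute bias.

    weight: array of shape (out_channels, in_channels, kH, kW)
    bias:   array of shape (out_channels,)
    """
    flat = weight.reshape(weight.shape[0], -1)  # one row per output channel
    return flat.var(axis=1), np.abs(bias)

def rank_channels(weight, bias, keep_ratio=0.25):
    """Hypothetical ranking: keep the channels with the largest weight variance.

    The paper combines weight variances and biases; the exact rule is not
    given in this summary, so variance-only ranking is an assumption here.
    The absolute biases are returned so a caller can inspect them as well.
    """
    w_var, b_abs = channel_parameter_stats(weight, bias)
    n_keep = max(1, int(round(keep_ratio * w_var.size)))
    order = np.argsort(-w_var, kind="stable")   # descending variance
    kept = np.sort(order[:n_keep])              # sorted channel indices
    return kept, w_var, b_abs
```

Because the ranking depends only on the stored parameters, it can be computed once offline and reused for any input, which is what makes a static, data-independent channel mask possible.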

Paper Structure

This paper contains 26 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Reconstruction loss from single-channel removal. The y-axis shows per-channel BPP estimated by the model’s probability model. Most channels (blue dots) exhibit a strong logarithmic correlation between per-channel BPP and reconstruction quality degradation. However, a small set of outlier channels (red dots) have low BPP yet cause disproportionately large loss when removed.
  • Figure 2: Overview of the proposed Entropy-free Split-Computing framework. The resource-constrained Robot (Client) encodes the visual input and performs ISCS-based Static Pruning, physically discarding machine-redundant SA channels while retaining structure-critical SC channels. To minimize latency, the selected features are transmitted via a lightweight raw packing protocol, bypassing the computationally expensive entropy model. The Edge Server receives the raw bitstream and reconstructs the sparse latent representation $\hat{y}$ via zero-padding for downstream perception tasks.
  • Figure 3: Visual examples of per-channel latent activations for different types of channels in ISCS. The SC channels clearly capture prominent structural patterns. Their corresponding SA channels provide complementary cues, such as fine visual details. In contrast, bias-dominated channels exhibit near-constant activations across different image inputs.
  • Figure 4: Qualitative robustness comparison. Our ISCS method preserves critical structural semantics, enabling accurate human detection even at high pruning ratios. In contrast, the Random baseline suffers from catastrophic feature collapse.
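
The split-computing flow described in the Figure 2 caption (static pruning on the client, raw packing, zero-padding on the edge server) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's protocol: the function names are hypothetical, the int8 rounding quantization is an assumed stand-in for whatever packing the authors use, and the kept-channel indices are assumed to be agreed on by both sides in advance because the mask is static.

```python
import numpy as np

def prune_and_pack(y, kept):
    """Client side: keep only the selected channels and serialize them raw.

    y:    latent tensor of shape (channels, height, width)
    kept: indices of the retained (SC) channels, known to both sides
    """
    # Assumption: round and clip to int8; no entropy coding is applied.
    q = np.clip(np.rint(y[kept]), -128, 127).astype(np.int8)
    return q.tobytes(), q.shape

def unpack_and_zero_pad(payload, packed_shape, kept, num_channels):
    """Edge side: rebuild the sparse latent, filling pruned channels with zeros."""
    q = np.frombuffer(payload, dtype=np.int8).reshape(packed_shape)
    y_hat = np.zeros((num_channels,) + tuple(packed_shape[1:]), dtype=np.float32)
    y_hat[kept] = q  # pruned channels stay at zero
    return y_hat
```

With a fixed mask, the payload size is a known constant per frame (retained channels × height × width bytes in this sketch), which is one way to read the caption's emphasis on a deterministic pipeline.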