FT-NCFM: An Influence-Aware Data Distillation Framework for Efficient VLA Models

Kewei Chen, Yayu Long, Shuai Li, Mingsheng Shang

TL;DR

FT-NCFM introduces a data-centric generative distillation framework for Vision-Language-Action models, tackling data redundancy and inefficiency by synthesizing a high-value coreset. It deploys a two-stage FT Influence Assessment Engine (causal attribution via Influence Functions and contrastive verification with programmatic counterexamples) to assign sample weights, which guide an influence-weighted Neural Characteristic Function Matching (NCFM) distillation to produce the synthetic data. Across CALVIN, Meta-World, and LIBERO, training on a synthetic coreset only 5–10% the size of the original data yields 85–95% of full-data performance with substantial training-time reductions (often >80%), outperforming policy distillation and traditional coreset selection. This work demonstrates that data-level efficiency optimization can be a practical and powerful alternative to model-centric approaches for efficient, high-performance VLA systems, while noting limitations around perturbation coverage and simulator-based counterexamples that warrant future work.
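
The idea of influence-weighted characteristic function matching can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes low-dimensional feature vectors, replaces the learned frequency sampler with fixed Gaussian frequencies, and uses a plain squared-modulus gap as the discrepancy. The names `weighted_ecf` and `cf_discrepancy` are hypothetical.

```python
import cmath
import random


def weighted_ecf(features, weights, t):
    """Weighted empirical characteristic function evaluated at frequency vector t."""
    total = sum(weights)
    acc = 0j
    for x, w in zip(features, weights):
        phase = sum(ti * xi for ti, xi in zip(t, x))
        acc += w * cmath.exp(1j * phase)
    return acc / total


def cf_discrepancy(real, real_weights, synthetic, freqs):
    """Mean squared modulus of the gap between the weighted real CF and the
    uniformly weighted synthetic CF over the given frequencies."""
    syn_weights = [1.0] * len(synthetic)
    gaps = [
        abs(weighted_ecf(real, real_weights, t)
            - weighted_ecf(synthetic, syn_weights, t)) ** 2
        for t in freqs
    ]
    return sum(gaps) / len(gaps)


# Assumption: frequencies are drawn from a standard Gaussian instead of a learned network.
rng = random.Random(0)
freqs = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(64)]

# Toy real features: two high-influence samples near the origin, one low-influence outlier.
real = [[0.0, 0.0], [0.1, -0.1], [3.0, 3.0]]
weights = [1.0, 1.0, 0.05]

d_near = cf_discrepancy(real, weights, [[0.05, -0.05]], freqs)  # synthetic point near the high-influence samples
d_far = cf_discrepancy(real, weights, [[3.0, 3.0]], freqs)      # synthetic point on the low-influence outlier
print(d_near < d_far)
```

In this toy setting, a synthetic point placed near the high-influence samples gives a smaller discrepancy than one placed on the down-weighted outlier, which is the behavior an influence-weighted objective is meant to encourage in a generator.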

Abstract

The powerful generalization of Vision-Language-Action (VLA) models is bottlenecked by their heavy reliance on massive, redundant, and unevenly valued datasets, hindering their widespread application. Existing model-centric optimization paths, such as model compression (which often leads to performance degradation) or policy distillation (whose products are model-dependent and lack generality), fail to fundamentally address this data-level challenge. To this end, this paper introduces FT-NCFM, a fundamentally different, data-centric generative data distillation framework. Our framework employs a self-contained Fact-Tracing (FT) engine that combines causal attribution with programmatic contrastive verification to assess the intrinsic value of samples. Guided by these assessments, an adversarial NCFM process synthesizes a model-agnostic, information-dense, and reusable data asset. Experimental results on several mainstream VLA benchmarks show that models trained on our distilled coreset, at just 5% of the full data size, achieve 85-90% of the success rate obtained by training on the full dataset, while reducing training time by over 80%. Our work demonstrates that intelligent data distillation is a highly promising new path for building efficient, high-performance VLA models.

Paper Structure

This paper contains 27 sections, 6 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our proposed FT-NCFM framework. The framework follows a three-stage pipeline to distill a high-value synthetic coreset from large VLA datasets for efficient robot policy learning. (Top Left) Multimodal Representation Module: Converts raw VLA data streams into unified token sequences through respective encoders. A Transformer backbone fuses them into a global feature representation h. (Top Right) FT Influence Assessment Engine: A two-stage value assessment process. Stage one uses influence functions for causal attribution to calculate base influence scores. Stage two performs contrastive verification on top-K% elite samples by programmatically generating "minimal counterexamples" in a simulator, refining their influence weights W. (Bottom) Influence-Guided NCFM Distillation: The weights W guide an adversarial network through influence-aware sampling. The discriminator $\Psi$ contrasts "weighted real sample features" with "synthetic features" and provides feedback gradients to the generator G. G produces a synthetic coreset for efficient training of downstream VLA models. The t-SNE plots (a), (b), and (c) further visualize this process: (a) shows the original feature distribution by task (different colors represent different tasks); (b) shows the influence value heatmap after FT assessment (color intensity indicates influence weights); (c) clearly demonstrates that our method's synthetic coreset successfully covers the feature distribution of the original high-value samples, reflecting higher information density, with samples categorized into high-value (blue, 1975), medium-value (orange, 287), and low-value (gray, 238) regions.
  • Figure 2: Detailed workflow of the "Contrastive Verification Refinement" stage in the FT engine. This figure illustrates how we refine the value of the top-K% elite samples through a two-step process. Step 1 explains how we programmatically generate a corresponding "minimal counterexample" $d_{contrast}$ for each elite sample $d_{i}$ in the simulator through an automated "Template Matching and Instantiation" process. Step 2 shows how we quantify the value difference between $d_{i}$ and its counterexample $d_{contrast}$ into the final, refined influence weight $w_{i}$ through influence comparison and a weight modulation function.
  • Figure 3: Instantiation effects of the programmatic perturbation templates. This figure demonstrates the general applicability of our designed perturbation templates across three different VLA tasks, with each pair of columns representing an independent task. In each group, the top-left image is the successful original scene, and the other three are "minimal counterexamples" generated from our template library: (Top Right) Object Substitution, replacing the key interactive object in the task with a new object of different function or form; (Bottom Left) Size Scaling, significantly scaling the key object in the task; (Bottom Right) Position Change, moving the key object so that it no longer satisfies the spatial description in the original instruction.
  • Figure 4: Qualitative experimental results of our FT-NCFM framework on a real-world robot manipulation task. This figure uses the "stack six bowls" task as an example, showing the impact of different training data on the SpatialVLA model's performance through a three-row comparison. (Row 1) Baseline Model: The model trained on 100% of the original data successfully completes the task. (Row 2) Our Method (2.5% Coreset): Trained on only 2.5% of the synthetic data, the model has grasped the core logic of the task, with only minor deficiencies in final placement precision. (Row 3) Our Method (5% Coreset): When the coreset size is increased to 5%, the model can accurately and successfully complete the entire task, with performance comparable to the baseline model trained on 100% of the data.
  • Figure 5: Instantiation effects of programmatic perturbation templates on multiple VLA tasks. This figure shows the application of our three core perturbation templates across seven different robot manipulation tasks. Each row represents an independent task, with complexity and scene diversity gradually increasing from top to bottom. Each column shows a different scenario: (First Column) Original Task: A reference image where the VLA model successfully completes the task in the original, unperturbed environment. (Second Column) Object Substitution: The key interactive object in the task is replaced with a new object of different functionality or morphology (e.g., changing a white cup to a red cup). (Third Column) Size Scaling: The key interactive object is significantly enlarged or shrunk (e.g., making the cup huge). (Fourth Column) Position Change: The key object is moved to a new position in the scene, such that it no longer meets the spatial or contextual requirements of the original instruction. These "minimal counterexamples," automatically generated via templates, are used in the contrastive verification stage of the FT engine to precisely evaluate the causal contribution and generalization value of the original samples, thereby guiding the generation of a more information-dense synthetic coreset.
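
The contrastive verification step described in the captions for Figures 2, 3, and 5 can be sketched as follows. This is a hedged illustration, not the authors' code: the scene dictionary layout, the three template functions, and especially the tanh-based `modulate_weight` are assumptions, since the captions name a "weight modulation function" without giving its form. Influence scores are passed in as plain non-negative numbers rather than computed with influence functions.

```python
import copy
import math


def object_substitution(scene, new_object):
    """Template 1: replace the key object with a different one."""
    out = copy.deepcopy(scene)
    out["key_object"]["name"] = new_object
    return out


def size_scaling(scene, factor):
    """Template 2: enlarge or shrink the key object by a factor."""
    out = copy.deepcopy(scene)
    out["key_object"]["scale"] *= factor
    return out


def position_change(scene, new_position):
    """Template 3: move the key object so the instruction no longer holds."""
    out = copy.deepcopy(scene)
    out["key_object"]["position"] = list(new_position)
    return out


def modulate_weight(base_influence, contrast_influence, temperature=1.0):
    """Hypothetical modulation: raise the weight when the original sample is more
    influential than its counterexample, lower it when the opposite holds."""
    gap = (base_influence - contrast_influence) / temperature
    return base_influence * (1.0 + math.tanh(gap))


# Hypothetical scene description; field names are illustrative only.
scene = {
    "instruction": "put the white cup on the left plate",
    "key_object": {"name": "white cup", "scale": 1.0, "position": [0.2, 0.1, 0.0]},
}

counterexamples = [
    object_substitution(scene, "red cup"),
    size_scaling(scene, 3.0),
    position_change(scene, (0.8, 0.1, 0.0)),
]

# Placeholder influence scores for one elite sample and one counterexample.
refined = modulate_weight(base_influence=1.0, contrast_influence=0.2)
```

Each template returns a deep copy, so the original scene stays intact and every counterexample differs from it in exactly one attribute, which matches the "minimal counterexample" idea in the captions.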