
pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI

Qingqian Yang, Hao Wang, Sai Qian Zhang, Jian Li, Yang Hua, Miao Pan, Tao Song, Zhengwei Qi, Haibing Guan

Abstract

Vision-Language Navigation (VLN) requires large-scale trajectory-instruction data from private indoor environments, raising significant privacy concerns. Federated Learning (FL) mitigates this by keeping data on-device, but vanilla FL struggles under VLN's extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to a 7.5% improvement in navigation success rate and up to a 7.8% gain in trajectory fidelity, while converging 1.38x faster under non-IID conditions.

Paper Structure (16 sections, 4 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Environmental heterogeneity across different houses for the VLN task. Each client corresponds to a distinct house with a substantially different spatial layout and structural characteristics.
  • Figure 2: Data heterogeneity analysis on the RxR dataset. We visualize house-level statistics along four dimensions: Instruction, measured by the average instruction length; Path, measured by the variance of navigation trajectory lengths within each house; Scale, measured by the number of rooms, indicating the house size; and Complexity, measured by the number of nodes in the induced navigation graph, indicating how many navigation states and decisions the agent may encounter. All statistics are normalized to [0,1] across houses. Darker colors indicate lower values, while lighter colors indicate higher values.
  • Figure 3: Success rate and navigation error comparison on the R2R dataset. The line and the shaded band show the mean and variance of performance across clients.
  • Figure 4: pFedNavi's workflow. pFedNavi operates in three stages: (1) adaptive personalized layer selection; (2) fine-grained parameter fusion; and (3) local training with federated aggregation.
  • Figure 5: Comparison between pFedNavi and FedVLN on the R2R dataset using ResNet-152 visual features. (a) Performance curves for several metrics (e.g., SR, OSR, nDTW, and CLS), and (b) loss convergence over communication rounds (red: pFedNavi, blue: FedVLN).
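
The three stages named in the Figure 4 caption can be illustrated with a minimal sketch. This is not the paper's implementation: the threshold rule for selecting layers, the convex per-layer blend, the sample-weighted averaging, and all function and layer names below are assumptions made only for illustration, since this page does not give the actual equations.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def select_personalized_layers(alpha: Dict[str, float],
                               threshold: float = 0.5) -> List[str]:
    """Stage 1 (assumed rule): a layer is treated as client-specific
    when its mixing coefficient exceeds the threshold."""
    return sorted(name for name, a in alpha.items() if a > threshold)


def fuse(local: Params, global_: Params, alpha: Dict[str, float],
         personalized: List[str]) -> Params:
    """Stage 2 (assumed form): blend local and global weights on the
    selected layers; all other layers take the global weights."""
    fused: Params = {}
    for name, g in global_.items():
        if name in personalized:
            a = alpha[name]
            fused[name] = [a * lw + (1.0 - a) * gw
                           for lw, gw in zip(local[name], g)]
        else:
            fused[name] = list(g)
    return fused


def fedavg(client_params: List[Params], num_samples: List[int]) -> Params:
    """Stage 3 (server side): sample-weighted average of client weights,
    as in standard FedAvg."""
    total = sum(num_samples)
    averaged: Params = {}
    for name in client_params[0]:
        size = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, num_samples)) / total
            for i in range(size)
        ]
    return averaged


# Toy usage with made-up layer names and coefficients.
local = {"encoder": [1.0, 1.0], "enc_dec_proj": [2.0, 2.0]}
global_ = {"encoder": [0.0, 0.0], "enc_dec_proj": [0.0, 0.0]}
alpha = {"encoder": 0.1, "enc_dec_proj": 0.75}

personalized = select_personalized_layers(alpha)
fused = fuse(local, global_, alpha, personalized)
print(personalized, fused)
```

In this toy run only the hypothetical `enc_dec_proj` layer keeps a share of the client's local weights, while `encoder` is overwritten by the global model. Local training between fusion and aggregation is omitted here.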