A First Look at Immersive Telepresence on Apple Vision Pro

Ruizhi Cheng, Nan Wu, Matteo Varvello, Eugene Chai, Songqing Chen, Bo Han

TL;DR

This work presents the first empirical measurement study of immersive telepresence on Apple Vision Pro, evaluating FaceTime, Webex, Teams, and Zoom. It finds that only FaceTime delivers a true spatial persona, and that FaceTime uses semantic communication to keep bandwidth below 1 Mbps, albeit with rate-adaptation and scalability limitations. The study also uncovers visibility-aware rendering optimizations that significantly reduce GPU load but do not cut bandwidth, and shows that a simple server-allocation strategy can leave users with RTTs above 100 ms, highlighting the need for geo-distributed or remote-rendering solutions. Together, these findings guide the design of scalable, immersive telepresence systems for head-mounted displays and inform practical deployment considerations.

Abstract

Due to the widespread adoption of "work-from-home" policies, videoconferencing applications (e.g., Zoom) have become indispensable for remote communication. However, they often lack immersiveness, leading to the so-called "Zoom fatigue" and degrading communication efficiency. The recent debut of Apple Vision Pro, a mobile headset that supports "spatial persona", aims to offer an immersive telepresence experience. In this paper, we conduct a first-of-its-kind, in-depth empirical study of the performance of immersive telepresence with Apple FaceTime, Cisco Webex, Microsoft Teams, and Zoom on Vision Pro. We find that only FaceTime provides a truly immersive experience with spatial personas, whereas the others still use 2D personas. Our measurement results reveal that (1) FaceTime delivers semantic data to optimize bandwidth consumption, which is even lower than that of the 2D personas used by the other applications, and (2) it employs visibility-aware optimizations to reduce rendering overhead. However, the scalability of FaceTime remains limited, and its simple server-allocation strategy can lead to high network delay for users.

Paper Structure (14 sections, 7 figures)

Figures (7)

  • Figure 1: (a) Spatial persona on FaceTime vs. (b) 2D persona on Webex.
  • Figure 2: Cameras on Apple Vision Pro.
  • Figure 3: Measurement setup with two users, U1 and U2.
  • Figure 4: Round-trip time between FaceTime (F), Zoom (Z), Webex (W), and Teams (T) servers and test users. The server locations are indicated by their abbreviations: CA (California), TX (Texas), IL (Illinois), VA (Virginia), NJ (New Jersey), and WA (Washington State).
  • Figure 5: Throughput of FaceTime with spatial persona (F), FaceTime with 2D persona (F*), Zoom (Z), Webex (W), and Teams (T) with two participants. Blue dots represent mean values.
  • ...and 2 more figures
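
This summary does not describe the authors' measurement tooling. As a rough illustration of how a throughput figure like the one in Figure 5 (and the sub-1 Mbps observation for FaceTime) could be derived from a packet trace, the sketch below bins packets into one-second intervals and averages them. The function name, the `(timestamp, size)` record format, and the sample trace are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_second_throughput_mbps(packets):
    """Bucket (timestamp_seconds, size_bytes) records into 1-second bins
    and return the throughput of each bin in Mbps, ordered by time."""
    if not packets:
        return []
    start = min(ts for ts, _ in packets)
    bytes_per_bin = defaultdict(int)
    for ts, size in packets:
        bytes_per_bin[int(ts - start)] += size
    last_bin = max(bytes_per_bin)
    # Seconds with no packets count as 0 Mbps so idle periods lower the mean.
    return [bytes_per_bin.get(i, 0) * 8 / 1e6 for i in range(last_bin + 1)]


# Hypothetical trace: (seconds since capture start, packet size in bytes).
packets = [(0.0, 50_000), (0.5, 50_000), (1.2, 75_000), (2.9, 25_000)]
series = per_second_throughput_mbps(packets)
mean_mbps = mean(series)
print(series, round(mean_mbps, 3))
```

In a real capture the records would come from a tool such as a packet sniffer filtered to the call's flows; one-second bins are an assumption here, and a different window would change the spread of the per-interval values but not the overall mean.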