T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation

Xiaobei Zhao, Xingqi Lyu, Xin Chen, Xiang Li

Abstract

Agricultural robotic agents are becoming useful helpers in a wide range of agricultural tasks. However, they still rely heavily on manual operation or fixed railways for movement. To address this limitation, the AgriVLN method and the A2A benchmark pioneered the extension of Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents to navigate to target positions by following natural language instructions. We observe that AgriVLN understands simple instructions effectively but often misunderstands complex ones. To bridge this gap, we propose the T-araVLN method, in which we build an instruction translator module that translates noisy and mistaken instructions into refined and precise representations. When evaluated on A2A, T-araVLN improves Success Rate (SR) from 0.47 to 0.63 and reduces Navigation Error (NE) from 2.91 m to 2.28 m, demonstrating state-of-the-art performance in the agricultural VLN domain. Code: https://github.com/AlexTraveling/T-araVLN.
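
To make the two reported metrics concrete, the following is a minimal sketch of how SR and NE are commonly computed in VLN evaluation. It is an illustration under stated assumptions, not the A2A evaluation code: the `Episode` type, the `evaluate` function, and the `success_radius_m` threshold are hypothetical, and the actual success criterion used by A2A is not given in this section.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Episode:
    """Hypothetical record of one navigation episode (positions in metres)."""
    final_xy: tuple[float, float]   # where the agent stopped
    target_xy: tuple[float, float]  # where it was supposed to stop


def navigation_error(ep: Episode) -> float:
    """Euclidean distance between the stopping point and the target."""
    dx = ep.final_xy[0] - ep.target_xy[0]
    dy = ep.final_xy[1] - ep.target_xy[1]
    return hypot(dx, dy)


def evaluate(episodes: list[Episode], success_radius_m: float = 2.0) -> tuple[float, float]:
    """Return (SR, NE) over a non-empty list of episodes.

    SR: fraction of episodes that end within `success_radius_m` of the target.
    NE: mean distance to the target at the end of the episode.
    The 2.0 m default is an assumed placeholder, not a value from the paper.
    """
    errors = [navigation_error(ep) for ep in episodes]
    sr = sum(e <= success_radius_m for e in errors) / len(errors)
    ne = sum(errors) / len(errors)
    return sr, ne


if __name__ == "__main__":
    demo = [
        Episode(final_xy=(0.0, 0.0), target_xy=(0.0, 1.0)),  # error 1.0 m
        Episode(final_xy=(0.0, 0.0), target_xy=(3.0, 4.0)),  # error 5.0 m
    ]
    print(evaluate(demo))
```

Under this reading, a higher SR and a lower NE both indicate better navigation, which matches the direction of the improvements reported in the abstract.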

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: T-araVLN (right) vs. the baseline (left) on a simple example: noisy and mistaken instructions such as “close” are translated into refined and precise representations such as “she occupies all your camera view”, making it easier for the decision-maker to align linguistic and visual inputs.
  • Figure 2: Illustration of the T-araVLN methodology: prompted by the five translation principles together with the task description and the format restriction, T-araVLN translates the original instruction into a more refined and precise one, then interprets both linguistic and visual inputs to predict the low-level action sequence that navigates the agent from the starting point to the target position. (*The base model's detailed prediction process is described in AgriVLN.)
  • Figure 3: Ablation study illustrations on the five translation principles in the instruction translator. $\uparrow$ and $\downarrow$ indicate that higher and lower values correspond to better performance, respectively. “+” represents integration. “$\text{-\,-\,-}$” marks the baseline score.
  • Figure 4: Qualitative experiment on a representative episode: the original instruction contains several noisy and mistaken expressions, which makes the alignment between linguistic and visual inputs challenging. T-araVLN, however, translates the instruction into a refined and precise one, effectively improving the alignment. ✓ represents correct prediction after rational reasoning; ✓ represents correct prediction after illogical reasoning; ✗ represents wrong prediction after illogical reasoning. Purely for the sake of a complete demonstration, the baseline model is reset to the ground-truth path after it deviates. (Zoom in for a better view.)