LAVA: Long-horizon Visual Action based Food Acquisition

Amisha Bhaskar, Rui Liu, Vishnu D. Sharma, Guangyao Shi, Pratap Tokekar

Abstract

Robotic Assisted Feeding (RAF) addresses the fundamental need for individuals with mobility impairments to regain autonomy in feeding themselves. The goal of RAF is to use a robot arm to acquire food from the table and transfer it to the individual. Existing RAF methods primarily focus on solid foods, leaving a gap in manipulation strategies for semisolid and deformable foods. This study introduces Long-horizon Visual Action (LAVA) based food acquisition of liquid, semisolid, and deformable foods. Long-horizon refers to the goal of "clearing the bowl" by sequentially acquiring the food from the bowl. LAVA employs a hierarchical policy for long-horizon food acquisition tasks. At the high level, the framework uses ScoopNet to determine primitives. At the mid level, LAVA finds parameters for primitives using vision. To carry out sequential plans in the real world, LAVA delegates action execution to a low-level policy, which uses the parameters received from the mid-level policy together with behavior cloning to ensure precise trajectory execution. We validate our approach on complex real-world acquisition trials involving granular, liquid, semisolid, and deformable food types, along with fruit chunks and soup. Across 46 bowls, LAVA acquires food much more efficiently than baselines, with a success rate of 89 ± 4%, and generalizes across realistic plate variations such as different positions, varieties, and amounts of food in the bowl. Code, datasets, videos, and supplementary materials can be found on our website.

Paper Structure (24 sections, 8 figures)

Figures (8)

  • Figure 1: System setup for LAVA alongside an illustrative description of the proposed framework with snapshots of task execution.
  • Figure 2: System architecture of LAVA, which employs a high-level policy (blue) $\pi_H$ to select among discrete high-level primitives $P_{H}^{k}$, such as the wide primitive and the deep primitive. The selection is then refined by a mid-level policy (green) $\pi_M$, which selects among mid-level primitives $P_{M}^{k}$. A low-level, vision-parameterized policy $\pi_L$ (brown) executes a trajectory learned through behavioral cloning for long-horizon dexterous food acquisition.
  • Figure 3: ScoopNet outputs the softmax probabilities over the high-level primitive depending on the type of food items present in the image.
  • Figure 4: TargetNet finds the next "target" item for the wide high-level primitive, along with the mid-level primitive that decides whether to scoop the target item or to align it first.
  • Figure 5: DepthNet detects the depth ($h$) of the food in the bowl.
  • ...and 3 more figures
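
The Figure 3 caption describes ScoopNet as producing softmax probabilities over the high-level primitives. A minimal sketch of that selection step follows, assuming a classifier that emits one raw score (logit) per primitive. The names `HIGH_LEVEL_PRIMITIVES` and `select_high_level_primitive`, the ordering of the primitives, and the example logits are all hypothetical and are not taken from the paper's code.

```python
import math

# Primitive names come from the Figure 2 caption; their order here is an assumption.
HIGH_LEVEL_PRIMITIVES = ["wide", "deep"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_high_level_primitive(logits):
    """Return (primitive_name, probabilities) for the most probable primitive."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return HIGH_LEVEL_PRIMITIVES[best], probs


# Made-up logits standing in for a classifier's output on one bowl image.
primitive, probs = select_high_level_primitive([0.3, 1.7])
print(primitive, [round(p, 3) for p in probs])
```

In this sketch the argmax of the probabilities picks the primitive; a real system might instead threshold or sample, which the captions above do not specify.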