Grounded Language Agent for Product Search via Intelligent Web Interactions

Moghis Fereidouni, Adib Mosharrof, A. B. Siddique

TL;DR

GLAINTEL is a grounded language agent for product search that operates over interactive web pages with a dynamic action space. Built on Flan-T5, it supports unsupervised learning, supervised learning, and unsupervised domain adaptation; PPO drives unsupervised policy optimization, aided by an auxiliary value head. Key findings show that unsupervised PPO with a relatively small model can surpass large in-context LLM baselines, that naive behavioral cloning may underperform RL, and that combining demonstrations with PPO yields the strongest results, approaching GPT-4 in some settings and outperforming it cost-effectively in others. The work demonstrates practical viability for real-world web navigation and provides code for reproducibility, with implications for scalable, cost-efficient agents in e-commerce contexts.
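To make the PPO and value-head terms above concrete, here is a minimal sketch of the standard PPO clipped surrogate loss together with a squared-error value loss. It is the textbook formulation, not necessarily the authors' exact objective; the function names, the `clip_eps=0.2` default, and the plain-Python list inputs are assumptions made for illustration.

```python
import math


def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped-surrogate policy loss (to be minimized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    All names and the clip_eps default are illustrative assumptions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


def value_loss(predicted_values, returns):
    """Mean squared error between value-head predictions and observed returns."""
    return sum((v - r) ** 2 for v, r in zip(predicted_values, returns)) / len(returns)
```

When the new and old policies agree the ratio is 1 and the loss reduces to the negative mean advantage; the clip only takes effect once the policy moves more than `clip_eps` away from the one that collected the data.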

Abstract

The development of agents powered by large language models (LLMs) to accomplish complex high-level user intents has attracted significant attention recently. However, employing LLMs with billions of parameters (e.g., GPT-4) may incur substantial costs on top of handcrafting extensive prompts. To address this, we introduce a Grounded Language Agent for Intelligent Web Interactions, named GLAINTEL. GLAINTEL employs Flan-T5 as its backbone and can be trained in various settings: unsupervised learning, supervised learning, and unsupervised domain adaptation. Specifically, we tackle both the challenge of learning without human demonstrations and the opportunity to leverage human demonstrations effectively when they are available. Additionally, we explore unsupervised domain adaptation for cases where demonstrations are limited to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of GLAINTEL in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised variants of GLAINTEL. Additionally, we show that combining human demonstrations with reinforcement learning-based training yields results comparable to methods utilizing GPT-4. The code is available at: https://github.com/MultifacetedNLP/WebAgents-Unsupervised.

Paper Structure (19 sections, 5 equations, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Overview of $\mathsf{GLAINTEL}$: Our agent employs the Flan-T5 architecture and incorporates a language modeling head to adapt to a dynamic action space, while the value head enables precise value estimation.
  • Figure 2: Learning curves of different methodologies: Unsupervised Domain Adaptation (UDA), Hybrid (BC + PPO) ($\mathsf{GLAINTEL}$), and RL-based Unsupervised (PPO).
  • Figure 3: Hybrid setting (BC + PPO): Flan-T5 is more sample efficient than the T5 model.
  • Figure 4: The model is more sample efficient when it is fed the last two observations.
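The Figure 1 caption says the language modeling head lets the agent adapt to a dynamic action space, but does not spell out how. One plausible reading, sketched below as an assumption rather than the paper's confirmed method, is to score each action string available on the current page by the sum of its token log-probabilities and normalize with a softmax over just those actions. The action strings and log-probability values are made-up placeholders, and `action_distribution` is a hypothetical helper.

```python
import math


def action_distribution(token_logprobs_per_action):
    """Turn {action: [token log-probs]} into {action: probability}.

    Assumption: an action's score is the sum of its token log-probabilities,
    and scores are softmax-normalized over the actions currently available.
    """
    scores = {a: sum(lps) for a, lps in token_logprobs_per_action.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


# Hypothetical per-token log-probs for the actions offered by one page.
page_actions = {
    "click[item-1]": [-0.2, -0.5, -0.3],
    "click[next page]": [-1.0, -1.5],
    "click[back to search]": [-2.0, -2.0],
}
probs = action_distribution(page_actions)
best = max(probs, key=probs.get)
```

Because the normalization runs over whatever actions the page exposes, the same scoring code works whether a page offers three actions or thirty, which is the property a dynamic action space requires.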