HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions

Shaoyin Ma, Jie Song, Huiqiong Wang, Li Sun, Mingli Song

TL;DR

The paper addresses the challenge of selecting optimal community-driven models from large, evolving hubs, where metadata is incomplete and placing full model descriptions in the prompt causes prompt bloat. It introduces HuggingR$^4$, a progressive reasoning framework comprising Reasoning, Retrieval, Refinement, and Reflection, augmented by vector-based retrieval, a failure-trace mechanism, and a sliding-window strategy to limit token use. A first forward-labeled dataset with 14,399 requests across 37 tasks supports extensive evaluation, in which HuggingR$^4$ achieves substantial gains in workability and reasonability over baselines and keeps token usage stable as the candidate pool grows. The approach enables scalable, online adaptation to changing model ecosystems such as HuggingFace and applies across multimodal tasks, offering practical improvements for building AI agents with diverse external interfaces.

Abstract

Large Language Models (LLMs) have made remarkable progress in their ability to interact with external interfaces. Selecting reasonable external interfaces has thus become a crucial step in constructing LLM agents. In contrast to invoking API tools, directly calling AI models across different modalities from the community (e.g., HuggingFace) poses challenges due to the vast scale (> 10k), metadata gaps, and unstructured descriptions. Current methods for model selection often incorporate entire model descriptions into prompts, resulting in prompt bloat, wasted tokens, and limited scalability. To address these issues, we propose HuggingR$^4$, a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to select models efficiently. Specifically, we first perform multiple rounds of reasoning and retrieval to obtain a coarse list of candidate models. Then, we conduct fine-grained refinement by analyzing candidate model descriptions, followed by reflection to assess the results and determine whether the retrieval scope needs to be expanded. This method reduces token consumption considerably by decoupling user query processing from complex model description handling. Through a pre-established vector database, complex model descriptions are stored externally and retrieved on demand, allowing the LLM to concentrate on interpreting user intent while accessing only relevant candidate models without prompt bloat. In the absence of standardized benchmarks, we construct a multimodal human-annotated dataset comprising 14,399 user requests across 37 tasks and conduct a thorough evaluation. HuggingR$^4$ attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing the existing method by 26.51% and 33.25%, respectively, on GPT-4o-mini.
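The four stages described in the abstract can be sketched as a simple control loop. The following is a minimal illustration only, not the authors' implementation: the model hub, the bag-of-words "embedding", the acceptance threshold, and every function name (`embed`, `retrieve`, `reason`, `refine`, `reflect`, `select_model`) are hypothetical stand-ins, and the steps that an LLM would perform are replaced by trivial similarity checks.

```python
import math
from collections import Counter

# Hypothetical toy hub: model name -> description (stand-in for a full model card).
MODEL_CARDS = {
    "img-captioner": "image captioning model that generates text describing a picture",
    "speech-to-text": "automatic speech recognition model that transcribes audio to text",
    "text-summarizer": "summarization model that shortens long text documents",
    "object-detector": "object detection model that locates objects in an image",
}

def embed(text):
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in for the pre-established vector database: cards are embedded once,
# outside the prompt, and only looked up on demand.
INDEX = {name: embed(desc) for name, desc in MODEL_CARDS.items()}

def retrieve(query, top_k):
    """Return the top_k model names ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda name: cosine(q, INDEX[name]), reverse=True)
    return ranked[:top_k]

def reason(request):
    """Stand-in for LLM reasoning: the request itself is used as the search query."""
    return request

def refine(request, candidates):
    """Stand-in for LLM refinement: pick the candidate whose card matches best."""
    q = embed(request)
    return max(candidates, key=lambda name: cosine(q, INDEX[name]))

def reflect(request, choice, threshold=0.2):
    """Stand-in for LLM reflection: accept if similarity clears an assumed threshold."""
    return cosine(embed(request), INDEX[choice]) >= threshold

def select_model(request, top_k=1, max_rounds=3):
    """Reason -> retrieve -> refine -> reflect, widening retrieval on failure."""
    for _ in range(max_rounds):
        candidates = retrieve(reason(request), top_k)
        choice = refine(request, candidates)
        if reflect(request, choice):
            return choice
        top_k += 1  # reflection rejected the choice: expand the retrieval scope
    return None

print(select_model("transcribes audio to text"))
```

The point of the sketch is the decoupling the abstract describes: the request is processed on its own, while model descriptions live in an external index and enter the loop only as a short candidate list.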

Paper Structure

This paper contains 54 sections, 7 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Comparison of different approaches for model selection. HuggingR$^{4*}$ represents the retrieval-only version.
  • Figure 2: Workflow of HuggingR$^4$ with an example. The process includes: 1) Reasoning and Retrieval: Iterative querying and top-k candidate retrieval from the vector database. 2) Refinement: Fine-grained model selection via sliding window access to complete model cards. 3) Reflection: Self-evaluation to verify model suitability.
  • Figure 3: The sliding window strategy treats each model as a window and ranks them in descending order of similarity scores. The color of each window indicates the system’s access level to the corresponding model card during selection.
  • Figure 4: Token usage comparison across different numbers of candidate models using GPT-4o-mini with text-embedding-3-large (log scale). Our HuggingR$^4$ and HuggingR$^{4*}$ maintain constant token consumption through the sliding window strategy, while Direct Prompting and HuggingGPT scale linearly with the number of candidates. At 30 candidates, HuggingR$^4$ achieves 85.6% reduction compared to Direct Prompting.
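The scaling behaviour described in the captions of Figures 3 and 4 can be illustrated with a small token-accounting sketch. This is one plausible reading of the captions rather than the paper's actual procedure: the function names, the fixed per-card token count, and the window size are all assumptions chosen for illustration, and the numbers are not intended to reproduce the reported 85.6% figure.

```python
def direct_prompt_tokens(num_candidates, tokens_per_card):
    """Every candidate card goes into the prompt, so cost grows linearly."""
    return num_candidates * tokens_per_card

def sliding_window_tokens(num_candidates, tokens_per_card, window_size):
    """Only cards inside the window are read in full, so cost is capped."""
    return min(num_candidates, window_size) * tokens_per_card

def window_view(scored_models, start, window_size):
    """Rank (name, score) pairs by descending similarity and return the
    names that fall inside the window beginning at position `start`."""
    ranked = sorted(scored_models, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[start:start + window_size]]

# Illustrative numbers only (assumed: 400 tokens per card, window of 3 models).
for n in (5, 10, 30):
    print(n, direct_prompt_tokens(n, 400), sliding_window_tokens(n, 400, 3))
```

Under these assumptions the windowed cost stops growing once the candidate count exceeds the window size, which matches the flat curves the Figure 4 caption attributes to the sliding window strategy.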