Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications

Oren Sultan, Alex Khasin, Guy Shiran, Asnat Greenstein-Messica, Dafna Shahaf

TL;DR

This work focuses on visual editing tasks and finds that proprietary LLMs such as GPT-3.5-Turbo show potential for this task, but that their high cost and latency make them unsuitable for real-time applications.

Abstract

We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language ("golden hour"), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models manage to match the performance of our teacher model (GPT-3.5-Turbo), significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using augmentation.
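
The Figure 2 caption describes the "behavioral signals" as export behavior: for each intent, the teacher output with the highest export rate is kept, and the data is then split randomly into training and test sets. The sketch below illustrates that selection step under stated assumptions. The record layout, the definition of export rate as exports divided by times shown, and the `test_fraction` and `seed` parameters are all hypothetical and not taken from the paper.

```python
import random


def build_dataset(records, test_fraction=0.2, seed=0):
    """Keep one teacher output per intent and split into train/test.

    records: iterable of (intent, teacher_output, shown_count, export_count).
    This tuple layout is an assumption made for illustration.
    """
    best = {}  # intent -> (teacher_output, export_rate)
    for intent, output, shown, exported in records:
        if shown == 0:
            continue  # no behavioral signal for this output
        rate = exported / shown  # assumed definition of export rate
        if intent not in best or rate > best[intent][1]:
            best[intent] = (output, rate)

    # Sort first so the shuffle is reproducible for a given seed.
    samples = [(intent, output) for intent, (output, _) in sorted(best.items())]
    random.Random(seed).shuffle(samples)

    n_test = int(len(samples) * test_fraction)
    return samples[n_test:], samples[:n_test]  # (train, test)
```

A production pipeline would likely also require a minimum number of impressions before trusting an export rate; that threshold is omitted here because the source does not specify one.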

Paper Structure (25 sections, 4 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: An illustration of our visual editing task. Users input an image/video and specify the desired visual appearance (upper row: source images, middle: user intents). An LLM interprets these intents, selects tools, and sets parameters. The bottom row displays the generated images by applying the LLM's output in our app. For example, inputting "Morocco" (left) results in warm hues typical of Moroccan landscapes, reflecting its deserts.
  • Figure 2: Our distillation framework approach. (1) We create a dataset by collecting user intents and the output (or potentially multiple outputs, if several users expressed the same intent) of our teacher LLM. We ensure high quality by keeping outputs users chose to export frequently (one output with the highest export rate per intent). After data processing, we randomly split the data into training and test sets. (2) We fine-tune a smaller student LLM on our dataset. (3) Offline, we evaluate the student LLM's selection of tools and predicted parameters. (4) To improve fine-tuning in low-data regimes, we use an LLM to augment the training data by generating similar samples (e.g., "cool tone" from "cool morning") to mistakes of the student LLM. (5) If a better student model is found offline, we conduct an online A/B test.
  • Figure 3: A one-shot, Chain-of-Thought (CoT) prompt for the teacher LLM to generate parameters for the global color grading (adjust) tool. It includes a task description, available tools, and 14 adjustable parameters with specified ranges. The prompt provides an example of a user request for "golden hour" with rationale (TOOL) and output parameters (JSON). An empty JSON means the LLM chose not to use the tool. We can see that the actions in the reasoning (TOOL) match the parameters (JSON) (e.g., "The temperature should be increased to add a warm, golden tone to the image").
  • Figure 4: A one-shot, Chain-of-Thought (CoT) prompt for the teacher LLM to generate parameters for the selective color grading (selective adjust) tool. It includes a task description, available tools, and parameters (six colors with two adjustable parameters each, from -100 to 100). The prompt shows an example user request for "golden hour" with rationale (TOOL) and output parameters (JSON). An empty JSON means the LLM chose not to use the tool. We can see that the actions in the reasoning (TOOL) match the parameters (JSON) (e.g., "We would also slightly reduce the saturation and luminance of the blues and greens...").
  • Figure 5: A one-shot, Chain-of-Thought (CoT) prompt for the teacher LLM to generate parameters for the filter tool. It includes a task description, available tools, and parameters for the filter tool (filter name from LUT presets and intensity from 0 to 100). The prompt provides a user request example for "welding mask" with rationale (TOOL) and output parameters (JSON). Selecting "none" as the filter name indicates the LLM decided not to use the tool. As we can see, the reasoning (TOOL) aligns with the parameters (JSON) (The "night_vision" LUT preset seems the most appropriate).
  • ...and 9 more figures
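
The captions for Figures 3 to 5 describe the output conventions of the three tools: an empty JSON object means the tool is not used, a filter name of "none" means the filter tool is skipped, filter intensity ranges from 0 to 100, and selective adjust parameters range from -100 to 100. The following is a minimal sketch of how an application might turn such outputs into a list of tool calls. The tool keys (`adjust`, `selective_adjust`, `filter`), the JSON field names (`name`, `intensity`), and the choice to clamp out-of-range values are assumptions made for illustration. The per-parameter ranges of the adjust tool are not given in the captions, so they are left unchecked.

```python
import json

# Ranges stated in the figure captions.
FILTER_INTENSITY_RANGE = (0, 100)
SELECTIVE_RANGE = (-100, 100)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def parse_tool_outputs(raw):
    """Convert per-tool JSON strings into a list of (tool, params) calls.

    raw: dict mapping a tool name to the JSON string produced by the LLM.
    Tool and field names here are hypothetical.
    """
    calls = []
    for tool, text in raw.items():
        params = json.loads(text)
        if not params:
            continue  # empty JSON: the LLM chose not to use this tool
        if tool == "filter":
            if params.get("name", "none") == "none":
                continue  # "none" filter: the tool is skipped
            params["intensity"] = clamp(
                params.get("intensity", 0), *FILTER_INTENSITY_RANGE
            )
        elif tool == "selective_adjust":
            # Assumed shape: {color: {parameter: value}}.
            params = {
                color: {k: clamp(v, *SELECTIVE_RANGE) for k, v in values.items()}
                for color, values in params.items()
            }
        calls.append((tool, params))
    return calls
```

Clamping is only one option; rejecting an out-of-range output and falling back to a default edit would be an equally reasonable design.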