InFiConD: Interactive No-code Fine-tuning with Concept-based Knowledge Distillation

Jinbin Huang, Wenbin He, Liang Gou, Liu Ren, Chris Bryan

TL;DR

InFiConD tackles the challenge of deploying large pretrained vision models in resource-constrained environments by enabling interpretable, concept-based knowledge distillation (KD) and no-code fine-tuning. The framework builds a model-agnostic KD pipeline that maps images into 584-dimensional, text-labeled concept vectors drawn from a CLIP-S4 based concept corpus, and trains an ensemble of $k=20$ linear student models using BCE loss with $L_1$ regularization ($\lambda=10^{-4}$). A no-code concept tuning workflow lets users uptune or downtune specific concepts by constraining their weights to bounds scaled by a factor $x$ around the original values, with a provenance log of tuning actions and their performance impacts. Validation via usage scenarios and a user study indicates that the human-in-the-loop, visualization-driven approach supports effective knowledge transfer, rapid fine-tuning, and broader accessibility for domain-specific applications, without extensive coding requirements.
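As a rough illustration of the training objective and the bounded tuning described above, the sketch below computes a BCE loss with an $L_1$ penalty for one linear student, and derives a weight interval for an uptune or downtune action. It rests on assumptions not stated in this summary: the teacher signal is taken to be a probability in $[0, 1]$, and the bound is read as an interval of width $x \cdot |w_0|$ on one side of the original weight $w_0$. The names `student_loss`, `tuning_bounds`, and `clip_to_bounds` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def student_loss(weights, bias, concept_vec, teacher_prob, lam=1e-4):
    """BCE between a linear student's sigmoid output and a teacher target,
    plus an L1 penalty on the concept weights (assumed form)."""
    logit = sum(w * c for w, c in zip(weights, concept_vec)) + bias
    # Clamp the probability to keep log() finite.
    p = min(max(sigmoid(logit), 1e-12), 1.0 - 1e-12)
    bce = -(teacher_prob * math.log(p) + (1.0 - teacher_prob) * math.log(1.0 - p))
    return bce + lam * sum(abs(w) for w in weights)


def tuning_bounds(w0, direction, x):
    """Assumed reading of the bounded constraint: uptuning allows the weight
    to rise by at most x * |w0|, downtuning allows it to fall by at most that."""
    span = x * abs(w0)
    if direction == "up":
        return (w0, w0 + span)
    if direction == "down":
        return (w0 - span, w0)
    raise ValueError("direction must be 'up' or 'down'")


def clip_to_bounds(w, bounds):
    """Project a candidate weight back into its allowed interval."""
    lo, hi = bounds
    return min(max(w, lo), hi)
```

In this reading, a fine-tuning step would update the weights as usual and then call `clip_to_bounds` on each tuned concept, so that a tuned concept cannot drift arbitrarily far from the distilled value.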

Abstract

The emergence of large-scale pre-trained models has expanded their use in various downstream tasks, yet deployment remains a challenge in environments with limited computational resources. Knowledge distillation has emerged as a solution in such scenarios, whereby knowledge from large teacher models is transferred into smaller student models, but this is a non-trivial process that traditionally requires technical expertise in AI/ML. To address these challenges, this paper presents InFiConD, a novel framework that leverages visual concepts to implement the knowledge distillation process and enable subsequent no-code fine-tuning of student models. We develop a novel knowledge distillation pipeline based on extracting text-aligned visual concepts from a concept corpus using multimodal models, and construct highly interpretable linear student models based on visual concepts that mimic a teacher model in a response-based manner. InFiConD's interface allows users to interactively fine-tune the student model by manipulating concept influences directly in the user interface. We validate InFiConD via a robust usage scenario and user study. Our findings indicate that InFiConD's human-in-the-loop and visualization-driven approach enables users to effectively create and analyze student models, understand how knowledge is transferred, and efficiently perform fine-tuning operations. We discuss how this work highlights the potential of interactive and visual methods in making knowledge distillation and subsequent no-code fine-tuning more accessible and adaptable to a wider range of users with domain-specific demands.

Paper Structure (29 sections, 1 equation, 8 figures)

Figures (8)

  • Figure 1: InFiConD's pipeline (a) extracts text-aligned visual concepts, (b) maps images into concept-based interpretable vectors, (c) trains linear student models on these vectors, guided by teacher logits, and (d) uses a visual analytics interface to visualize these concepts for the student model, emphasizing those with high influence to facilitate model analysis and fine-tuning.
  • Figure 2: We employ the CLIP-S$^4$ model, trained to align pixel embeddings (image segments) with text embeddings, to extract text-labeled visual concepts.
  • Figure 3: To perform concept mapping on an image, we compute the pairwise cosine similarity between its segment embeddings and concept vectors, and use the maximum cosine similarity value among all segments to represent a concept's degree of presence in the image.
  • Figure 4: (a) We train a linear student model for each class to imitate the teacher model's predictions. The student models consist of a single fully connected layer without activation functions. Each neural connection from an input node to the output node represents the influence of a concept on a specific class. (b) Users can provide instructions specifying which concepts to tune and how to adjust their importance (i.e., increase or decrease).
  • Figure 5: In three iterations, Sam improves the tv monitor class from underperforming by 1.41% to outperforming by 0.77%, with an overall average precision increase of 2.18%.
  • ...and 3 more figures
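
The concept mapping described in the Figure 3 caption can be sketched as follows: each concept's score is the maximum cosine similarity between that concept's vector and any of the image's segment embeddings. This is a minimal illustration with plain Python lists; the function names `cosine` and `concept_vector` are hypothetical, and the real embeddings would come from the CLIP-S$^4$ model rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_vector(segment_embs, concept_embs):
    """For each concept, keep the best-matching segment's cosine similarity.
    The result has one entry per concept (584 in the paper's corpus)."""
    return [max(cosine(seg, con) for seg in segment_embs) for con in concept_embs]


# Toy example: two image segments, two concepts, 2-D embeddings.
segments = [[1.0, 0.0], [0.0, 1.0]]
concepts = [[1.0, 0.0], [1.0, 1.0]]
scores = concept_vector(segments, concepts)
```

Taking the maximum rather than the mean means a concept counts as present when any single segment matches it strongly, regardless of how much of the image that segment covers.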