VisEvol: Visual Analytics to Support Hyperparameter Search through Evolutionary Optimization

Angelos Chatzimparmpas, Rafael M. Martins, Kostiantyn Kucher, Andreas Kerren

TL;DR

VisEvol presents a visual analytics platform that couples evolutionary optimization with interactive, model-agnostic hyperparameter search to construct robust, diverse majority-voting ensembles. It offers eight coordinated panels, including a UMAP-based hyperparameter projection and a Sankey-driven process tracker, to guide users through metric selection, model instantiation, and ensemble assembly. The approach is demonstrated on real-world datasets (heart-disease and QSAR biodegradation), achieving strong macro-average accuracy (up to 89% on external data) and outperforming prior consensus methods. Expert interviews corroborate the workflow's usefulness while identifying scalability as a key limitation and a target for future work. Overall, VisEvol provides a practical, interactive workflow for discovering powerful hyperparameter configurations and cohesive ensembles in a model-agnostic setting.

Abstract

During the training phase of machine learning (ML) models, it is usually necessary to configure several hyperparameters. This process is computationally intensive and requires an extensive search to infer the best hyperparameter set for the given problem. The challenge is exacerbated by the fact that most ML models are complex internally, and training involves trial-and-error processes that could remarkably affect the predictive result. Moreover, each hyperparameter of an ML algorithm is potentially intertwined with the others, and changing it might result in unforeseeable impacts on the remaining hyperparameters. Evolutionary optimization is a promising method to try and address those issues. According to this method, performant models are stored, while the remainder are improved through crossover and mutation processes inspired by genetic algorithms. We present VisEvol, a visual analytics tool that supports interactive exploration of hyperparameters and intervention in this evolutionary procedure. In summary, our proposed tool helps the user to generate new models through evolution and eventually explore powerful hyperparameter combinations in diverse regions of the extensive hyperparameter space. The outcome is a voting ensemble (with equal rights) that boosts the final predictive performance. The utility and applicability of VisEvol are demonstrated with two use cases and interviews with ML experts who evaluated the effectiveness of the tool.
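The evolutionary step described in the abstract (keep the performant models, evolve the remainder through crossover and mutation) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `SEARCH_SPACE` of KNN-style hyperparameters, the uniform crossover, the resampling mutation with its `rate`, and `toy_fitness`, which stands in for a real validation score.

```python
import random

# Hypothetical search space for illustration; not taken from the paper.
SEARCH_SPACE = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "p": [1, 2, 3],
}


def random_config(rng):
    """Draw one configuration uniformly at random (an initial random search)."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each hyperparameter is inherited from one of two parents."""
    return {name: rng.choice([parent_a[name], parent_b[name]]) for name in SEARCH_SPACE}


def mutate(config, rng, rate=0.3):
    """Resample each hyperparameter from its domain with probability `rate`."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in config.items()
    }


def toy_fitness(config):
    """Placeholder score (assumption); a real run would train and validate a model."""
    return -abs(config["n_neighbors"] - 7)


def evolve(population, fitness, n_keep, rng):
    """One generation: store the n_keep fittest configs, evolve the remainder."""
    ranked = sorted(population, key=fitness, reverse=True)
    kept, remainder = ranked[:n_keep], ranked[n_keep:]
    if len(remainder) < 2:
        raise ValueError("need at least two configurations outside the kept set")
    offspring = []
    for _ in remainder:
        # Parents are drawn from the models that were not stored.
        parent_a, parent_b = rng.sample(remainder, 2)
        offspring.append(mutate(crossover(parent_a, parent_b, rng), rng))
    return kept + offspring
```

In this sketch the kept set is chosen automatically by score. VisEvol instead lets the user intervene in that choice, so the ranking step is only a stand-in for the interactive selection.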

Paper Structure

This paper contains 9 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: The VisEvol workflow allows the users to construct performant and diverse ML ensembles, gain knowledge about the hyperparameters chosen via the evolutionary optimization process, and thus gain trust in the respective ML results. The users are capable of interacting with all phases iteratively, represented by the multiple arrows inside the box.
  • Figure 2: Exploration of ML models with VisEvol. View (a) presents a selection of similar and better-performing models in several clusters. (b.1) indicates that Ⓒ1 contains well-performing RF models, in contrast to (b.2), in which Ⓒ3 includes more diverse RF models and a GradB model. For the accuracy, recall, and f1-score metrics, Ⓒ1 performs much better than the average, based on the bean plots in (c.1). However, Ⓒ3 achieves better results for the precision metric. In the grid-based view (d.1), LR, RF, and GradB algorithms appear more powerful than other algorithms that are more diverse due to the good predictions of hard-to-classify instances. Ⓒ2 seems redundant because Ⓒ4 and Ⓒ5 improve similar cases (d.2). If we look at (d.3) and (d.4), both visualizations display MLP models that enhance the predictive power of different instances in both classes. Finally, in view (e), we mix models from the multiple explored clusters to create the first voting ensemble.
  • Figure 3: Tuning the crossover and mutation process toward $S_2$. In (a), we set fewer models for mutation and more for crossover for both KNN and MLP algorithms. Our choice is motivated by the feedback received from the bad KNN mutation in $S_1$ and the fact that KNN and MLP perform almost identically for both independent classes (as illustrated in (b)). Similar to Figure \ref{fig:use_case1_model_s0}, we investigate clusters in the projection, select a few models from each explored cluster shown in (c), and send the rest for crossover and mutation.
  • Figure 4: The outcome of the $S_2$ evolutionary optimization procedure and the final voting ensemble: (a) highlights that we have reached an impactful solution, since the models are not getting significantly better. Thus, we skip the addition of models from $S_2$. In (b), after an extensive exploration of the majority-voting ensemble, we end up with the selection of four models: M4 originating from the initial random search (the most performant when used individually) and M1--M3 from the crossover and mutation processes at $S_1$. We narrow down this selection even further by examining (c), where one MLP model appears to perform better for the Diseased class and two GradB models for the Healthy class. The active performance matches the best performance found so far (d). Hence, this is the most powerful majority-voting ensemble.
  • Figure 5: The exploration of clusters of interest that contain performant ML models. View (a) presents the user's selection that drives the analyses performed in the remaining subfigures. (b.1) provides an overview of the performance, showing that Ⓒ3 has underperforming KNN and GradB models. On the other hand, (b.2) shows that the user's choice of models retains both performance and diversity. In (c.1), we observe that g-mean and ROC AUC scores are very low, which is a problem investigated further in view (d.1). Those models appear to perform better for the hard-to-classify instances; however, this is a misconception. (c.2) gives supporting evidence to the user's selection, since all validation metrics are higher than the average values for all models, along with the in-depth visualization in (d.2).
  • ...and 1 more figure
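
The majority-voting ensemble referred to in the Figure 4 caption, in which every selected model has an equal vote, can be sketched as follows. This is an illustrative sketch rather than the tool's implementation: the function name `majority_vote`, the example predictions, and the tie-breaking rule (the smallest label in sort order wins) are assumptions. Only the class names Healthy and Diseased come from the caption.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per instance.

    Each model contributes exactly one vote per instance (equal weights).
    """
    n_instances = len(predictions_per_model[0])
    if any(len(preds) != n_instances for preds in predictions_per_model):
        raise ValueError("all models must predict the same number of instances")

    ensemble = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumption: ties are broken by picking the smallest label in sort order.
        ensemble.append(min(label for label, n in counts.items() if n == top))
    return ensemble


# Hypothetical predictions from four models on four instances.
m1 = ["Diseased", "Healthy", "Healthy", "Diseased"]
m2 = ["Diseased", "Healthy", "Diseased", "Healthy"]
m3 = ["Healthy", "Healthy", "Diseased", "Diseased"]
m4 = ["Diseased", "Diseased", "Healthy", "Diseased"]
combined = majority_vote([m1, m2, m3, m4])
```

With an even number of models, ties such as the 2-2 split on the third instance are possible, so any real ensemble needs an explicit tie-breaking rule. The rule used here is only one option.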