GECCO '21: Proceedings of the Genetic and Evolutionary Computation ConferenceFull Citation in the ACM Digital Library
SESSION: Evolutionary machine learning
Metalearning of deep neural network (DNN) architectures and hyperparameters has become an increasingly important area of research. Loss functions are a type of metaknowledge that is crucial to effective training of DNNs, however, their potential role in metalearning has not yet been fully explored. Whereas early work focused on genetic programming (GP) on tree representations, this paper proposes continuous CMA-ES optimization of multivariate Taylor polynomial parameterizations. This approach, TaylorGLO, makes it possible to represent and search useful loss functions more effectively. In MNIST, CIFAR-10, and SVHN benchmark tasks, TaylorGLO finds new loss functions that outperform the standard cross-entropy loss as well as novel loss functions previously discovered through GP, in fewer generations. These functions serve to regularize the learning task by discouraging overfitting to the labels, which is particularly useful in tasks where limited training data is available. The results thus demonstrate that loss function optimization is a productive new avenue for metalearning.
In cluster analysis, the automatic clustering problem refers to the determination of both the appropriate number of clusters and the corresponding natural partitioning. This can be addressed as an optimization problem in which a cluster validity index (CVI) is used as a fitness function to evaluate the quality of potential solutions. Different CVIs have been proposed in the literature, aiming to identify adequate cluster solutions in terms of intracluster cohesion and intercluster separation. However, it is important to identify the scenarios in which these CVIs perform well and their limitations. This paper evaluates the effectiveness of 22 different CVIs used as fitness functions in an evolutionary clustering algorithm named ACDE based on differential evolution. Several synthetic datasets are considered: linearly separable data having both well-separated and overlapped clusters, and non-linearly separable data having arbitrarily-shaped clusters. Besides, real-life datasets are also considered. The experimental results indicate that the Silhouette index consistently reached an acceptable performance in linearly separable data. Furthermore, the indices Calinski-Harabasz, Davies-Bouldin, and generalized Dunn obtained an adequate clustering performance in synthetic and real-life datasets. Notably, all the evaluated CVIs performed poorly in clustering the non-linearly separable data because of the assumptions about data distributions.
Metalearning of deep neural network (DNN) architectures and hyperparameters has become an increasingly important area of research. At the same time, network regularization has been recognized as a crucial dimension to effective training of DNNs. However, the role of metalearning in establishing effective regularization has not yet been fully explored. There is recent evidence that loss-function optimization could play this role, however it is computationally impractical as an outer loop to full training. This paper presents an algorithm called Evolutionary Population-Based Training (EPBT) that interleaves the training of a DNN's weights with the metalearning of loss functions. They are parameterized using multivariate Taylor expansions that EPBT can directly optimize. Such simultaneous adaptation of weights and loss functions can be deceptive, and therefore EPBT uses a quality-diversity heuristic called Novelty Pulsation as well as knowledge distillation to prevent overfitting during training. On the CIFAR-10 and SVHN image classification benchmarks, EPBT results in faster, more accurate learning. The discovered hyperparameters adapt to the training process and serve to regularize the learning task by discouraging overfitting to the labels. EPBT thus demonstrates a practical instantiation of regularization metalearning based on simultaneous training.
The XCS classifier system adaptively controls a rule-generality of a rule-condition through a rule-discovery process. However, there is no proof that the rule-generality can eventually converge to its optimum value even under some ideal assumptions. This paper conducts a convergence analysis of the rule-generality on the rule-discovery process with the ternary alphabet coding. Our analysis provides the first proof that an average rule-generality of rules in a population can converge to its optimum value under some assumptions. This proof can be used to mathematically conclude that the XCS framework has a natural pressure to explore rules toward optimum rules if XCS satisfies our derived conditions. In addition, our theoretical result returns a rough setting-up guideline for the maximum population size, the mutation rate, and the GA threshold, improving the convergence speed of the rule-generality and the XCS performance.
An effective action covering for multi-label learning classifier systems: a graph-theoretic approach
In Multi-label (ML) classification, each instance is associated with more than one class label. Incorporating the label correlations into the model is one of the increasingly studied areas in the last decade, mostly due to its potential in training more accurate predictive models and dealing with noisy/missing labels. Previously, multi-label learning classifier systems have been proposed that incorporate the high-order label correlations into the model through the label powerset (LP) technique. However, such a strategy cannot take advantage of the valuable statistical information in the label space to make more accurate inferences. Such information becomes even more crucial in ML image classification problems where the number of labels can be very large. In this paper, we propose a multi-label learning classifier system that leverages a structured representation for the labels through undirected graphs to utilize the label similarities when evolving rules. More specifically, we propose a novel scheme for covering classifier actions, as well as a method to calculate ML prediction arrays. The effectiveness of this method is demonstrated by experimenting on multiple benchmark datasets and comparing the results with multiple well-known ML classification algorithms.
In classification, when class overlap is intertwined with the issue of class imbalance, it is often challenging to discover useful patterns because of an ambiguous boundary between the majority class and the minority class. This becomes more difficult if the data is high-dimensional. To date, very few pieces of work have investigated how the class overlap issue can be effectively addressed or alleviated in classification with high-dimensional unbalanced data. In this paper, we propose a new genetic programming based method, which is able to automatically and directly detect borderline instances, in order to address the class overlap issue in classification with high-dimensional unbalanced data. In the proposed method, each individual has two trees to be trained together based on different classification rules. The proposed method is examined and compared with baseline methods on high-dimensional unbalanced datasets. Experimental results show that the proposed method achieves better classification performance than the baseline methods in almost all cases.
We put forward a novel learning methodology for ensembles of decision trees based on a genetic algorithm that is able to train a decision tree for maximizing both its accuracy and its robustness to adversarial perturbations. This learning algorithm internally leverages a complete formal verification technique for robustness properties of decision trees based on abstract interpretation, a well-known static program analysis technique. We implemented this genetic adversarial training algorithm in a tool called MetaSilvae and we experimentally evaluated it on some standard reference datasets used in adversarial training. The experimental results show that MetaSilvae is able to train robust models that compete with and often improve on the current state-of-the-art of adversarial training of decision trees while being much more compact and therefore interpretable and efficient tree models.
Automated machine learning (AutoML) strives for automatically constructing and configuring compositions of machine learning algorithms, called pipelines, with the goal to optimize a suitable performance measure on a concrete learning task. So far, most AutoML tools are focused on standard problem classes, such as classification and regression. In the field of predictive maintenance, especially the estimation of remaining useful lifetime (RUL), the task of AutoML becomes more complex. In particular, a good feature representation for multivariate sensor data is essential to achieve good performance. Due to the need for methods generating feature representations, the search space of candidate pipelines enlarges. Moreover, the runtime of a single pipeline increases substantially. In this paper, we tackle these problems by partitioning the search space into two sub-spaces, one for feature extraction methods and one for regression methods, and employ cooperative coevolution for searching a good combination. Thereby, we benefit from the fact that the generated feature representations can be cached, whence the evaluation of multiple regressors based on the same feature representation speeds up, allowing the evaluation of more candidate pipelines. Experimentally, we show that our coevolutionary strategy performs superior to the baselines.
Generative adversarial networks (GANs) exhibit training pathologies that can lead to convergence-related degenerative behaviors, whereas spatially-distributed, coevolutionary algorithms (CEAs) for GAN training, e.g. Lipizzaner, are empirically robust to them. The robustness arises from diversity that occurs by training populations of generators and discriminators in each cell of a toroidal grid. Communication, where signals in the form of parameters of the best GAN in a cell propagate in four directions: North, South, West and East, also plays a role, by communicating adaptations that are both new and fit. We propose Lipi-Ring, a distributed CEA like Lipizzaner, except that it uses a different spatial topology, i.e. a ring. Our central question is whether the different directionality of signal propagation (effectively migration to one or more neighbors on each side of a cell) meets or exceeds the performance quality and training efficiency of Lipizzaner. Experimental analysis on different datasets (i.e, MNIST, CelebA, and COVID-19 chest X-ray images) shows that there are no significant differences between the performances of the trained generative models by both methods. However, Lipi-Ring significantly reduces the computational time (14.2%... 41.2%). Thus, Lipi-Ring offers an alternative to Lipizzaner when the computational cost of training matters.
A systematic comparison study on hyperparameter optimisation of graph neural networks for molecular property prediction
Graph neural networks (GNNs) have been proposed for a wide range of graph-related learning tasks. In particular, in recent years, an increasing number of GNN systems were applied to predict molecular properties. However, a direct impediment is to select appropriate hyperparameters to achieve satisfactory performance with lower computational cost. Meanwhile, many molecular datasets are far smaller than many other datasets in typical deep learning applications. Most hyperparameter optimization (HPO) methods have not been explored in terms of their efficiencies on such small datasets in the molecular domain. In this paper, we conducted a theoretical analysis of common and specific features for two state-of-the-art and popular algorithms for HPO: TPE and CMA-ES, and we compared them with random search (RS), which is used as a baseline. Experimental studies are carried out on several benchmarks in MoleculeNet, from different perspectives to investigate the impact of RS, TPE, and CMA-ES on HPO of GNNs for molecular property prediction. In our experiments, we concluded that RS, TPE, and CMA-ES have their individual advantages in tackling different specific molecular problems. Finally, we believe our work will motivate further research on GNN as applied to molecular machine learning problems in chemistry and materials sciences.