# GECCO '18- Proceedings of the Genetic and Evolutionary Computation Conference

Full Citation in the ACM Digital Library## SESSION: Evolutionary machine learning

### Combating catastrophic forgetting with developmental compression

Generally intelligent agents exhibit successful behavior across problems in several settings. Endemic in approaches to realize such intelligence in machines is catastrophic forgetting: sequential learning corrupts knowledge obtained earlier in the sequence, or tasks antagonistically compete for system resources. Methods for obviating catastrophic forgetting have sought to identify and preserve features of the system necessary to solve one problem when learning to solve another, or to enforce modularity such that minimally overlapping sub-functions contain task specific knowledge. While successful, both approaches scale poorly because they require larger architectures as the number of training instances grows, causing different parts of the system to specialize for separate subsets of the data. Here we present a method for addressing catastrophic forgetting called developmental compression. It exploits the mild impacts of developmental mutations to lessen adverse changes to previously-evolved capabilities and 'compresses' specialized neural networks into a generalized one. In the absence of domain knowledge, developmental compression produces systems that avoid overt specialization, alleviating the need to engineer a bespoke system for every task permutation and suggesting better scalability than existing approaches. We validate this method on a robot control problem and hope to extend this approach to other machine learning domains in the future.

### Optimizing floating centroids method neural network classifier using dynamic multilayer particle swarm optimization

The floating centroids method (FCM) effectively enhances the performance of neural network classifiers. However, the problem of optimizing the neural network continues to restrict the further improvement of FCM. Traditional particle swarm optimization algorithm (PSO) sometimes converges to a local optimal solution in multimodal landscape, particularly for optimizing neural networks. Therefore, the dynamic multilayer PSO (DMLPSO) is proposed to optimize the neural network for improving the performance of FCM. DMLPSO adopts the basic concepts of multi-layer PSO to introduce a dynamic reorganizing strategy, which achieves that valuable information dynamically interacts among different subswarms. This strategy increases population diversity to promote the performance of DMLPSO when optimizing multimodal functions. Experimental results indicate that the proposed DMLPSO enables FCM to obtain improved solutions in many data sets.

### Autostacker: a compositional evolutionary learning system

In this work, an automatic machine learning (AutoML) modeling architecture called Autostacker is introduced. Autostacker combines an innovative hierarchical stacking architecture and an evolutionary algorithm (EA) to perform efficient parameter search without the need for prior domain knowledge about the data or feature preprocessing. Using EA, Autostacker quickly evolves candidate pipelines with high predictive accuracy. These pipelines can be used in their given form, or serve as a starting point for further augmentation and refinement by human experts. Autostacker finds innovative machine learning model combinations and structures, rather than selecting a single model and optimizing its hyperparameters. When its performance on fifteen datasets is compared with that of other AutoML systems, Autostacker produces superior or competitive results in terms of both test accuracy and time cost.

### Neuroevolution of hierarchical reservoir computers

Reservoir Computers such as Echo State Networks (ESN) represent an alternative recurrent neural network model that provides fast training and state-of-the-art performances for supervised learning problems. Classic ESNs suffer from two limitations; hyperparameter selection and learning of multiple temporal and spatial scales. To learn multiple scales, hierarchies are proposed, and to overcome manual tuning, optimisation is used. However, the autonomous design of hierarchies and optimisation of multi-reservoir systems has not yet been demonstrated. In this work, an evolvable architecture is proposed called *Reservoir-of-Reservoirs* (RoR) where sub-networks of neurons (ESNs) are interconnected to form the reservoir. The design of each sub-network, hyperparameters, global connectivity and its hierarchical structure are evolved using a genetic algorithm (GA) called the *Microbial* GA.

To evaluate the RoR and microbial GA, single networks and other hierarchical ESNs are evolved. The results show the microbial GA can dramatically increase the performance of single networks compared to other optimisation techniques. The evolutionary process also leads to competitive results with RoRs and other hierarchical ESNs, despite having fewer connections than a single network. In the final section, it is revealed that the RoR architecture may learn generalised features other architectures cannot, offering improvements in network generalisation to other tasks.

### Evolving bagging ensembles using a spatially-structured niching method

This paper presents a novel approach to constructing ensembles for prediction using a bootstrap aggregation (bagging) model. The proposed method uses analogies from ecological modelling to view bootstrap samples as a local adaptation resource in a spatially structured population. Through local competition and breeding, adaptation towards specific bootstrap samples takes place and the resulting ensemble emerges from a single global population in a single run. This makes better use of available computational resources, and negates the need for multiple runs typically required by a bagging approach. We examine the robustness of the method with respect to the number of bootstrap samples in the ensemble, and demonstrate that the resulting method also has a positive effect on bloat control. Finally, the effectiveness of the method relative to existing bagging approaches such as random forests is explored and encouraging performance is demonstrated on a range of benchmark problems.

### Online meta-learning by parallel algorithm competition

The efficiency of reinforcement learning algorithms depends critically on a few meta-parameters that modulate the learning updates and the trade-off between exploration and exploitation. The adaptation of the meta-parameters is an open question, which arguably has become a more important issue recently with the success of deep reinforcement learning. The long learning times in domains such as Atari 2600 video games makes it not feasible to perform comprehensive searches of appropriate meta-parameter values. In this study, we propose the Online Meta-learning by Parallel Algorithm Competition (OMPAC) method, which is a novel Lamarckian evolutionary approach to online meta-parameter adaptation. The population consists of several instances of a reinforcement learning algorithm which are run in parallel with small differences in initial meta-parameter values. After a fixed number of learning episodes, the instances are selected based on their performance on the task at hand, i.e., the fitness. Before continuing the learning, Gaussian noise is added to the meta-parameters with a predefined probability. We validate the OMPAC method by improving the state-of-the-art results in stochastic SZ-Tetris and in 10x10 Tetris by 31% and 84%, respectively, and by improving the learning speed and performance for deep Sarsa(λ) agents in the Atari 2600 domain.

### Evolved GANs for generating pareto set approximations

In machine learning, generative models are used to create data samples that mimic the characteristics of the training data. Generative adversarial networks (GANs) are neural-network based generator models that have shown their capacity to produce realistic samples in different domains. In this paper we propose a neuro-evolutionary approach for evolving deep GAN architectures together with the loss function and generator-discriminator synchronization parameters. We also propose the problem of Pareto set (PS) approximation as a suitable benchmark to evaluate the quality of neural-network based generators in terms of the accuracy of the solutions they generate. The covering of the Pareto front (PF) by the generated solutions is used as an indicator of the mode-collapsing behavior of GANs. We show that it is possible to evolve GANs that generate good PS approximations. Our method scales to up to 784 variables and that it is capable to create architecture transferable across dimensions and functions.

### Evolutionary expectation maximization

We establish a link between evolutionary algorithms (EAs) and learning of probabilistic generative models with binary hidden variables. Learning is formulated as approximate maximum likelihood optimization using variational expectation maximization. When choosing truncated posteriors as variational distributions, the variational parameters take the form of sets of latent states. By (A) identifying these sets with populations of genomes, and by (B) defining the fitness of individuals as the joint probability of the chosen generative model, the sets of latent states can be optimized using EAs. We obtain scalable learning algorithms that effectively improve the tractable free energy objective of truncated posteriors. While this novel approach is applicable to any model with binary latents and tractable joint probability, as a proof of concept, we here apply it to the optimization of parameters of noisy-OR Bayes Nets (modeling binary data) and Binary Sparse Coding (modelling continuous data). We find that the data likelihood is efficiently improved by employing genetic algorithms with point mutations and single-point cross-over as EAs. In general we believe that, with the novel link established here, standard as well as recent results in evolutionary optimization can be leveraged to address the difficult problem of parameter optimization in generative models.

### ES is more than just a traditional finite-difference approximator

An evolution strategy (ES) variant based on a simplification of a natural evolution strategy recently attracted attention because it performs surprisingly well in challenging deep reinforcement learning domains. It searches for neural network parameters by generating perturbations to the current set of parameters, checking their performance, and moving in the aggregate direction of higher reward. Because it resembles a traditional finite-difference approximation of the reward gradient, it can naturally be confused with one. However, this ES optimizes for a different gradient than just reward: It optimizes for the average reward of the entire population, thereby seeking parameters that are robust to perturbation. This difference can channel ES into distinct areas of the search space relative to gradient descent, and also consequently to networks with distinct properties. This unique robustness-seeking property, and its consequences for optimization, are demonstrated in several domains. They include humanoid locomotion, where networks from policy gradient-based reinforcement learning are significantly less robust to parameter perturbation than ES-based policies solving the same task. While the implications of such robustness and robustness-seeking remain open to further study this work's main contribution is to highlight such differences and their potential importance.

### Automatically evolving difficult benchmark feature selection datasets with genetic programming

There has been a wealth of feature selection algorithms proposed in recent years, each of which claims superior performance in turn. A wide range of datasets have been used to compare these algorithms, each with different characteristics and quantities of redundant and noisy features. Hence, it is very difficult to comprehensively and fairly compare these feature selection methods in order to find which are most robust and effective. In this work, we examine using Genetic Programming to automatically synthesise redundant features for augmenting existing datasets in order to more scientifically test feature selection performance. We develop a method for producing complex multi-variate redundancies, and present a novel and intuitive approach to ensuring a range of redundancy relationships are automatically created. The application of these augmented datasets to well-established feature selection algorithms shows a number of interesting and useful results and suggests promising directions for future research in this area.

### Evolutionary architecture search for deep multitask networks

Multitask learning, i.e. learning several tasks at once with the same neural network, can improve performance in each of the tasks. Designing deep neural network architectures for multitask learning is a challenge: There are many ways to tie the tasks together, and the design choices matter. The size and complexity of this problem exceeds human design ability, making it a compelling domain for evolutionary optimization. Using the existing state of the art soft ordering architecture as the starting point, methods for evolving the modules of this architecture and for evolving the overall topology or routing between modules are evaluated in this paper. A synergetic approach of evolving custom routings with evolved, shared modules for each task is found to be very powerful, significantly improving the state of the art in the Omniglot multitask, multialphabet character recognition domain. This result demonstrates how evolution can be instrumental in advancing deep neural network and complex system design in general.

### Divide and conquer: neuroevolution for multiclass classification

Neuroevolution is a powerful and general technique for evolving the structure and weights of artificial neural networks. Though neuroevolutionary approaches such as NeuroEvolution of Augmenting Topologies (NEAT) have been successfully applied to various problems including classification, regression, and reinforcement learning problems, little work has explored application of these techniques to larger-scale multiclass classification problems. In this paper, NEAT is evaluated in several multiclass classification problems, and then extended via two ensemble approaches: One-vs-All and One-vs-One. These approaches decompose multiclass classification problems into a set of binary classification problems, in which each binary problem is solved by an instance of NEAT. These ensemble models exhibit reduced variance and increasingly superior accuracy as the number of classes increases. Additionally, higher accuracy is achieved early in training, even when artificially constrained for the sake of fair comparison with standard NEAT. However, because the approach can be trivially distributed, it can be applied quickly at large scale to solve real problems. In fact, these approaches are incorporated into Darwin^{TM}, an enterprise automatic machine learning solution that also incorporates various other algorithmic enhancements to NEAT. The resulting complete system has proven robust to a wide variety of client datasets.

### Theoretical adaptation of multiple rule-generation in XCS

Most versions of the XCS Classifier System have been designed to evolve only two rules for each rule discovery invocation, which restricts the search capacity. A difficulty behind generating multiple rules each time is the increase in the probability of deleting immature rules, which conflicts with the requirement that parent rules be sufficiently updated so that fitness represents worth. Thus the aim of this paper is to argue how XCS determines when rules can be deleted safely. The objectives are to certainly identify inaccurate rules and then to maximize how many rules XCS can generate. The proposed method enables adaptation of rule-generation that maximizes the number of generated rules, under the assumption that the reliably inaccurate rules can be replaced with new rules. Experiments show our modification strongly improves the XCS performance on large scale problems, since it can take advantage of multi-point search more efficiently. For example, on the 135-bit multiplexer problem, XCS with our modification requires 1.57 million less training inputs compared with the standard XCS while utilizing the same number of final rules.

### NEAT for large-scale reinforcement learning through evolutionary feature learning and policy gradient search

NeuroEvolution of Augmenting Topology (NEAT) is one of the most successful algorithms for solving traditional reinforcement learning (RL) tasks such as pole-balancing. However, the algorithm faces serious challenges while tackling problems with large state spaces, particularly the Atari game playing tasks. This is due to the major flaw that NEAT aims at evolving a single neural network (NN) that must be able to simultaneously extract high-level state features and select action outputs. However such complicated NNs cannot be easily evolved directly through NEAT. To address this issue, we propose a new reinforcement learning scheme based on NEAT with two key technical advancements: (1) a new three-stage learning scheme is introduced to clearly separate feature learning and policy learning to allow effective knowledge sharing and learning across multiple agents; (2) various policy gradient search algorithms can be seamlessly integrated with NEAT for training policy networks with deep structures to achieve effective and sample efficient RL. Experiments on several Atari games confirm that our new learning scheme can be more effective and has higher sample efficiency than NEAT and three state-of-the-art algorithms from the most recent RL literature.

### A genetic algorithm for finding an optimal curing strategy for epidemic spreading in weighted networks

Contact networks have been recognized to have a central role in the dynamic behavior of spreading processes. The availability of cost-optimal curing strategies, able to control the epidemic propagation, are of primary importance for the design of efficient treatments reducing the number of infected individuals and the extinction time of the infection. In this paper, we investigate the use of Genetic Algorithms for solving the problem of finding an optimal curing strategy in a network where a virus spreads following the Susceptible-Infected-Susceptible (SIS) epidemic model. Exploiting the N-Intertwined Mean-Field Approximation (NIMFA) of the SIS spreading process, we propose a constrained genetic algorithm which determines specific curing rates to each node composing the network, in order to minimize the total curing cost, while suppressing the epidemic. Experiments on both synthetic and real-world networks show that the approach finds solutions whose curing cost is lower than that obtained by a classical baseline method.

### Memetic evolution of deep neural networks

Deep neural networks (DNNs) have proven to be effective at solving challenging problems, but their success relies on finding a good architecture to fit the task. Designing a DNN requires expert knowledge and a lot of trial and error, especially as the difficulty of the problem grows. This paper proposes a fully automatic method with the goal of optimizing DNN topologies through memetic evolution. By recasting the mutation step as a series of progressively refined educated local-search moves, this method achieves results comparable to best human designs. Our extensive experimental study showed that the proposed memetic algorithm supports building a real-world solution for segmenting medical images, it exhibits very promising results over a challenging CIFAR-10 benchmark, and works very fast. Given the ever growing availability of data, our memetic algorithm is a very promising avenue for hands-free DNN architecture design to tackle emerging classification tasks.

### Cooperative multi-objective evolutionary support vector machines for multiclass problems

In recent years, evolutionary algorithms have been found to be effective and efficient techniques to train support vector machines (SVMs) for binary classification problems while multiclass problems have been neglected. This paper proposes CMOE-SVM: Cooperative Multi-Objective Evolutionary SVMs for multiclass problems. CMOE-SVM enables SVMs to handle multiclass problems via co-evolutionary optimization, by breaking down the original *M*-class problem into *M* simpler ones, which are optimized simultaneously in a cooperative manner. Furthermore, CMOE-SVM can explicitly maximize the margin and reduce the training error (the two components of the SVM optimization), by means of multi-objective optimization. Through a comprehensive experimental evaluation using a suite of benchmark datasets, we validate the performance of CMOE-SVM. The experimental results, supported by statistical tests, give evidence of the effectiveness of the proposed approach for solving multiclass classification problems.

### Towards an adaptive encoding for evolutionary data clustering

A key consideration in developing optimization approaches for data clustering is choice of a suitable encoding. Existing encodings strike different trade-offs between model and search complexity, limiting the applicability to data sets with particular properties or to problems of moderate size. Recent research has introduced an additional hyperparameter to directly govern the encoding granularity in the multi-objective clustering algorithm MOCK. Here, we investigate adapting this important hyperparameter during run-time. In particular, we consider a number of different trigger mechanisms to control the timing of changes to this hyperparameter and strategies to rapidly explore the newly "opened" search space resulting from this change. Experimental results illustrate distinct performance differences between the approaches tested, which can be explained in light of the relative importance of initialization, crossover and mutation in MOCK. The most successful strategies meet the clustering performance achieved for an optimal (a priori) setting of the hyperparameter, at a ~40% reduction of computational expense.

### CovSel: A new approach for ensemble selection applied to symbolic regression problems

Ensemble methods combine the predictions of a set of models to reach a better prediction quality compared to a single model's prediction. The ensemble process consists of three steps: 1) the generation phase where the models are created, 2) the selection phase where a set of possible ensembles is composed and one is selected by a selection method, 3) the fusion phase where the individual models' predictions of the selected ensemble are combined to an ensemble's estimate. This paper proposes CovSel, a selection approach for regression problems that ranks ensembles based on the coverage of adequately estimated training points and selects the ensemble with the highest coverage to be used in the fusion phase. An ensemble covers a training point if at least one of its models produces an adequate prediction for this training point. The more training points are covered this way, the higher is the ensemble's coverage. The selection of the "right" ensemble has a large impact on the final prediction. Results for two symbolic regression problems show that CovSel improves the predictions for various state-of-the-art fusion methods for ensembles composed of independently evolved GP models and also beats approaches based on single GP models.

### What about interpolation?: a radial basis function approach to classifier prediction modeling in XCSF

Learning Classifier Systems (LCS) have been strongly investigated in the context of regression tasks and great successes have been achieved by applying the function approximating Extended Classifier System (XCSF) endowed with sophisticated prediction models. In this paper, a novel approach to model a classifier's payoff prediction is proposed. Radial Basis Function (RBF) interpolation is utilized as a new means to capture the underlying function surface complexity. We pose the hypothesis that by the use of a more flexible RBF-based classifier prediction, that alleviates the a priori bias injected via choosing the degree of a polynomial approximation, the classifiers can evolve toward a higher generality by maintaining at least a competitive level of performance compared to the current and probably mostly used state of the art approach - polynomial approximation in combination with the Recursive Least Squares (RLS) technique for incremental coefficient optimization. The presented experimental results underpin our assumptions by revealing that the RBF-based classifier prediction outperforms the *n*-th order polynomial approximation on several test functions of varying complexity. Additionally, results of experiments with various degrees of noise will be reported to touch upon the proposed approach's applicability in real world situations.

### Efficient sample reuse in policy search by multiple importance sampling

Policy search such as reinforcement learning and evolutionary computation is a framework for finding an optimal policy of control problems, but it usually requires a huge number of samples. Importance sampling is a common tool to use samples drawn from a proposal distribution different from the targeted one, and it is widely used by the policy search methods to update the policy from a set of datasets that are collected by previous sampling distributions. However, the proposal distribution is created by a mixture of previous distributions with fixed mixing weights in most of previous studies, and it is often numerically unstable. To overcome this problem, we propose the method of adaptive multiple importance sampling that optimizes the mixing coefficients to minimize the variance of the importance sampling estimator while utilizing as many samples as possible. We apply the proposed method to the five policy search methods such as PGPE, PoWER, CMA-ES, REPS, and NES, and their algorithms are evaluated by some benchmark control tasks. Experimental results show that all the five methods improve sample efficiency. In addition, we show that optimizing the mixing weights achieves stable learning.

### Attribute tracking: strategies towards improved detection and characterization of complex associations

The detection, modeling and characterization of complex patterns of association in bioinformatics has focused on feature interactions and, more recently, instance-subgroup specific associations (e.g. genetic heterogeneity). Previously, attribute tracking was proposed as an instance-linked memory approach, leveraging the incremental learning of learning classifier systems (LCSs) to track which features were most useful in making class predictions within individual instances. These 'attribute tracking' signatures could later be used to characterize patterns of association in the data. While effective, true underlying patterns remain difficult to characterize in noisy problems, and the original approach places equal weight on tracked feature scores obtained early as well as late in learning. In this work we investigate alternative strategies for attribute tracking scoring, including the adoption of a time recency update scheme taken from reinforcement learning, to gain insight into how to optimize this approach to improve modeling performance and downstream pattern interpretability. We report mixed results over a variety of performance metrics that point to promising future directions for building effective building blocks and improving model interpretability.

### Ensembles of evolved nested dichotomies for classification

In multinomial classification, reduction techniques are commonly used to decompose the original learning problem into several simpler problems. For example, by recursively bisecting the original set of classes, so-called nested dichotomies define a set of binary classification problems that are organized in the structure of a binary tree. In contrast to the existing one-shot heuristics for constructing nested dichotomies and motivated by recent work on algorithm configuration, we propose a genetic algorithm for optimizing the structure of such dichotomies. A key component of this approach is the proposed genetic representation that facilitates the application of standard genetic operators, while still supporting the exchange of partial solutions under recombination. We evaluate the approach in an extensive experimental study, showing that it yields classifiers with superior generalization performance.

### Limited evaluation cooperative co-evolutionary differential evolution for large-scale neuroevolution

Many real-world control and classification tasks involve a large number of features. When artificial neural networks (ANNs) are used for modeling these tasks, the network architectures tend to be large. Neuroevolution is an effective approach for optimizing ANNs; however, there are two bottlenecks that make their application challenging in case of high-dimensional networks using direct encoding. First, classic evolutionary algorithms tend not to scale well for searching large parameter spaces; second, the network evaluation over a large number of training instances is in general time-consuming. In this work, we propose an approach called the *Limited Evaluation Cooperative Co-evolutionary Differential Evolution algorithm (LECCDE)* to optimize high-dimensional ANNs.

The proposed method aims to optimize the pre-synaptic weights of each post-synaptic neuron in different subpopulations using a Cooperative Co-evolutionary Differential Evolution algorithm, and employs a limited evaluation scheme where fitness evaluation is performed on a relatively small number of training instances based on fitness inheritance. We test LECCDE on three datasets with various sizes, and our results show that cooperative co-evolution significantly improves the test error comparing to standard Differential Evolution, while the limited evaluation scheme facilitates a significant reduction in computing time.

### Evolutionary feature subspaces generation for ensemble classification

Ensemble learning is a powerful machine learning paradigm which leverages a collection of diverse base learners to achieve better prediction performance than that could be achieved by any individual base learner. This work proposes an evolutionary feature subspaces generation based ensemble learning framework, which formulates the tasks of searching for the most suitable feature subspace for each base learner into a multi-task optimization problem and solve it via an evolutionary multi-task optimizer. Multiple such problems which correspond to different base learners are solved simultaneously via an evolutionary multi-task feature selection algorithm such that solving one problem may help solve some other problems via implicit knowledge transfer. The quality of thus generated feature subspaces is supposed to outperform those obtained by individually seeking the optimal feature subspace for each base learner. We implement the proposed framework by using SVM, KNN, and decision tree as the base learners, proposing a multi-task binary particle swarm optimization algorithm for evolutionary multi-task feature selection, and utilizing the major voting scheme to combine the outputs of the base learners. Experiments on several UCI datasets demonstrate the effectiveness of the proposed method.