GECCO '19: Proceedings of the Genetic and Evolutionary Computation Conference
SESSION: Genetic programming
We introduce a form of neutral Horizontal Gene Transfer (HGT) to Evolving Graphs by Graph Programming (EGGP). We also introduce the µ × λ evolutionary algorithm, in which µ parents each produce λ children who compete only with their parents. HGT events then copy the entire active component of one surviving parent into the inactive component of another parent, exchanging genetic information without reproduction. Experimental results from 14 symbolic regression benchmark problems show that introducing the µ × λ EA and HGT events improves the performance of EGGP. Comparisons with Genetic Programming and Cartesian Genetic Programming strongly favour our proposed approach.
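The µ × λ scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: fitness is minimised, `init`, `mutate`, and `fitness` are hypothetical user-supplied functions, and a child replaces its parent when it is no worse (this acceptance rule is an assumption). The HGT step is omitted because it depends on the EGGP graph representation, which the abstract does not specify.

```python
import random


def mu_x_lambda(init, mutate, fitness, mu=4, lam=2, generations=100, rng=None):
    """Sketch of a mu x lambda EA: every parent breeds lam children,
    and those children compete only with that parent (minimisation)."""
    rng = rng or random.Random(0)
    parents = [init(rng) for _ in range(mu)]
    scores = [fitness(p) for p in parents]
    for _ in range(generations):
        for i in range(mu):
            # All lam children are bred from the same, current parent i.
            children = [mutate(parents[i], rng) for _ in range(lam)]
            child_scores = [fitness(c) for c in children]
            j = min(range(lam), key=child_scores.__getitem__)
            # Assumed acceptance rule: the best child replaces its parent
            # if it is no worse, which permits neutral moves.
            if child_scores[j] <= scores[i]:
                parents[i], scores[i] = children[j], child_scores[j]
    best = min(range(mu), key=scores.__getitem__)
    return parents[best], scores[best]
```

Because each lineage only ever accepts children that are no worse than their parent, the best score in the population can never get worse from one generation to the next.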
In many applications of symbolic regression, domain knowledge constrains the space of admissible models by requiring them to have certain properties, like monotonicity, convexity, or symmetry. As only a handful of variants of genetic programming methods proposed to date can take such properties into account, we introduce a principled approach capable of synthesizing models that simultaneously match the provided training data (tests) and meet user-specified formal properties. To this end, we formalize the task of symbolic regression with formal constraints and present a range of formal properties that are common in practice. We also conduct a comparative experiment that confirms the feasibility of the proposed approach on a suite of realistic symbolic regression benchmarks extended with various formal properties. The study is summarized with discussion of results, properties of the method, and implications for symbolic regression.
Genetic improvement (GI) is a young field of research on the cusp of transforming software development. GI uses search to improve existing software. Researchers have already shown that GI can improve human-written code, ranging from program repair to optimising run-time, from reducing energy-consumption to the transplantation of new functionality. Much remains to be done. The cost of re-implementing GI to investigate new approaches is hindering progress. Therefore, we present Gin, an extensible and modifiable toolbox for GI experimentation, with a novel combination of features. Instantiated in Java and targeting the Java ecosystem, Gin automatically transforms, builds, and tests Java projects. Out of the box, Gin supports automated test-generation and source code profiling. We show, through examples and a case study, how Gin facilitates experimentation and will speed innovation in GI.
Batch tournament selection for genetic programming: the quality of lexicase, the speed of tournament
Lexicase selection achieves very good solution quality by introducing ordered test cases. However, the computational complexity of lexicase selection can prohibit its use in many applications. In this paper, we introduce Batch Tournament Selection (BTS), a hybrid of tournament and lexicase selection which is approximately one order of magnitude faster than lexicase selection while achieving a competitive quality of solutions. Tests on a number of regression datasets show that BTS compares well with lexicase selection in terms of mean absolute error while achieving a speed-up of up to 25 times. Surprisingly, BTS and lexicase selection show almost no difference in either diversity or performance. This reveals that batches and ordered test cases are completely different mechanisms which share the same general principle of fostering the specialization of individuals. This work introduces an efficient algorithm that sheds light on the main principles behind the success of lexicase, potentially opening up a new range of possibilities for algorithms to come.
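The abstract does not spell out the BTS procedure, so the following is only one plausible reading of "tournament selection over batches of test cases". It assumes that the test cases are shuffled and split into fixed-size batches, that each tournament is judged on a single batch, and that the contender with the lowest mean error on that batch wins. The function name and all parameters are hypothetical.

```python
import random


def batch_tournament_selection(errors, n_select, batch_size, tournament_size, rng=None):
    """errors[i][j] is the error of individual i on test case j.
    Returns the indices of the selected parents."""
    rng = rng or random.Random(0)
    n_individuals, n_cases = len(errors), len(errors[0])
    selected, batches = [], []
    while len(selected) < n_select:
        if not batches:
            # Assumption: re-shuffle and re-split the cases once all batches are used.
            cases = list(range(n_cases))
            rng.shuffle(cases)
            batches = [cases[k:k + batch_size] for k in range(0, n_cases, batch_size)]
        batch = batches.pop()
        contenders = rng.sample(range(n_individuals), tournament_size)
        # The winner is the contender with the lowest mean error on this batch only.
        winner = min(contenders, key=lambda i: sum(errors[i][j] for j in batch) / len(batch))
        selected.append(winner)
    return selected
```

Each selection event in this sketch costs one tournament over one batch, rather than a filtering pass over the whole pool for each case, which is where a speed advantage over lexicase selection would come from.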
Recently it has been proved that simple GP systems can efficiently evolve the conjunction of n variables if they are equipped with the minimal required components. In this paper, we make a considerable step forward by analysing the behaviour and performance of a GP system for evolving a Boolean function with unknown components, i.e. the target function may consist of both conjunctions and disjunctions. We rigorously prove that if the target function is the conjunction of n variables, then a GP system using the complete truth table to evaluate program quality evolves the exact target function in O(ℓ n log² n) iterations in expectation, where ℓ ≥ n is a limit on the size of any accepted tree. Additionally, we show that when a polynomial sample of possible inputs is used to evaluate solution quality, conjunctions with any polynomially small generalisation error can be evolved with probability 1 - O(log²(n)/n). To produce our results we introduce a super-multiplicative drift theorem that gives significantly stronger runtime bounds when the expected progress is only slightly super-linear in the distance from the optimum.
What's inside the black-box?: a genetic programming method for interpreting complex machine learning models
Interpreting state-of-the-art machine learning algorithms can be difficult. For example, why does a complex ensemble predict a particular class? Existing approaches to interpretable machine learning tend to be either local in their explanations, applicable only to a particular algorithm, or overly complex in their global explanations. In this work, we propose a global model extraction method which uses multi-objective genetic programming to construct accurate, simple and model-agnostic representations of complex black-box estimators. We found that the resulting representations are far simpler than those produced by existing approaches while providing comparable reconstructive performance. This is demonstrated on a range of datasets, by approximating the knowledge of complex black-box models, such as 200-layer neural networks and ensembles of 500 trees, with a single tree.
The study of semantics in Genetic Programming (GP) has increased dramatically in recent years because researchers tend to report a performance increase in GP when semantic diversity is promoted. However, the adoption of semantics in Evolutionary Multi-objective Optimisation (EMO), at large, and in Multi-objective GP (MOGP), in particular, has been very limited, and this paper intends to fill this research gap. We propose a mechanism wherein a semantic-based distance is used instead of the widely known crowding distance and is also used as an objective to be optimised. To this end, we use two well-known EMO algorithms: NSGA-II and SPEA2. Results on highly unbalanced binary classification tasks indicate that the proposed approach produces more and better results than the three other approaches used in this work, including the aforementioned canonical EMO algorithms.
Lexicase parent selection filters the population by considering one random training case at a time, eliminating any individuals with errors for the current case that are worse than the best error in the selection pool, until a single individual remains. This process often stops before considering all training cases, meaning that it will ignore the error values on any cases that were not yet considered. Lexicase selection can therefore select specialist individuals that have poor errors on some training cases, if they have very low errors on others and those cases come near the start of the random list of cases used for the parent selection event in question. We hypothesize here that selecting these specialists, which may have poor total error, plays an important role in lexicase selection's observed performance advantages over error-aggregating parent selection methods such as tournament selection, which select specialists much less frequently. We conduct experiments examining this hypothesis, and find that lexicase selection's performance and diversity maintenance degrade when we deprive it of the ability to select specialists. These findings help explain the improved performance of lexicase selection compared to tournament selection, and suggest that specialists help drive evolution under lexicase selection toward global solutions.
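The filtering procedure described above can be sketched as a single parent selection event. The error-matrix layout and the function name are illustrative assumptions, and remaining ties after all cases have been considered are broken at random.

```python
import random


def lexicase_select(errors, rng=None):
    """errors[i][j] is the error of individual i on training case j.
    Returns the index of one selected parent."""
    rng = rng or random.Random(0)
    pool = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # a fresh random case order for each selection event
    for case in cases:
        if len(pool) == 1:
            break  # stop early: the remaining cases are never looked at
        best = min(errors[i][case] for i in pool)
        # Keep only the individuals that match the best error on this case.
        pool = [i for i in pool if errors[i][case] == best]
    return rng.choice(pool)
```

With the error matrix `[[0, 10], [10, 0], [4, 4]]`, the third individual has the best total error (8 versus 10), yet it is eliminated by whichever case comes first, so this procedure can only ever return one of the two specialists.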
Programmers solve coding problems with the support of both programming and problem specific knowledge. They integrate this domain knowledge to reason by computational abstraction. Correct and readable code arises from sound abstractions and problem solving. We attempt to transfer insights from such human expertise to genetic programming (GP) for solving automatic program synthesis. We draw upon manual and non-GP Artificial Intelligence methods to extract knowledge from synthesis problem definitions to guide the construction of the grammar that Grammatical Evolution uses and to supplement its fitness function. We examine the impact of using such knowledge on 21 problems from the GP program synthesis benchmark suite. Additionally, we investigate the compounding impact of this knowledge and novelty search. The resulting approaches exhibit improvements in accuracy on a majority of problems in the field's benchmark suite of program synthesis problems.
Lexicase selection and novelty search, two parent selection methods used in evolutionary computation, emphasize exploring widely in the search space more than traditional methods such as tournament selection. However, lexicase selection is not explicitly driven to select for novelty in the population, and novelty search suffers from lack of direction toward a goal, especially in unconstrained, high-dimensional spaces. We combine the strengths of lexicase selection and novelty search by creating a novelty score for each test case, and adding those novelty scores to the normal error values used in lexicase selection. We use this new novelty-lexicase selection to solve automatic program synthesis problems, and find it significantly outperforms both novelty search and lexicase selection. Additionally, we find that novelty search has very little success in the problem domain of program synthesis. We explore the effects of each of these methods on population diversity and long-term problem solving performance, and give evidence to support the hypothesis that novelty-lexicase selection resists converging to local optima better than lexicase selection.
Multidimensional genetic programming represents candidate solutions as sets of programs, and thereby provides an interesting framework for exploiting building block identification. Towards this goal, we investigate the use of machine learning as a way to bias which components of programs are promoted, and propose two semantic operators to choose where useful building blocks are placed during crossover. A forward stagewise crossover operator we propose leads to significant improvements on a set of regression problems, and produces state-of-the-art results in a large benchmark study. We discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. Finally, we look at the collinearity and complexity of the data representations that result from these architectures, with a view towards disentangling factors of variation in application.
Teaching GP to program like a human software developer: using perplexity pressure to guide program synthesis approaches
Program synthesis is one of the relevant applications of GP with a strong impact on new fields such as genetic improvement. In order for synthesized code to be used in real-world software, the structure of the programs created by GP must be maintainable. We can teach GP how real-world software is built by learning the relevant properties of mined human-coded software - which can be easily accessed through repository hosting services such as GitHub. So combining program synthesis and repository mining is a logical step. In this paper, we analyze if GP can write programs with properties similar to code produced by human software developers. First, we compare the structure of functions generated by different GP initialization methods to a mined corpus containing real-world software. The results show that the studied GP initialization methods produce a totally different combination of programming language elements in comparison to real-world software. Second, we propose perplexity pressure and analyze how its use changes the properties of code produced by GP. The results are very promising and show that we can guide the search to the desired program structure. Thus, we recommend using perplexity pressure as it can be easily integrated in various search-based algorithms.
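The abstract does not define how perplexity pressure is computed. As an illustration of the underlying quantity only, the sketch below measures the perplexity of a token sequence under a bigram model estimated from a corpus. The bigram order, the add-one smoothing, and every function name are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """corpus: list of token sequences. Returns (bigram counts, context counts, vocabulary)."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in corpus:
        tokens = ["<s>"] + list(seq)  # "<s>" marks the start of a sequence
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def perplexity(seq, model):
    """Perplexity of a non-empty token sequence under an add-one smoothed bigram model."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    bigrams, contexts, vocab = model
    tokens = ["<s>"] + list(seq)
    v = len(vocab) + 1  # one extra slot reserves probability mass for unseen tokens
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        log_prob += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / (len(tokens) - 1))
```

A sequence that resembles the corpus receives a low perplexity, so a search that penalises high perplexity is nudged toward token combinations that occur in the mined code.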
In linear variants of Genetic Programming (GP) like linear genetic programming (LGP), structural introns can emerge, which are nodes that are not connected to the final output and do not contribute to the output of a program. There are claims that such non-effective code is beneficial for search, as it can store relevant and important evolved information that can be reactivated in later search phases. Furthermore, introns can increase diversity, which leads to higher GP performance. This paper studies the role of non-effective code by comparing the performance of LGP variants that deal differently with non-effective code for standard symbolic regression problems. As we find no decrease in performance when removing or randomizing structural introns in each generation of an LGP run, we have to reject the hypothesis that structural introns increase LGP performance by preserving meaningful sub-structures. Our results indicate that there is no important information stored in structural introns. In contrast, we find evidence that the increase of diversity due to structural introns positively affects LGP performance.
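Structural introns in a register-based linear program can be found with a single backward pass over the instructions. The sketch below assumes a hypothetical instruction format `(dest, op, src_a, src_b)` in which all operands are register indices; the paper's actual representation may differ.

```python
def effective_instructions(program, output_registers):
    """program: list of (dest, op, src_a, src_b) tuples with register indices.
    Returns a list of booleans; False marks a structural intron."""
    needed = set(output_registers)  # registers whose current value still matters
    effective = [False] * len(program)
    # Walk backwards from the end of the program to the start.
    for idx in range(len(program) - 1, -1, -1):
        dest, _op, src_a, src_b = program[idx]
        if dest in needed:
            effective[idx] = True
            # This instruction defines dest, so earlier writes to dest are dead
            # unless dest is also one of its own source registers.
            needed.discard(dest)
            needed.update((src_a, src_b))
    return effective


# Example program with output register r0:
#   r1 = r0 + r0 ; r2 = r1 * r0 ; r3 = r2 - r1 ; r0 = r2 + r1
example = [(1, "+", 0, 0), (2, "*", 1, 0), (3, "-", 2, 1), (0, "+", 2, 1)]
flags = effective_instructions(example, output_registers=[0])
```

In the example, the third instruction writes to `r3`, which never reaches the output register, so it is the only instruction flagged as an intron.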
Linear scaling with and within semantic backpropagation-based genetic programming for symbolic regression
Semantic Backpropagation (SB) is a recent technique that promotes effective variation in tree-based genetic programming. The basic idea of SB is to provide information on what output is desirable for a specified tree node, by propagating the desired root-node output back to the specified node using inversions of functions encountered along the way. Variation operators then replace the subtree located at the specified node with a tree for which the output is closest to the desired output, by searching in a pre-computed library. In this paper, we propose two contributions to enhance SB specifically for symbolic regression, by incorporating the principles of Keijzer's Linear Scaling (LS). In particular, we show how SB can be used in synergy with the scaled mean squared error, and we show how LS can be adopted within library search. We test our adaptations using the well-known variation operator Random Desired Operator (RDO), comparing to its baseline implementation, and to traditional crossover and mutation. Our experimental results on real-world datasets show that SB enhanced with LS substantially improves the performance of RDO, resulting in overall the best performance among all tested GP algorithms.
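Linear scaling fits an intercept and a slope to a program's raw outputs by least squares before the error is measured, so the search only has to find the right shape of the function. The sketch below shows the scaled mean squared error on its own, outside any SB or library-search machinery. The function names are illustrative, and setting the slope to zero for constant outputs is an assumed convention.

```python
def linear_scaling(outputs, targets):
    """Least-squares intercept a and slope b minimising sum((t - (a + b * o)) ** 2)."""
    n = len(targets)
    mean_o = sum(outputs) / n
    mean_t = sum(targets) / n
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    if var_o == 0:
        slope = 0.0  # assumed convention: constant outputs are scaled to the target mean
    else:
        cov = sum((t - mean_t) * (o - mean_o) for o, t in zip(outputs, targets))
        slope = cov / var_o
    intercept = mean_t - slope * mean_o
    return intercept, slope


def scaled_mse(outputs, targets):
    """Mean squared error after optimally scaling and shifting the outputs."""
    a, b = linear_scaling(outputs, targets)
    return sum((t - (a + b * o)) ** 2 for o, t in zip(outputs, targets)) / len(targets)
```

For example, outputs `[1, 2, 3, 4]` against targets `[5, 7, 9, 11]` give an intercept of 3 and a slope of 2, so the scaled error is zero even though the raw outputs are far from the targets.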
The Uncertain Capacitated Arc Routing Problem (UCARP) is an important problem with many real-world applications. A major challenge in UCARP is to handle the uncertain environment effectively and reduce the recourse cost upon route failures. Genetic Programming Hyper-heuristic (GPHH) has been successfully applied to automatically evolve effective routing policies to make real-time decisions in the routing process. However, most existing studies obtain a single complex routing policy which is hard to interpret. In this paper, we aim to evolve an ensemble of routing policies that are simpler and more interpretable than a single complex policy. By considering the two critical properties of ensemble learning, i.e., the effectiveness of each ensemble element and the diversity between them, we propose two novel ensemble GP approaches, namely DivBaggingGP and DivNichGP. DivBaggingGP evolves the ensemble elements sequentially, while DivNichGP evolves them simultaneously. The experimental results show that both DivBaggingGP and DivNichGP can obtain more interpretable routing policies than the single complex routing policy. DivNichGP can achieve better test performance than DivBaggingGP as well as the single routing policy evolved by the current state-of-the-art GPHH. This demonstrates the effectiveness of evolving both effective and interpretable routing policies using ensemble learning.