From GPBenchmarks

Here, we classify some common problems according to domain, difficulty, and/or purpose.

**This collection of problems does NOT constitute a draft or proposed benchmark suite. It is an incomplete list of problems which have been used in previous work.**

**I've often dreamed of writing a paper "Death to Quartic Polynomial" because it's such a waste of time, giving worse-than-zero information on GP performance for anything real** -- Trent McConaghy, private communication.

Floating-point symbolic regression, polynomials - quartic, quintic, single trigonometric/exponential functions, e.g. from Koza-1, Binomial-N, etc.

- Sun-spot prediction. Koza-I.

- Mackey-Glass chaotic time-series. Langdon and Banzhaf 2005.

Natural Computing, Volume 7, Number 4, 589-613, DOI: 10.1007/s11047-007-9038-8

Key issues include:

- Range (a problem which is easy on [-1, 1] may be very difficult on [-10, 10]);

- The use of transcendental functions such as sin and exp in the target;

- The choice of functions available to evolution, ie alphabet;

- Interpolation/extrapolation.

Interpolation (relatively easy), extrapolation of existing trends (much more difficult), and extrapolation to new trends (eg the plateau of a logistic function -- impossible?).

- Dimensionality

- Sampling

The exact form of the target function, as opposed to a close approximation, may depend on having sufficient sampled data, which in turn depends on dimensionality. For low numbers of dimensions, GP can hope to find the solution exactly. For larger numbers of dimensions, close approximation may be possible even if exact matching is not. For very large numbers of dimensions, it may be impossible to do better than determining which dimensions are of relevance.

[Semantically-based Crossover in Genetic Programming: Application to Real-valued Symbolic Regression] Nguyen Quang Uy, Nguyen Xuan Hoai, Michael O’Neill, R.I. McKay, Edgar Galvan-Lopez, GPEM 2011.

"**in current state-of-the-art symbolic regression engines, accuracy is poor**"

Accuracy in Symbolic Regression -- GPTP 2011

"Our testing regimen uses only statistical best practices out-of-sample testing techniques. We test each of the test cases on matrices of 10000 rows by 1 to 5 columns with no noise. For each test a training matrix is filled with random numbers between -50 and +50. The test case target expressions are limited to one basis function whose maximum depth is three grammar nodes."

Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming Ekaterina J. Vladislavleva, Member, IEEE, Guido F. Smits, Member, IEEE, and Dick den Hertog. This includes some extrapolative versions of Keijzer's interpolative problems. See below for method of sampling training and testing data. Note that the final problem, "Tower", requires input data which is not available.

F(x,y) = 1/(1+power(x,-4)) + 1/(1+pow(y,-4)) ie:

Range: –5 <=x <=5 and -5<=y<=5, normally with test data 0.4 apart, which gives a total of 676 data points. The shape of the graph between these points is quite smooth. This is a hard problem for standard GP -- none of the papers below have been able to get GP to solve this (using a Koza style 676 hits predicate). This is also a contrast to several of Korns' hard problems, above, which rely on high-frequency sinusoids (for example) for difficulty.

The solution is possible using a GP style system (or GE system) with spatial co-evolution or ALPS (age-layered planes) (Hornby).

References:

- Pagie and Hogeweg 1997: Ludo Pagie, Paulien Hogeweg: Evolutionary Consequences of Coevolving Targets. Evolutionary Computation 5(4): 401-418 (1997).

- Melanie Mitchell (2006) Co evolutionary learning with spatially distributed populations. In G.Y. Yen and D.B. Fogel (editors) Computational Intelligence: Principles and Practice. New York: IEEE

- Williams N.L and Mitchell,M (2005): investigating the success of spatial co evolutionary learning. In H.G. Beyer et al. (editors) GECCO 2005, New York: ACM Press 523-530.

- Folkert De Boer, Master’s Thesis, The role of speciation in Spatial Co Evolutionary Function Approximation. Utrecht University 2007

- Robin Harper: Spatial co-evolution in Age Layered Planes (SCALP) IEEE Congress on Evolutionary Computation (CEC 2010), IEEE Press, 18-23 July 2010.

- Robin Harper: Enhancing Grammatical Evolution, PhD thesis.

The latter two papers provide further analysis and metrics on the probability of success. The first paper (Pagie and Hogeweg) also analysed the ability of their system to generalise between the data points and suggested that the function can easily be extended to increase the dimensions (e.g. 1/(1+power(x,-4)) +1/(1+power(y,-4)) + 1/(1+power(z,-4))).

Real-world symbolic regression has been one of GP's best success stories.

In real-world problems, in contrast to symbolic regression of synthetic data, there is no ground truth. Validation of results is therefore non-trivial. Not only the error on the test data set matters, but also generalization, model size, model dimensionality, numerical stability, robustness against small perturbations of the inputs, etc. This means that domain expertise may be required in evaluating success.

Some climate data-sets are available from the [[National Climatic Data Center Global Surface Summary of Day -> http://www.ncdc.noaa.gov/cgi-bin/res40.pl?page=gsod.html]].

"We run experiments on two wind speed time series, corresponding to the weather stations in the southern Brazilian cities of Curitiba (station code 838400) and Florianopolis (station code 838990), respectively. We use data from the years 2008 and 2009 for training, and test on data covering the first 11 months of 2010. Both series contain missing entries, and we handle that by replicating the previous observation."

Data-sets used here are available from brain-computer interface competitions.

Symbolic regression of data-sets taken from real-world circuit design problems. Link gives paper, slides, benchmark data, and a new technique, *Fast Function Extraction* for attacking these problems.

- Intertwined spirals Koza

Many real-world data-sets are available in [[the machine learning database -> http://archive.ics.uci.edu/ml/]]. Most are classification problems which have been used for years in the machine learning community. Other ML repositories include:

A survey of classification work is [available here].

- DELVE. A collection of datasets, many taken directly from UCI. Also contains some symbolic regression problems.

"Lower bound" benchmarking in support vector machines and neural network time-series and classification: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5596636.

Many bioinformatics problems are classification problems -- see next section.

These problems are realistic and important. But they are relatively slow, non-tunable, and some may require some domain knowledge.

Some data-sets are publically available. The Gene Expression Omnibus database "stores curated gene expression DataSets, as well as original Series and Platform records in the Gene Expression Omnibus (GEO) repository."

P. Widera, J. Garibaldi, and N. Krasnogor, GPEM.

The Infobiotics PSP benchmarks repository "contains an adjustable real-world family of benchmarks suitable for testing the scalability of classification/regression methods." [...] "We have generated 140 versions of the same prediction problem. 120 as a classification domains, 20 as a regression domain. Some versions use discrete attributes, other continuous. The number of attributes ranges from 1 to 380. The number of classes from 2 to 5. The datasets are partitioned into training/test sets using 10-fold cross-validation." [...] "All datasets have been encoded using the ARFF format of the WEKA machine learning package."

- Cancer prediction (classification)

Langdon, using the GSE3494 breast cancer dataset. Original dataset available here: [1]. Normalised dataset for use in GP benchmarking available here: [2].

See also "The fibromatosis signature defines a robust stromal response in breast carcinoma", Andrew H Beck, Inigo Espinosa, C Blake Gilks, Matt van de Rijn and Robert B West, *Laboratory Investigation* (2008) 88, 591–601,
doi:10.1038/labinvest.2008.31, [3]

- Human bioavailability of drugs

Silva and Vanneschi, "Operator equalisation, bloat and overfitting: a study on human oral bioavailability prediction", GECCO 2009. Data file and header file.

- mRNA

Evolving Regular Expressions as mRNA Motifs to Predict Human Exon Splitting

- Predicting "GeneChip" probe performance

Evolving DNA motifs to Predict GeneChip Probe Performance. "6685 human tissue samples from NCBI's GEO database".

- Classification of proteins into nuclear and non-nuclear. [4]

Natural Computing, Volume 7, Number 4, 589-613, DOI: 10.1007/s11047-007-9038-8

Games can make great benchmarks for several reasons. They introduce dynamic and competitive features, they can be very difficult, and they're fun to visualise. Many competitions already exist in which researchers compete to produce the best AI, machine-learning, or EC-created players for particular games.

- Physical Travelling Salesman -- ongoing competition.

- Mario: [5]. Tunable/variable by choosing different levels. Relatively fast evaluation, for a game.

- Ms PacMan: EvoGames2010. See screen-capture version competition page and team-based pacman v ghosts page.

- Many possible 1-player games, ie puzzles, here

Sipper and colleagues [6] have evolved players or evaluators for the following games:

- Chess -- evolve a board evaluator for endgames

- Backgammon

- Checkers

- Car racing (RARS is now out-of-date -- see TORCS above)

- Rush Hour

- FreeCell

- Trading:

Both artificial ant and lawnmower problems are not "realistic" and were intended to be illustrative, but have been hijacked as benchmarks. More realistic robot control problems have appeared in recent years.

- Artifical Ant (Koza-I). Why ants are hard

Artificial ant is tunable by using alternative trails, including [ http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/WBL.euro98_bloatd.pdf dynamic trails].

- Lawnmower (Koza-II)

The lawnmower was intended to demonstrate a requirement for automatically-defined functions.

- Tartarus problem Cuccu and Gomez -- relatively difficult, large search space. No single standard representation.

- Pole-balancing. Nicolau et al.

- Truck Backer-Upper Koza

Mostly from Koza. Aim is always to evolve a function outputting a particular truth table. Tunable by varying number of inputs (N). Different function sets are traditional for each function, including And-Or-Not, And-Or-Not-If, And-Or-Nand-Nor. Rather misleading in that random trees often get half the fitness cases correct, and many mutations are not phenotype (ie semantics)-neutral but are fitness-neutral, and fitness improvements often come in large chunks, eg 2^m at a time. Typical values for N are small, as shown. Larger instances of these problems, eg 20- and 37-Multiplexer, have been tackled.

- N-Parity (typically N = 4, 5, 6)

- N-Multiplexer (typically N = 3, 6, 11)

- N-Majority (typically N < 10?)

- Boolean problems with multiple outputs

Digital adder and multipliers -- these are Boolean circuits that take two operands (in binary) and output in binary the sum or product. Thus there are 2n + 1 inputs (including a carry) and n + 1 outputs (including a carry), for adders, and 2n outputs for multipliers.

Walker J.A., Völk, K. , Smith, S. L., Miller, J. F. Parallel evolution using multi-chromosome Cartesian genetic programming, Genetic Programming and Evolvable Machines, 10 (4), (2009) pp 417-445

Walker J.A., Miller J.F. The Automatic Acquisition, Evolution and Re-use of Modules in Cartesian Genetic Programming. IEEE Transactions on Evolutionary Computation, 12, (2008) pp 397-417

- General solutions

Solving general Boolean problems, ie for all N, is another more difficult problem.

S. Harding, J. F. Miller, W. Banzhaf. Developments in Cartesian Genetic Programming: Self-modifying CGP, , Genetic Programming and Evolvable Machines, 11 (3/4), (2010) pp397-439

Tina Yu, "Hierarchical Processing for Evolving Recursive and Modular Programs Using Higher-Order Functions and Lambda Abstraction", Genetic Programming and Evolvable Machines, 2, 345–380, 2001

These include problems not intended to benchmark one operator against another or one GP design against another. Instead they are intended for empirical investigations into the nature of how GP works. All (?) are useful only for tree representations. Not intended to provide useful information about behaviour on real-world problems. All (?) are fast, since they are evaluated by inspection, not execution, and do not require multiple executions of a program on multiple fitness cases.

- Order. Only internal function is "join". Tunable -- by changing number of inputs

- Majority. Only internal function is "join". Tunable -- by changing number of inputs

- Lid problem: Daida etal. Only internal function is "join". Intended to demonstrate structural issues.

- Max: Chris Gathercole and Peter Ross. "An adverse interaction between crossover and restricted tree depth in genetic programming", GP I, 1996. Intended to demonstrate a particular difficulty for GP crossover.

- Royal Tree (Punch). Tunable by changing alphabet and target. Analogous to GA Royal Road. Intended to demonstrate structural issues.

- TreeString: Gustafson et al. Not just for "study" -- also intended as a realistic benchmark combining structural and content requirements.

- K Landscapes Vanneschi et al. A family of benchmark problems somewhat similar to and influenced by Royal Tree and Tree-String problems. Tunable with value of K.

A large majority of GP work evolves purely functional programs in which every subtree is of a single type. Real programming involves multiple types, composite types, variables, recursion, iteration, branching, classes, etc.

- Sorting

- Automatic debugging

Automatically finding patches using genetic programming

The inductive programming community maintains a repository of benchmarks with results from several standard systems. Most benchmarks consist of the induction of functions such as list-sorting, extraction of the first element of a list, and similar, though there are also problems like the Towers of Hanoi. Inductive Programming benchmark repository.

Adding a dynamically-varying fitness function, or some other dynamic aspect, can make a problem much more difficult and sometimes much more realistic. Some examples.

- Dynamic Scheduling

Domagoj Jakobovic and Leo Budin, "Dynamic Scheduling with Genetic Programming", EuroGP 2006.

- Intrusion Detection

J. Hansen, P. Lowry, R. Meservy, D. McDonald, "Genetic programming for prevention of cyber-terrorism through dynamic and evolving intrusion detection." Decis. Support Syst. 43(4), 1362–1374

- Time series forecasting

N. Wagner, Z. Michalewicz, M. Khouja, R. McGregor, Time series forecasting for dynamic environments: The dyfor genetic program model. IEEE Trans. Evol. Comput. 11(4), 433–452 (2006)

- Prime number prediction: EuroGP 2007

- Maximum Overhang: Overhang. Fast to evaluate. Easy to understand, optimal solution not known, bounds are known.

- Chaotic Control/discrete dynamical systems: Lones et al