Here, we classify some common problems according to domain, difficulty, and/or purpose.
This collection of problems does NOT constitute a draft or proposed benchmark suite. It is an incomplete list of problems which have been used in previous work.
I've often dreamed of writing a paper "Death to Quartic Polynomial" because it's such a waste of time, giving worse-than-zero information on GP performance for anything real -- Trent McConaghy, private communication.
Floating-point symbolic regression, polynomials - quartic, quintic, single trigonometric/exponential functions, e.g. from Koza-1, Binomial-N, etc.
Natural Computing, Volume 7, Number 4, 589-613, DOI: 10.1007/s11047-007-9038-8
Key issues include:
Interpolation (relatively easy), extrapolation of existing trends (much more difficult), and extrapolation to new trends (e.g. the plateau of a logistic function -- impossible?).
The exact form of the target function, as opposed to a close approximation, may depend on having sufficient sampled data, which in turn depends on dimensionality. For low numbers of dimensions, GP can hope to find the solution exactly. For larger numbers of dimensions, close approximation may be possible even if exact matching is not. For very large numbers of dimensions, it may be impossible to do better than determining which dimensions are of relevance.
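The interpolation/extrapolation distinction can be made concrete with Koza-1 (the quartic x^4 + x^3 + x^2 + x). A minimal sketch: training and interpolation test points share the conventional range [-1, 1] of 20 random samples, while the extrapolation range [1, 2] here is an illustrative choice, not a standard protocol.

```python
import random

def koza1(x):
    # Koza-1 target: x^4 + x^3 + x^2 + x
    return x ** 4 + x ** 3 + x ** 2 + x

# Interpolation: train and test both sampled from [-1, 1].
train = [(x, koza1(x)) for x in (random.uniform(-1, 1) for _ in range(20))]
test_interp = [(x, koza1(x)) for x in (random.uniform(-1, 1) for _ in range(20))]

# Extrapolation: test points drawn from outside the training range
# (the range [1, 2] is illustrative only).
test_extrap = [(x, koza1(x)) for x in (random.uniform(1, 2) for _ in range(20))]
```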
[Semantically-based Crossover in Genetic Programming: Application to Real-valued Symbolic Regression] Nguyen Quang Uy, Nguyen Xuan Hoai, Michael O’Neill, R.I. McKay, Edgar Galvan-Lopez, GPEM 2011.
"in current state-of-the-art symbolic regression engines, accuracy is poor"
Accuracy in Symbolic Regression -- GPTP 2011
"Our testing regimen uses only statistical best practices out-of-sample testing techniques. We test each of the test cases on matrices of 10000 rows by 1 to 5 columns with no noise. For each test a training matrix is filled with random numbers between -50 and +50. The test case target expressions are limited to one basis function whose maximum depth is three grammar nodes."
Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming. Ekaterina J. Vladislavleva, Guido F. Smits, and Dick den Hertog. This includes some extrapolative versions of Keijzer's interpolative problems. See below for the method of sampling training and testing data. Note that the final problem, "Tower", requires input data which is not available.
F(x, y) = 1/(1 + x^-4) + 1/(1 + y^-4), i.e.:
Range: -5 <= x <= 5 and -5 <= y <= 5, normally with data points 0.4 apart, which gives a total of 676 (26 x 26) data points. The shape of the graph between these points is quite smooth. This is a hard problem for standard GP -- none of the papers below has been able to get GP to solve it (using a Koza-style 676-hits predicate). This also contrasts with several of Korns' hard problems, above, which rely on high-frequency sinusoids (for example) for difficulty.
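As a sketch, the target function and its standard 676-point grid can be generated as follows. Note that the grid never hits x = 0 or y = 0, where x^-4 is undefined.

```python
def pagie(x, y):
    # Pagie-Hogeweg target: 1/(1 + x^-4) + 1/(1 + y^-4)
    return 1.0 / (1.0 + x ** -4) + 1.0 / (1.0 + y ** -4)

# 26 points per axis, from -5 to 5 inclusive in steps of 0.4,
# giving 26 * 26 = 676 fitness cases.
axis = [-5.0 + 0.4 * i for i in range(26)]
cases = [(x, y, pagie(x, y)) for x in axis for y in axis]
```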
The solution is possible using a GP-style system (or GE system) with spatial co-evolution or with ALPS (Age-Layered Population Structure) (Hornby).
The latter two papers provide further analysis and metrics on the probability of success. The first paper (Pagie and Hogeweg) also analysed the ability of their system to generalise between the data points and suggested that the function can easily be extended to increase the dimensions (e.g. 1/(1+power(x,-4)) +1/(1+power(y,-4)) + 1/(1+power(z,-4))).
Real-world symbolic regression has been one of GP's best success stories.
In real-world problems, in contrast to symbolic regression of synthetic data, there is no ground truth. Validation of results is therefore non-trivial. Not only the error on the test data set matters, but also generalization, model size, model dimensionality, numerical stability, robustness against small perturbations of the inputs, etc. This means that domain expertise may be required in evaluating success.
Some climate data-sets are available from the [[National Climatic Data Center Global Surface Summary of Day -> http://www.ncdc.noaa.gov/cgi-bin/res40.pl?page=gsod.html]].
"We run experiments on two wind speed time series, corresponding to the weather stations in the southern Brazilian cities of Curitiba (station code 838400) and Florianopolis (station code 838990), respectively. We use data from the years 2008 and 2009 for training, and test on data covering the first 11 months of 2010. Both series contain missing entries, and we handle that by replicating the previous observation."
Data-sets used here are available from brain-computer interface competitions.
Symbolic regression of data-sets taken from real-world circuit design problems. Link gives paper, slides, benchmark data, and a new technique, Fast Function Extraction for attacking these problems.
Many real-world data-sets are available in [[the machine learning database -> http://archive.ics.uci.edu/ml/]]. Most are classification problems which have been used for years in the machine learning community. Other ML repositories include:
A survey of classification work is [available here].
"Lower bound" benchmarking in support vector machines and neural network time-series and classification: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5596636.
Many bioinformatics problems are classification problems -- see next section.
These problems are realistic and important, but they are relatively slow, non-tunable, and some may require domain knowledge.
Some data-sets are publicly available. The Gene Expression Omnibus database "stores curated gene expression DataSets, as well as original Series and Platform records in the Gene Expression Omnibus (GEO) repository."
P. Widera, J. Garibaldi, and N. Krasnogor, GPEM.
The Infobiotics PSP benchmarks repository "contains an adjustable real-world family of benchmarks suitable for testing the scalability of classification/regression methods." [...] "We have generated 140 versions of the same prediction problem. 120 as a classification domains, 20 as a regression domain. Some versions use discrete attributes, other continuous. The number of attributes ranges from 1 to 380. The number of classes from 2 to 5. The datasets are partitioned into training/test sets using 10-fold cross-validation." [...] "All datasets have been encoded using the ARFF format of the WEKA machine learning package."
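The PSP datasets come pre-partitioned, but for illustration, the 10-fold cross-validation scheme described can be sketched as follows (a hypothetical helper, not part of the repository):

```python
import random

def ten_fold_partitions(n_samples, seed=0):
    """Partition sample indices into 10 disjoint folds; each fold in
    turn serves as the test set, the other nine as training data."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```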
See also "The fibromatosis signature defines a robust stromal response in breast carcinoma", Andrew H Beck, Inigo Espinosa, C Blake Gilks, Matt van de Rijn and Robert B West, Laboratory Investigation (2008) 88, 591–601, doi:10.1038/labinvest.2008.31, 
Evolving DNA motifs to Predict GeneChip Probe Performance. "6685 human tissue samples from NCBI's GEO database".
Natural Computing, Volume 7, Number 4, 589-613, DOI: 10.1007/s11047-007-9038-8
Games can make great benchmarks for several reasons. They introduce dynamic and competitive features, they can be very difficult, and they're fun to visualise. Many competitions already exist in which researchers compete to produce the best AI, machine-learning, or EC-created players for particular games.
Sipper and colleagues have evolved players or evaluators for the following games:
The artificial ant and lawnmower problems are not "realistic": both were intended to be illustrative, but have been hijacked as benchmarks. More realistic robot control problems have appeared in recent years.
Artificial ant is tunable by using alternative trails, including [[dynamic trails -> http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/WBL.euro98_bloatd.pdf]].
The lawnmower was intended to demonstrate a requirement for automatically-defined functions.
Mostly from Koza. The aim is always to evolve a function computing a particular truth table. Tunable by varying the number of inputs (N). Different function sets are traditional for each problem, including And-Or-Not, And-Or-Not-If, and And-Or-Nand-Nor. These problems are rather misleading in that random trees often get half the fitness cases correct; many mutations are not phenotype- (i.e. semantics-) neutral but are fitness-neutral; and fitness improvements often come in large chunks, e.g. 2^m cases at a time. Typical values for N are small, as shown. Larger instances of these problems, e.g. the 20- and 37-Multiplexer, have been tackled.
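The easy-partial-credit point can be seen in a sketch of the 6-multiplexer with a Koza-style hits measure:

```python
from itertools import product

def mux6(a1, a0, d0, d1, d2, d3):
    # 6-multiplexer: the two address bits select one of four data bits.
    return (d0, d1, d2, d3)[(a1 << 1) | a0]

def hits(candidate):
    """Number of the 2^6 = 64 fitness cases on which the candidate
    matches the target truth table."""
    return sum(candidate(*bits) == mux6(*bits)
               for bits in product((0, 1), repeat=6))
```

Even the trivial candidate `lambda a1, a0, d0, d1, d2, d3: d0` scores 40 of the 64 hits, well above half.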
Digital adders and multipliers -- these are Boolean circuits that take two n-bit operands and output their sum or product in binary. An adder thus has 2n + 1 inputs (including a carry-in) and n + 1 outputs (including a carry-out); a multiplier has 2n inputs and 2n outputs.
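The adder input/output counts can be checked with a small truth-table generator (a sketch, not any particular system's benchmark code):

```python
from itertools import product

def adder_truth_table(n):
    """Truth table for an n-bit adder: 2n + 1 input bits (two operands
    plus carry-in) map to n + 1 output bits (sum plus carry-out)."""
    table = {}
    for bits in product((0, 1), repeat=2 * n + 1):
        a = int("".join(map(str, bits[:n])), 2)       # first operand
        b = int("".join(map(str, bits[n:2 * n])), 2)  # second operand
        cin = bits[-1]                                # carry-in bit
        s = a + b + cin
        # Most significant bit first; the top bit is the carry-out.
        table[bits] = tuple((s >> i) & 1 for i in reversed(range(n + 1)))
    return table
```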
Walker, J. A., Völk, K., Smith, S. L., Miller, J. F. Parallel evolution using multi-chromosome Cartesian genetic programming. Genetic Programming and Evolvable Machines, 10(4), (2009), pp. 417-445
Walker, J. A., Miller, J. F. The Automatic Acquisition, Evolution and Re-use of Modules in Cartesian Genetic Programming. IEEE Transactions on Evolutionary Computation, 12, (2008), pp. 397-417
Solving general Boolean problems, ie for all N, is another more difficult problem.
S. Harding, J. F. Miller, W. Banzhaf. Developments in Cartesian Genetic Programming: Self-modifying CGP. Genetic Programming and Evolvable Machines, 11(3/4), (2010), pp. 397-439
Tina Yu, "Hierarchical Processing for Evolving Recursive and Modular Programs Using Higher-Order Functions and Lambda Abstraction", Genetic Programming and Evolvable Machines, 2, 345–380, 2001
These include problems not intended to benchmark one operator against another or one GP design against another. Instead they are intended for empirical investigations into the nature of how GP works. All (?) are useful only for tree representations. Not intended to provide useful information about behaviour on real-world problems. All (?) are fast, since they are evaluated by inspection, not execution, and do not require multiple executions of a program on multiple fitness cases.
A large majority of GP work evolves purely functional programs in which every subtree is of a single type. Real programming involves multiple types, composite types, variables, recursion, iteration, branching, classes, etc.
The inductive programming community maintains a repository of benchmarks with results from several standard systems. Most benchmarks consist of the induction of functions such as list-sorting, extraction of the first element of a list, and similar, though there are also problems like the Towers of Hanoi. Inductive Programming benchmark repository.
Adding a dynamically-varying fitness function, or some other dynamic aspect, can make a problem much more difficult and sometimes much more realistic. Some examples:
Domagoj Jakobovic and Leo Budin, "Dynamic Scheduling with Genetic Programming", EuroGP 2006.
J. Hansen, P. Lowry, R. Meservy, D. McDonald, "Genetic programming for prevention of cyber-terrorism through dynamic and evolving intrusion detection." Decis. Support Syst. 43(4), 1362–1374
N. Wagner, Z. Michalewicz, M. Khouja, R. McGregor, Time series forecasting for dynamic environments: the DyFor genetic program model. IEEE Trans. Evol. Comput. 11(4), 433–452 (2007)