Main Page

From GPBenchmarks
Jump to: navigation, search

Note: some of the content of this wiki is made obsolete by the GP Benchmarks paper. It is not under active editing while the 2012 community survey is underway. Please email to ask for editing access.



"I think GP has a toy problem problem." -- Sean Luke.

In early March 2011, the Genetic Programming mailing list saw some traffic on a "Call for Hard Benchmark Problems in Genetic Programming" (archived here and here).

There seems to be a consensus that many GP benchmarks are too much like toy problems, that problems where we expect to find the optimum in most runs are unrealistic, and that new standard benchmarks are needed. Quite a few senior researchers contributed ideas and comments. This page is intended to track the issue and provide a repository for existing and new benchmarks.

The problem of benchmarking has received recognition in the published literature both old and new. The 2010 GPEM 10-year anniversary special issue featured discussion by O'Neill et al of benchmark problems:

"Issue: Is it possible to define a set of test problems that can be rigorously evaluated by the scientific community and then accepted as a more or less agreed upon set of benchmarks for experimental studies in GP? Difficulty: Medium to high. Benchmark problems serve as a common language for a community when developing algorithms, and therefore a certain amount of inertia to not adopt better ones exists. Also, since GP is already a fairly complex algorithm, adopting overly simple benchmark problems enables overall lower complexity in the system, albeit at the expense of the significance of results. Other fields of Evolutionary Computation, like GAs or PSO, have a wide set of generally studied and agreed open benchmark functions." -- from "Open issues in genetic programming", O'Neill, Vanneschi, Gustafson, and Banzhaf, GPEM 2010.

The issues have been discussed outside the field of EC also. In the field now known as Artificial General Intelligence, the Turing Test is a very well-known "benchmark", though it is controversial and lacks several desirable characteristics. It does not serve as a way to compare progress, but as a focus for discussion. In order to be more useful as a benchmark, it was in a sense operationalised as the Loebner Prize.

Many fields outside EC maintain benchmark repositories. A list of repositories covering many areas of computational intelligence is available here. (The two links to EC repositories are out-of-date.)

Going back to 2000, we find recognition of the major issues and an attempt to address them. GP-Beagle is a precursor of the present effort, in that the aim was to collect information on standard benchmarks in an online repository. GP-Beagle was a database consisting of problems, problem instances, benchmarking suites, and usage information. It was open to the addition of new problems, as is this wiki page. The distinction between problems and problem instances was important: the former are general entities such as a symbolic regression over a particular data set; the latter is more specific, including all the information required for reproducibility such as training/testing splits, evolutionary parameters, and so on. For each problem, a collection of attributes were saved in the database together with zipped data files as necessary and references to papers using the problems.

GP-Beagle was ambitious in several ways. It required significant extra work of those authors choosing to submit benchmarks to the repository. It defined a fixed template for the problems and problem instances to be added to the repository. It is easy to see the advantages of this ambition, if realised: a standardised set of problems would be an invaluable resource. Being machine-readable, the database would open up the possibility of automated testing.

Unfortunately, GP-Beagle did not live up to its potential, and the website (originally [1]) does not now exist. The database itself seems not to exist even in backup form (according to the author, Robert Feldt, who we thank; we would welcome news from anyone who has saved a copy of the website or database). This outcome may be partly the result of GP-Beagle's ambitious nature. Regardless of the existence of any repository, many or most authors will continue to work with their existing code without submitting it to a repository. Nevertheless, GP-Beagle was a valuable contribution in setting out the issues and some new ideas. The proposal to establish links between the benchmark repository and the GP Bibliography is a good one. For now, however, we have chosen to establish our repository as a work-in-progress wiki page, without any standardisation of the format of benchmark problems or machine-readability.

Another significant effort was made by Sendhoff, Roberts, and Yao: the Evolutionary Computation Benchmarking Repository, aka EvoCoBM. See this paper. This project included implementations of problems. Unfortunately, this project suffered the same fate as GP-Beagle. It was hosted here, but was not greatly frequented, and this link is now dead.

Finally, we recall that benchmarks are not a be-all and end-all. Heavy use of standard benchmarks did drive progress in some sub-fields of machine learning for some years (see the GP-Beagle paper for some discussion in the case of neural networks, for example). But marginal improvements on benchmarks are not a substitute for game-changing ideas and novel fields of application for GP.

Taking action


Arguments against a standardised benchmark suite

Problem Classification


Recommended problems


Personal tools