Criteria

Benchmark Criteria

The GP-Beagle paper lists five primary desiderata for benchmarks; a short sketch after the quotation shows one way a benchmark suite might record them.

"The essential requirements on a benchmark are [...]:

  • Volume (the benchmark should include several and diverse problems)
  • Validity (common errors that invalidate the results should be avoided)
  • Reproducibility (problems and experiments should be documented well enough to be reproducible)
  • Comparability (results should be comparable with results from other studies) [...]
  • Representability (benchmark problems should resemble the problems met in the real world)."
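
As a concrete illustration of these desiderata, a benchmark suite might record each problem as a small, fully documented specification. The Python sketch below is an assumption about how such a record could look (the ProblemSpec name and its fields are hypothetical, not an existing API); the three entries are classic GP benchmark problems from Koza's work.

  from dataclasses import dataclass

  # Hypothetical sketch: one record per benchmark problem, carrying every
  # detail needed to regenerate it exactly (reproducibility) and to
  # compare results across studies (comparability).
  @dataclass(frozen=True)
  class ProblemSpec:
      name: str           # problem identifier
      parameters: dict    # every setting needed to reconstruct the problem
      seed: int           # fixed seed so any generated data is identical everywhere
      data_url: str = ""  # where any required external dataset can be fetched

  # Several diverse problems (volume); all three are classic GP benchmarks.
  SUITE = [
      ProblemSpec("quartic-regression", {"n_points": 20, "noise": 0.0}, seed=42),
      ProblemSpec("6-multiplexer", {"address_bits": 2}, seed=42),
      ProblemSpec("artificial-ant", {"trail": "santa-fe"}, seed=42),
  ]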

In the GP mailing list discussions, a long list of criteria for new benchmarks was mentioned, either explicitly or implicitly, many of which overlap the GP-Beagle criteria in some way. They include the following (a code sketch after the list makes a few of them concrete):

  • Nontrivial
  • Fast
  • Easy to write
  • Portable across architectures/languages
  • Spread of possible fitness values
  • Multiobjective
  • Amenable to different GP representations
  • Tunable difficulty (including relative size of solution set, deception, size of input data...)
  • Readability of results
  • Performance should reflect/predict performance on "real-world" problems
  • Existing implementations
  • Possibility of comparing against others' results, both within GP and using other methodologies
  • Little/no requirement to acquire expertise in specialised problem domain
  • Free availability of any required datasets
  • Not vulnerable to paradigm shifts in the underlying problem domain (cf. bioinformatics)
  • Easy statistical treatment of results (cf. "hits").
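
To make a few of these criteria concrete, the sketch below implements Koza's classic quartic-polynomial regression problem as a minimal benchmark (the class name and its interface are assumptions, not an existing library). It shows tunable difficulty (the number of cases and the noise level are knobs), reproducibility (all randomness derives from a documented seed), a spread of real-valued fitness values instead of binary "hits", and amenability to different GP representations (any callable can be evaluated).

  import random
  from dataclasses import dataclass

  @dataclass
  class QuarticRegressionBenchmark:
      # Hypothetical interface; parameter names are illustrative only.
      seed: int = 42        # reproducibility: all randomness derives from this
      n_points: int = 20    # tunable difficulty: size of the input data
      noise: float = 0.0    # tunable difficulty: observation noise

      def __post_init__(self):
          rng = random.Random(self.seed)
          # Target function: Koza's quartic polynomial x^4 + x^3 + x^2 + x.
          self.cases = [
              (x, x**4 + x**3 + x**2 + x + rng.gauss(0.0, self.noise))
              for x in (rng.uniform(-1.0, 1.0) for _ in range(self.n_points))
          ]

      def fitness(self, program):
          """Mean absolute error over the cases (lower is better).

          `program` is any callable float -> float, so the benchmark is
          agnostic to the GP representation that produced it.
          """
          return sum(abs(program(x) - y) for x, y in self.cases) / len(self.cases)

  # Usage: the exact solution scores 0.0; worse programs get graded errors,
  # which keeps fitness values spread out and easy to treat statistically.
  bench = QuarticRegressionBenchmark(seed=1, n_points=50)
  print(bench.fitness(lambda x: x**4 + x**3 + x**2 + x))  # 0.0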

In "Accelerating progress in Artificial General Intelligence: Choosing a benchmark for natural world interaction", Brandon Rohrer writes that a good benchmark should have these characteristics (Journal of Artificial General Intelligence 2(1) 1-28, 2010):

  • "Fitness Success on the benchmark solves the right problem.
  • Breadth Success on the benchmark requires breadth of problem solving ability
  • Specificity The benchmark produces a quantitative evaluation.
  • Low Cost The benchmark is inexpensive to evaluate.
  • Simplicity The benchmark is straightforward to describe.
  • Range The benchmark may be applied to both primitive and advanced systems.
  • Task Focus The benchmark is based on the performance of a task."