Wikipedia: Benchmark

A computer program or programs designed to test and profile one or more aspects of a CPU (such as floating point operations) or computer system (such as graphics capabilities).

Benchmarks provide a method of comparing the performance of various subsystems across different chip/system architectures.

As computer architecture advanced, it became more and more difficult to compare the performance of various computer systems simply by looking at their specifications. Therefore, tests were developed that could be performed on different systems, allowing the results from these tests to be compared across different architectures.

Benchmarks are designed to mimic a particular type of workload on a component or system. "Synthetic" benchmarks do this by specially-created programs that impose the workload on the component. "Application" benchmarks, instead, run actual real-world programs on the system. Whilst application benchmarks usually give a much better idea of real system performance than synthetic benchmarks, they tend to reflect all aspects of a system rather than individual parts of it, so synthetic benchmarks can be useful for evaluation of, say, a hard disk or networking device.

Computer manufacturers have a long history of trying to set up their systems to give unrealistically high performance on benchmark tests that is not replicated in real usage. For instance, during the 1980's some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace the operation with a mathematically-equivalent operation that was much faster. However, such a transformation was rarely useful outside the benchmark.

More generally, users are recommended to take benchmarks, particularly those provided by manufacturers themselves, with ample quantities of salt. If performance is really critical, the only benchmark that matters is the actual workload that the system is to be used for. If that is not possible, benchmarks that resemble real workloads as closely as possible should be used, and even then used with scepticism. It is quite possible for system A to outperform system B when running program furble on workload X (the workload in the benchmark), and the order to be reversed with the same program on your own workload.

Some common benchmarks are:

Dhrystone
Whetstone?
SPEC?
BAPco?
3DMark?
Quake
Khornerstone