Quantum Gap AI

How to read a quantum benchmark: Quantum Volume, XEB, and what they actually measure

June 6, 2026 · 9 min read

Every quantum hardware vendor publishes benchmark numbers. Most arrive without context: "we hit Quantum Volume 1024!" tells you little unless you know what the number measures and how it was obtained. This piece is a sceptic's field guide to the five benchmarks Quantum Gap AI ships, what each one actually measures, and how to use them to compare two backends honestly.

1. Quantum Volume (QV)

IBM's headline metric. What it answers: what's the biggest "square" circuit (equal depth and width) this device can execute well enough that its output is statistically distinguishable from random noise?

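How: generate random circuits of width n and depth n. Run them. Compare the measured distribution to the ideal one via a statistical test: the heavy-output probability (the fraction of shots that land on outputs whose ideal probability is above the median) must exceed 2/3. The largest n for which the test passes sets the score: QV = 2^n, so QV 1024 corresponds to n = 10.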
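
The test for a single circuit can be sketched as follows. This is a minimal illustration, assuming you already have the ideal output probabilities from a simulator and the measured counts from the device. The function names, the data layout, and the toy numbers are made up for the example and are not part of any vendor API. A full QV run also averages over many circuits and applies a confidence bound, which is omitted here.

```python
from statistics import median

def heavy_outputs(ideal_probs):
    """Bitstrings whose ideal probability is above the median ideal probability.

    ideal_probs must contain every one of the 2**n bitstrings.
    """
    cutoff = median(ideal_probs.values())
    return {bits for bits, p in ideal_probs.items() if p > cutoff}

def heavy_output_probability(ideal_probs, counts):
    """Fraction of measured shots that landed on a heavy output."""
    heavy = heavy_outputs(ideal_probs)
    shots = sum(counts.values())
    return sum(c for bits, c in counts.items() if bits in heavy) / shots

def quantum_volume(passing_widths):
    """QV = 2**n for the largest width n whose circuits passed the test."""
    return 2 ** max(passing_widths)

# Toy 2-qubit example (made-up numbers, not real device data).
ideal = {"00": 0.40, "01": 0.10, "10": 0.35, "11": 0.15}
counts = {"00": 380, "01": 130, "10": 330, "11": 160}

hop = heavy_output_probability(ideal, counts)
passed = hop > 2 / 3
```

Here the heavy outputs are "00" and "10", and 710 of the 1000 shots land on them, so this toy circuit passes.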

What it doesn't tell you: performance on any specific algorithm. QV is a circuit-class average. A device with QV 1024 might be perfect for chemistry and terrible for routing problems, or vice versa. QV is the right metric for comparing two general-purpose devices; it's the wrong metric for predicting how well your specific tool will run.

2. Cross-Entropy Benchmarking (XEB)

Google's metric. What it answers: how close is the measured output distribution to the ideal one, averaged over many random circuits?

How: generate random circuits; run them; calculate the cross-entropy between the measured distribution and the ideal one. Average over many circuits.
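
As a rough sketch of the scoring step, the snippet below uses the linear XEB estimator, F = 2^n × (mean ideal probability of the measured bitstrings) − 1. Note that this is the linear variant, swapped in for simplicity, rather than the logarithmic cross-entropy itself. The function names and toy numbers are assumptions for illustration only.

```python
def linear_xeb(ideal_probs, counts, n_qubits):
    """Linear XEB score for one circuit.

    Averages the ideal probability of every measured bitstring, then
    rescales so that uniformly random output scores about 0.
    """
    shots = sum(counts.values())
    mean_ideal = sum(ideal_probs.get(bits, 0.0) * c
                     for bits, c in counts.items()) / shots
    return (2 ** n_qubits) * mean_ideal - 1.0

def average_xeb(runs, n_qubits):
    """Average the per-circuit score over (ideal_probs, counts) pairs."""
    return sum(linear_xeb(p, c, n_qubits) for p, c in runs) / len(runs)

# Toy 2-qubit example (made-up numbers).
ideal = {"00": 0.40, "01": 0.10, "10": 0.35, "11": 0.15}
uniform_counts = {"00": 250, "01": 250, "10": 250, "11": 250}

# A device that outputs pure noise samples uniformly and scores ~0.
noise_score = linear_xeb(ideal, uniform_counts, 2)
```

The rescaling is the useful design choice: a completely depolarized device lands near zero regardless of qubit count, which makes scores from different circuit widths easier to read side by side.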

Use it for: claims of "quantum supremacy" in the original Google sense (a specific XEB score on a sampling task that classical computers would take impractically long to reproduce). For day-to-day device comparison, QV is easier to interpret.

3. Mirror Benchmarking

The cleanest single-circuit test. What it answers: can the device execute a random circuit AND its inverse and end up back in the initial state?

How: apply a random Clifford+T circuit, then apply its mathematical inverse. The ideal output is the initial state (e.g., |0…0⟩) with 100% probability. Measured deviation from that is pure device noise.
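
The construction can be sketched without any quantum SDK. The snippet below represents a circuit as a list of (gate name, qubits) pairs, builds the mirror by reversing the order and inverting each gate, and scores a run as the fraction of shots that return all zeros. The gate set, data layout, and counts are illustrative assumptions, not the catalog's actual implementation.

```python
# Inverse of each gate in a small illustrative Clifford+T gate set.
INVERSE = {
    "h": "h", "x": "x", "cx": "cx",   # self-inverse gates
    "s": "sdg", "sdg": "s",
    "t": "tdg", "tdg": "t",
}

def mirror(circuit):
    """Return circuit followed by its inverse.

    The inverse applies the gates in reverse order, each replaced by
    its own inverse.
    """
    inverse = [(INVERSE[name], qubits) for name, qubits in reversed(circuit)]
    return circuit + inverse

def survival_probability(counts, n_qubits):
    """Fraction of shots that returned the all-zeros bitstring."""
    return counts.get("0" * n_qubits, 0) / sum(counts.values())

forward = [("h", (0,)), ("t", (0,)), ("cx", (0, 1)), ("s", (1,))]
full = mirror(forward)

# Made-up counts from a hypothetical 2-qubit run of the mirrored circuit.
counts = {"00": 930, "01": 30, "10": 25, "11": 15}
score = survival_probability(counts, 2)
```

On an ideal device the score would be 1.0, so anything below that is attributable to the hardware (including state preparation and readout).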

Why it matters: Mirror is the most "algorithmic" single-shot benchmark we have. The result is calibrated against a guaranteed-correct ideal. If a backend's Mirror score drops 5% overnight, something physical has changed.

4. Randomized Benchmarking (RB)

The oldest and most boring benchmark. What it answers: what's the average error per Clifford gate?

How: apply random Clifford sequences of increasing length, then their inverse. Plot the survival probability vs sequence length. Fit an exponential decay. The decay rate gives you average Clifford gate fidelity.
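
The fitting step can be sketched with the standard library alone. The model is p(m) = A·α^m + B, where m is the sequence length. To keep the example short, this sketch assumes the asymptote B is known and equal to 1/2^n, which turns the fit into a straight line in log space. Real data is noisy and usually calls for a nonlinear fit with B as a free parameter. The function name and the synthetic data are illustrative.

```python
import math

def fit_rb_decay(lengths, survival, n_qubits=1):
    """Fit p(m) = A * alpha**m + B with B assumed to be 1 / 2**n_qubits.

    Returns (alpha, error_per_clifford). Requires every survival value
    to be above the assumed baseline B.
    """
    d = 2 ** n_qubits
    baseline = 1.0 / d
    # log(p - B) = log(A) + m * log(alpha): ordinary least-squares slope.
    ys = [math.log(p - baseline) for p in survival]
    mean_x = sum(lengths) / len(lengths)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, ys))
             / sum((x - mean_x) ** 2 for x in lengths))
    alpha = math.exp(slope)
    error_per_clifford = (d - 1) * (1 - alpha) / d
    return alpha, error_per_clifford

# Synthetic single-qubit data generated with alpha = 0.99 (not device data).
lengths = [1, 10, 50, 100, 200]
survival = [0.5 + 0.5 * 0.99 ** m for m in lengths]

alpha, epc = fit_rb_decay(lengths, survival)
fidelity = 1 - epc
```

For this synthetic input the fit recovers α ≈ 0.99, which corresponds to an average Clifford fidelity of about 99.5%.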

Use it for: directly comparing single-qubit and two-qubit gate fidelities across devices. The single number you'll read on every hardware vendor's spec sheet ("99.5% two-qubit fidelity") came from RB.

5. Iterative Phase Estimation (IPE)

The algorithmic benchmark. What it answers: can the device read out a known phase to a given number of bits of precision?

How: prepare a known eigenstate of a known unitary; run IPE on it; count how many bits of the measured phase match the true phase. Each additional bit you extract needs a circuit roughly 2× deeper than the previous one, so this benchmark aggressively rewards coherence.
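
A minimal sketch of the scoring side, assuming phases are expressed as fractions of a full turn in [0, 1) and compared by their leading binary digits. The function names are hypothetical. The cost function uses the textbook scaling in which the k-th bit of precision needs about 2^(k−1) applications of the controlled unitary, consistent with the doubling described above. Comparing truncated bitstrings ignores wrap-around cases (for example 0.4999 versus 0.5), which a production scorer would need to handle.

```python
def phase_bits(phase, n_bits):
    """First n_bits of the binary fraction of a phase in [0, 1)."""
    value = int(phase * 2 ** n_bits)
    return format(value, f"0{n_bits}b")

def matching_bits(true_phase, measured_phase, n_bits):
    """Number of leading bits on which the two phases agree."""
    count = 0
    for t, m in zip(phase_bits(true_phase, n_bits),
                    phase_bits(measured_phase, n_bits)):
        if t != m:
            break
        count += 1
    return count

def controlled_u_applications(bit_index):
    """Applications of the unitary for bit k, counting from 1."""
    return 2 ** (bit_index - 1)

true_phase = 0.8125   # 0.1101 in binary
measured = 0.75       # 0.1100 in binary

score = matching_bits(true_phase, measured, 4)
cost_per_bit = [controlled_u_applications(k) for k in range(1, 5)]
```

In this toy case the first three bits agree and the fourth does not, and the fourth bit alone costs eight applications of the unitary, as many as the first three combined plus one.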

How to use these on Quantum Gap AI

All five are tools in the catalog. The right way to use them:

  • Before you trust any other result — run QV + Mirror on the same day as your algorithm. If they're worse than yesterday, your algorithm's results are noisier than yesterday's would have been.
  • Comparing two backends — RB the gates, Mirror-benchmark a sample circuit, IPE for coherence. Three different lenses on the same device.
  • Telling a procurement story — QV is the right single number for non-technical readers ("the backend executed a circuit of depth log₂(QV)").

The honest version

No benchmark perfectly predicts how your specific algorithm will run. The point of running benchmarks isn't certainty — it's getting context. If your routing algorithm's success drops by 30%, was it your circuit's fault or the backend's? Run the benchmarks. If they dropped too, the backend had a bad day. If they didn't, you have a bug. That's the use.
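
That decision rule is simple enough to write down. The sketch below is one possible encoding, assuming every score is "higher is better" and treating a relative drop of more than 5% as significant. The function name, the score names, and the 5% threshold are all arbitrary choices for illustration.

```python
def triage(algo_today, algo_baseline, bench_today, bench_baseline,
           tolerance=0.05):
    """Attribute an algorithm regression to the backend or to the circuit.

    bench_today and bench_baseline map benchmark names to scores.
    tolerance is the relative drop treated as significant (an assumption).
    """
    def dropped(today, baseline):
        return (baseline - today) / baseline > tolerance

    if not dropped(algo_today, algo_baseline):
        return "no regression"
    degraded = [name for name, base in bench_baseline.items()
                if dropped(bench_today[name], base)]
    if degraded:
        return "backend: " + ", ".join(sorted(degraded))
    return "suspect your circuit"

# Made-up scores: algorithm success fell from 0.60 to 0.42.
baseline = {"mirror": 0.93, "rb": 0.995}
bad_day = {"mirror": 0.80, "rb": 0.995}

verdict_backend = triage(0.42, 0.60, bad_day, baseline)
verdict_bug = triage(0.42, 0.60, baseline, baseline)
```

The first call blames the backend because the Mirror score fell along with the algorithm; the second points back at the circuit because the benchmarks held steady.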

All five benchmarks live in the catalog at quantum-gap.com. Simulator runs are free; hardware runs are $5/QPU-second, and the calibration data IBM publishes daily is bundled into every audit report.
