§ 2.4  ·  Research area

Approximate evaluation

CRESCENT and successors — cheap-then-full evaluation gating that redirects compute toward the candidates that deserve a full-fidelity score.

§ 1   Why approximate evaluation

The compute the search wastes is structural, not incidental

At cluster scale, the time spent inside the evaluation function dwarfs the time spent inside selection, mutation, and even the refinement pipeline. The bigger the dataset, the more dominant evaluation becomes — for a non-trivial regression target the loss is computed across every training row of every candidate of every generation.

Most of those evaluations are wasted. The vast majority of candidates the search produces are bad — the search is supposed to produce many bad candidates so the rare good ones can stand out. Spending a full-fidelity evaluation budget on candidates that will be discarded after one generation is a structural inefficiency, not a tuning issue. CRESCENT and its successors treat the question of which candidates deserve a full score as a search problem in its own right.

§ 2   Sub-areas

Four threads of work

§ 2.1

Two-stage evaluation gating

  • Cheap test first — every candidate runs through a low-fidelity evaluation before it can earn a full one. Candidates that fail the cheap test are scored on the cheap test alone and never pay the full cost. Candidates that pass it pay the full cost and the full score replaces the cheap one for downstream use.
  • One pass, one decision — the gate is a single decision, not an iterative racing protocol. This keeps the policy simple to reason about and makes the cost model easy to bound.
  • Open questions — when, if ever, to add a third tier between cheap and full; how to handle candidates that are borderline at the cheap stage; whether the full-stage score should retroactively recalibrate the cheap test for similar future candidates.
§ 2.2

Cheap-test design

  • Binning over data points — group training rows into bins, evaluate on a representative per bin, fold the bin-level losses up. The first CRESCENT prerequisite shipped this binning machinery; it is the foundation under the full gating policy.
  • Sampling — score a randomly sampled subset of the training data. Cheap, statistically well-behaved, but loses any signal carried by infrequent rows.
  • Coarsened loss — keep all rows but compute a less-precise statistic (a quantile rather than a mean, an integer histogram bin rather than a continuous residual). Different failure modes than sampling; complementary, not overlapping.
  • What "good cheap" means — the test must correlate with the full-fidelity score on candidates that matter, while running fast enough to pay for itself. Correlation on bad candidates is irrelevant; the gate only needs to be informative on the population's surviving tail.
  • Open questions — automatic cheap-test selection per dataset; adaptive bin count under shifting data distributions; non-i.i.d. binning when the data has explicit structure.
§ 2.3

Dynamic thresholds

  • Static thresholds underweight progress — the population's quality distribution shifts as the run progresses. A threshold that was well-calibrated in generation 10 is too lax in generation 100; most candidates pass the gate, the cheap test stops gating anything, and the speedup collapses.
  • Population-aware gating — threshold derived from current population statistics — a quantile of recent cheap-test scores, possibly with a slack term. The gate moves with the run; selection pressure stays effective.
  • Feedback-loop concerns — when the threshold depends on the cheap test and the cheap test has a noise floor, a too-tight threshold creates an oscillating gate that lets candidates through in batches. Smoothing helps; over-smoothing reintroduces the static problem.
  • Open questions — slack-term policy under different population sizes; threshold reset on strategy alternation; feedback damping in the absence of strong signal.
§ 2.4

Variable-cost evaluation

  • Cost is candidate-dependent — a tall, expensive expression takes substantially longer to evaluate than a small one. Treating evaluation cost as constant across the population — as racing-style literature usually does — overcounts the cost of small candidates and undercounts the cost of large ones.
  • Cost-aware gating — the gate decision should weigh expected savings against expected risk: if the full evaluation of a candidate would have been cheap anyway, the gate is paying setup cost for little gain. If it would have been expensive, even a noisy gate is a win.
  • Cost prediction from structure — candidate cost is mostly a function of structure (depth, function-node count, mode of dependence on the data). A cheap structural cost predictor lets the gate adapt without running the full eval first.
  • Open questions — interaction with constant optimisation (which dominates eval cost for many candidates); cost-prediction calibration drift across datasets; gate behaviour when the variance in candidate cost is itself the signal.
§ 3   Why it matters

Three consequences of this design

Compute follows qualityMost of the per-generation budget routes to the candidates that have already cleared one bar. Bad candidates pay their cheap cost and disappear; the rare good ones get the full attention.
Scaling reshapesThe savings grow with dataset size. As the full evaluation gets more expensive, the cheap test stays cheap, and the gate's value scales faster than linearly in the dimension that matters most for big-data targets.
Cost is decoupled from cardinalityA larger population is no longer quadratically more expensive. The gate's marginal cost per candidate is small; the full eval's marginal cost dominates only on candidates that have already cleared the gate.
§ 4   Notes & references

Related material

The narrative companion to this area is the note Approximate evaluation, as a research problem. CRESCENT is a paper candidate; its first MR (binning prerequisite) shipped, with SRBench harness work and ablations the next steps.

Adjacent areas: Constant optimization § 2.3 shares the cost story — most of the time inside one full evaluation goes to constant optimisation, so a cost-aware gate has to be aware of which optimiser ran; Search strategies § 2.6 describes the dual diversity gates that the gate's output feeds. The pipeline view is described under Technology § 6 with Figure 3.

Read

Continue reading

Other research areas are listed under Research. Published work lives on Papers; short-form notes on Notes.

Back to Research areas