Scoring

A score here is a calibrated indicator that a capacity is present: never the capacity itself, and never the thing being optimised. A self-organising collective maintains its own viability; it has no intrinsic reason to play Pong. That it can be made to is what we measure, and a bare number means nothing until you say what it is measured against. So each task’s raw metric is mapped onto $[0,1]$ against two anchors, a floor and a ceiling:

$$\text{normalised} = \operatorname{clamp}\!\left(\frac{\text{raw} - \text{floor}}{\text{ceiling} - \text{floor}},\; 0,\; 1\right)$$

Read it as the fraction of the gap from chance to the reference that this collective closed. Zero is not “did nothing” — it is “no better than the null”.
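
As a minimal sketch of the mapping above, the clamp can be written in a few lines of Python. This is an illustration only, not the project's `normalized_score`; the function name, the argument order, and the error raised for a degenerate anchor pair are assumptions.

```python
def normalised_score(raw: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap that `raw` closes, clamped to [0, 1]."""
    if ceiling <= floor:
        # Assumption: a degenerate anchor pair is treated as an error.
        raise ValueError("ceiling must be strictly greater than floor")
    fraction = (raw - floor) / (ceiling - floor)
    return min(1.0, max(0.0, fraction))


# Forage anchors quoted on this page: null floor of about 0.46, analytic ceiling 1.0.
print(normalised_score(0.46, 0.46, 1.0))  # 0.0: no better than the null
print(normalised_score(0.30, 0.46, 1.0))  # 0.0: below the null is clamped, not negative
print(normalised_score(1.00, 0.46, 1.0))  # 1.0: reaches the ceiling
```

A raw forage score of 0.73 would land at roughly 0.5, since it closes about half of the gap between 0.46 and 1.0.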

Anchors carry their provenance

ScoreAnchor replaces the hand-typed magic numbers. Each anchor is a value, a kind, and a provenance string, so no floor or ceiling is a number someone once eyeballed:

  • ANALYTIC: a principled bound, such as tracking’s chance level $\mathbb{E}[\cos\theta]=0$, or a true maximum such as “intercept every ball” ($\text{hit\_rate}=1$).
  • NULL_MEASURED: a measured chance floor, meaning the score of a null policy that was actually run, not guessed.
  • REFERENCE_MEASURED: a measured ceiling from a named reference agent.

Source: src/tasks/Scoring.jl (ScoreAnchor, normalized_score), src/tasks/Tasks.jl (the per-task anchors).
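
The value/kind/provenance triple can be pictured as a small immutable record. The Python sketch below is illustrative only and does not reproduce the Julia `ScoreAnchor`: the field names, the `AnchorKind` enum, and the rule that rejects an empty provenance string are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class AnchorKind(Enum):
    ANALYTIC = "analytic"
    NULL_MEASURED = "null_measured"
    REFERENCE_MEASURED = "reference_measured"


@dataclass(frozen=True)
class ScoreAnchor:
    value: float
    kind: AnchorKind
    provenance: str

    def __post_init__(self) -> None:
        # Assumption: an anchor without provenance is rejected outright.
        if not self.provenance.strip():
            raise ValueError("an anchor must say where its value came from")


# Two analytic anchors described on this page.
tracking_floor = ScoreAnchor(0.0, AnchorKind.ANALYTIC, "chance level, E[cos(theta)] = 0")
pong_ceiling = ScoreAnchor(1.0, AnchorKind.ANALYTIC, "intercept every ball, hit_rate = 1")
```

Freezing the record means an anchor cannot be edited after it is created, so the number and the explanation of where it came from always travel together.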

The null is model-agnostic

The floor is the score of :null_random — a policy that ignores its input and emits uniform-random effectors. Model-agnostic on purpose: a sweep varies the node type, and a “dead reservoir” null would mean something different for each, whereas “no better than acting at random” is the same zero for every model. calibrate_task runs it over seeds and records the floor with provenance, e.g. null=null_random, score_key=forage_score, seeds 0:7, git ….

This is why several floors sit well above zero: a random forager still drifts $\approx 0.46$ of the way to the source just by wandering, so a forage score only counts once agents beat undirected search.

Source: src/nodes/NullRandom.jl, src/tasks/Calibration.jl (calibrate_task).
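
A rough Python sketch of the calibration idea follows: run an input-ignoring random policy over a fixed set of seeds and record the result together with a provenance string. Everything here is hypothetical: `toy_episode` is a made-up one-dimensional stand-in for a task, averaging over seeds is an assumed aggregation, and none of it mirrors the actual `calibrate_task` signature.

```python
import random
from statistics import mean


def null_random_policy(observation, rng, n_effectors=2):
    """Ignore the observation; emit uniform-random effector values in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n_effectors)]


def toy_episode(policy, seed, steps=50):
    """Hypothetical 1-D task: start at distance 1 from a source at position 0."""
    rng = random.Random(seed)
    position = 1.0
    for _ in range(steps):
        effectors = policy(position, rng)
        position += 0.05 * effectors[0]
    # Fraction of the initial distance covered, kept inside [0, 1].
    return min(1.0, max(0.0, 1.0 - abs(position)))


def calibrate_floor(episode, seeds, score_key):
    """Run the null policy once per seed; return (floor, provenance)."""
    scores = [episode(null_random_policy, seed) for seed in seeds]
    # Assumes `seeds` is a contiguous, ordered range.
    provenance = f"null=null_random, score_key={score_key}, seeds {seeds[0]}:{seeds[-1]}"
    return mean(scores), provenance  # assumption: the floor is the mean over seeds


floor, provenance = calibrate_floor(toy_episode, list(range(8)), "toy_score")
```

Because each episode seeds its own generator, rerunning the calibration with the same seeds reproduces the same floor, which is what makes the provenance string meaningful.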

Gates keep the indicator honest

An indicator is only trustworthy if a degenerate solution can’t game it. The wall task rewards collision avoidance, but “avoid every collision” is trivially won by not moving. So the wall score is collision-free navigation gated by movement: a frozen agent scores $\approx 0$ because the gate, not the score, zeroes it. The gate encodes the unstated half of the task (“while behaving”); it is never the competence signal. (The same pattern gates rhythm tasks on oscillation amplitude.)
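
One way to picture the gating is as a multiplicative factor that is zero for a frozen agent and one for an agent that is clearly moving. The Python sketch below is an assumption throughout: the linear ramp, the `min_speed` threshold and its default, and both function names are hypothetical and may not match how the wall task actually computes its gate.

```python
def movement_gate(mean_speed: float, min_speed: float) -> float:
    """0 for a frozen agent, rising linearly to 1 at `min_speed` and above."""
    if min_speed <= 0:
        raise ValueError("min_speed must be positive")
    return min(1.0, max(0.0, mean_speed / min_speed))


def gated_nav_score(collision_free: float, mean_speed: float, min_speed: float = 0.1) -> float:
    """Collision-free fraction, multiplied by the movement gate."""
    return collision_free * movement_gate(mean_speed, min_speed)


print(gated_nav_score(1.0, 0.0))  # 0.0: never collides, but only because it never moves
print(gated_nav_score(0.9, 0.5))  # 0.9: moving freely, so the gate leaves the score alone
```

Once the agent is moving, the gate saturates at 1 and contributes nothing further, so differences between moving agents come from the collision-free fraction alone.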

Descriptors are read alongside a score, never as one

Each task names one score_key (the capacity indicator) plus any descriptor_keys — channels reported next to it but never treated as competence. Collective order parameters such as polarization and milling are descriptors in exactly this sense: they measure how a swarm is organised, not whether it succeeded, and reporting one as a “score” would confuse a quality with a capacity.
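
The split between the single score and its descriptors can be sketched as a small lookup over a metrics dictionary. This Python version is illustrative only: `TaskChannels`, `split_metrics`, the pairing of `forage_score` with these particular descriptors, and the metric values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskChannels:
    score_key: str                          # the one capacity indicator
    descriptor_keys: tuple[str, ...] = ()   # reported alongside, never scored


def split_metrics(channels, metrics):
    """Return (score, descriptors); descriptors never feed into the score."""
    score = metrics[channels.score_key]
    descriptors = {key: metrics[key] for key in channels.descriptor_keys}
    return score, descriptors


# Made-up metric values, for illustration only.
channels = TaskChannels("forage_score", ("polarization", "milling"))
metrics = {"forage_score": 0.7, "polarization": 0.9, "milling": 0.1}
score, descriptors = split_metrics(channels, metrics)
```

Keeping the descriptors in a separate mapping makes it hard to accidentally rank runs by an order parameter.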

Where the anchors stand today

Every current task has an analytic ceiling, and most pair it with a null-measured floor (tracking’s floor is the analytic chance level). This is the honest configuration when no good trained agent yet exists to serve as a reference:

| task | floor | ceiling |
|---|---|---|
| wall | null ≈ 0.76 (nav_score) | analytic 1.0 (collision-free navigation) |
| tracking | analytic 0.0 (chance) | analytic 1.0 |
| pong / pong_hitrate | null ≈ 0.36 (hit_rate) | analytic 1.0 (intercept every ball) |
| cartpole (+ swingup) | analytic 0.0 / null ≈ 0.16 | analytic 1.0 |
| forage | null ≈ 0.46 | analytic 1.0 (agents on source) |

A REFERENCE_MEASURED ceiling — a genuinely good, trained agent as the “1” — is a TODO(reference-genome): until such an agent exists, the analytic maximum is the honest ceiling. Scores are mutually comparable only within a ceiling kind — analytic-ceiling tasks share “1 = true optimum”; a future reference-ceiling task would be comparable only relative to its own reference.

Source: src/run/Sweep.jl (_sim_score, descriptor columns), src/tasks/Tasks.jl.