Build journal · Codex × Handshake 2026

How WinSim AI actually got built.

The most useful thing I could build wasn't another product for users. It was a product for the judges. Here's how I got there.

Author

Debi Aggers

Read time

6 min

Total cost

$2.50

Failures

0 / 577

I didn't have an idea.

The hackathon dropped on me with the same panic anyone gets staring at a blank canvas. I had a stack of ideas that all felt mid — a todo app with vibes, another GPT wrapper, a clever-but-niche tool nobody actually asked for. None of it felt like it would cut through hundreds of other submissions, and I knew it.

So I stopped. I went and read the rubric.

The rubric is the brief.

Five dimensions: usefulness, execution, creativity, clarity, usability. Each one is a 1–100 score, and the five are combined into a weighted total. Usefulness and execution share the biggest weight at 25% each. Together they're half the score.
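The weighting is easier to feel in code. This is a minimal sketch, not the rubric's actual formula: only the 25% weights for usefulness and execution come from the rubric as I read it. The 20/15/15 split for the other three dimensions is a placeholder I picked so the weights sum to 100%, and `weighted_total` is a hypothetical helper name.

```python
# Usefulness and execution at 25% each are stated in the rubric.
# The remaining three weights are ASSUMED placeholders, not the real values.
WEIGHTS = {
    "usefulness": 0.25,
    "execution": 0.25,
    "creativity": 0.20,  # assumption
    "clarity": 0.15,     # assumption
    "usability": 0.15,   # assumption
}


def weighted_total(scores: dict) -> float:
    """Combine five 1-100 dimension scores into a single weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

Whatever the real split is, a ten-point swing in usefulness or execution moves the total by 2.5 points, which is why those two dimensions are where the contest is decided.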

I sat with that for a minute and asked the obvious question: useful to whom?

If usefulness is what wins, the most useful thing I could possibly build is a tool for the judges themselves.

To the judges.

A real human is about to read 577 project descriptions and assign 5 scores to every one of them. That's nearly three thousand individual judgments, plus rationale, plus comparison, plus a final ranking. They will get tired. They will get inconsistent. They will under-attend the 400th submission relative to the 5th. And yet — they have to.

If usefulness is the lever that wins this hackathon, the most useful thing I could possibly build isn't another product for users. It's a product for the people who actually have to grade 577 of these. Something that does the work nobody wants to do but everyone needs done.

That was the whole insight. Everything else was just wiring.

So I built the judge.

WinSim AI is an AI judge that reads every Codex × Handshake submission, fetches the live site, scores it on the same five-dimension rubric, then runs a three-judge panel through 500 Monte Carlo iterations to predict the finishing order. The output is a deterministic leaderboard (same seed every visit, no rerun theater) and a chatbot that lets a real judge ask natural-language questions about any of the top 50 entries.

Projects judged

577

Compute cost

$2.50

Wall-clock runtime

27 min

Retry failures

0

How it actually works.

Three pipelines, one direction.

1. Live-site review

A scripted artifact fetcher hits every project's external URL and captures the HTTP status, page title, meta description, OG image, content length, and flags for login walls or error pages. The AI judge sees this before it scores — so a project with a working URL whose page title actually matches the project name gets credit for shipping. A project whose URL 404s, sits behind a login wall, or returns a stub gets capped on execution and usability regardless of how polished the writeup reads. This one signal does more for fairness than any prompt-engineering tweak.
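Here is a rough sketch of the kind of summary the fetcher hands to the judge. It is not the fetcher itself: the HTTP request is left out so the example runs offline, `Artifact` and `summarize` are hypothetical names, and the login-wall and error-page heuristics are my assumptions for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


class _MetaParser(HTMLParser):
    """Collects the title, meta description, og:image, and password inputs."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description: Optional[str] = None
        self.og_image: Optional[str] = None
        self.has_password_field = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if (a.get("name") or "").lower() == "description":
                self.description = a.get("content")
            elif (a.get("property") or "").lower() == "og:image":
                self.og_image = a.get("content")
        elif tag == "input" and (a.get("type") or "").lower() == "password":
            self.has_password_field = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


@dataclass
class Artifact:
    status: int
    title: str
    description: Optional[str]
    og_image: Optional[str]
    content_length: int  # characters of HTML, not bytes on the wire
    login_wall: bool
    error_page: bool


def summarize(status: int, html: str) -> Artifact:
    """Turn an HTTP status and response body into the signals the judge sees."""
    parser = _MetaParser()
    parser.feed(html)
    return Artifact(
        status=status,
        title=parser.title.strip(),
        description=parser.description,
        og_image=parser.og_image,
        content_length=len(html),
        # Both flags below are assumed heuristics, not the real rules.
        login_wall=status in (401, 403) or parser.has_password_field,
        error_page=status >= 400,
    )
```

The status and body would come from whatever HTTP client does the fetching. Keeping the parsing separate from the network call also makes the summary easy to test against saved pages.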

2. The judge

GPT-4o, anchored to a calibrated rubric prompt with explicit score bands. 95+ is reserved for the top 1–2% of the field. Sub-65 is for clearly weak entries. Explicit caps fire when artifacts are missing: execution can't exceed 75 with no live URL, and usability can't exceed 65 if there's no usable artifact at all. The model returns five integer scores and a 2–3 sentence judge's note as JSON that conforms to a strict schema. Temperature is 0.2, for stability without total rigidity.
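The caps and the output contract can also be checked outside the prompt. A minimal sketch, assuming the reply is a flat JSON object keyed by dimension name plus a `note` field (those key names are my assumption, as are the helper names `parse_judgment` and `apply_caps`); the cap values 75 and 65 are the ones described above.

```python
import json

DIMENSIONS = ("usefulness", "execution", "creativity", "clarity", "usability")


def parse_judgment(raw: str) -> dict:
    """Validate a model reply: five integer scores (1-100) plus a judge's note."""
    data = json.loads(raw)
    for dim in DIMENSIONS:
        score = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 100:
            raise ValueError(f"{dim!r} must be an integer from 1 to 100, got {score!r}")
    note = data.get("note")  # key name is an assumption
    if not isinstance(note, str) or not note.strip():
        raise ValueError("missing judge's note")
    return data


def apply_caps(scores: dict, has_live_url: bool, has_usable_artifact: bool) -> dict:
    """Return a copy of the scores with the missing-artifact caps enforced."""
    capped = dict(scores)
    if not has_live_url:
        capped["execution"] = min(capped["execution"], 75)
    if not has_usable_artifact:
        capped["usability"] = min(capped["usability"], 65)
    return capped
```

Enforcing the caps in code as well as in the prompt means a polished writeup can't talk the model past them.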

3. The panel

Three judge profiles (Balanced Product, Technical, UX & Novelty), each with slightly different rubric weights and ±2-point noise. The Monte Carlo simulation runs the panel against the field 500 times under a fixed seed (1337). Win probability is the share of runs where a project finishes first. Top-3 rate is the share where it finishes in the top three. Average rank is the mean of its rank across all 500 simulated rounds.
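The panel step can be sketched in a few lines. Treat this as an illustration under stated assumptions rather than the production code: the per-judge weights are invented, the noise is drawn uniformly and applied per dimension (the real noise model may differ), and `simulate` is a hypothetical name. The 500 runs, the ±2 points, and the seed 1337 are the values described above.

```python
import random
from collections import defaultdict

DIMS = ("usefulness", "execution", "creativity", "clarity", "usability")

# Per-judge weights over DIMS. These numbers are ASSUMED for illustration.
JUDGES = {
    "balanced_product": (0.25, 0.25, 0.20, 0.15, 0.15),
    "technical": (0.20, 0.35, 0.15, 0.15, 0.15),
    "ux_novelty": (0.20, 0.15, 0.30, 0.10, 0.25),
}


def simulate(projects: dict, runs: int = 500, seed: int = 1337, noise: float = 2.0) -> dict:
    """Rank the field `runs` times and report win, top-3, and average-rank stats."""
    rng = random.Random(seed)  # fixed seed, so every call gives the same leaderboard
    wins = defaultdict(int)
    top3 = defaultdict(int)
    rank_sum = defaultdict(int)
    for _ in range(runs):
        totals = {}
        for name, scores in projects.items():
            panel = 0.0
            for weights in JUDGES.values():
                # Each judge sees every dimension score nudged by up to +/- noise.
                panel += sum(
                    w * (scores[d] + rng.uniform(-noise, noise))
                    for w, d in zip(weights, DIMS)
                )
            totals[name] = panel / len(JUDGES)
        ranking = sorted(totals, key=totals.get, reverse=True)
        for rank, name in enumerate(ranking, start=1):
            rank_sum[name] += rank
            wins[name] += rank == 1
            top3[name] += rank <= 3
    return {
        name: {
            "win_probability": wins[name] / runs,
            "top3_rate": top3[name] / runs,
            "average_rank": rank_sum[name] / runs,
        }
        for name in projects
    }
```

Because the generator is created inside the function from the seed, two calls with the same field return identical numbers, which is the property the leaderboard depends on.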

Methodology choices that matter.

Why no rerun button?

Because the noise budget is small enough that rerunning is theater — it just lets a user hunt for a result they like. Determinism is honesty. The leaderboard you see is the same leaderboard the next person sees. That's the entire point.

Why GPT-4o and not mini?

Calibration. gpt-4o-mini drifts toward the 80s for almost everything; the field flatlines at the median and the simulator stops differentiating. GPT-4o uses the full range when the prompt tells it to. The actual distribution: 1% below 50, 27% in the mid-band (50–69), 32% solid (70–79), 28% strong (80–89), 4% exceptional (90+). That shape is what makes the win-probability math meaningful.
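Checking whether a model uses the full range comes down to bucketing totals by band. A small sketch, assuming the band edges quoted above; the "below 50" label and the function names are mine.

```python
from collections import Counter

# (exclusive upper bound, label). Edges follow the bands quoted above.
BANDS = [
    (50, "below 50"),  # label is my own; this band has no name in the writeup
    (70, "mid-band"),
    (80, "solid"),
    (90, "strong"),
]
TOP_BAND = "exceptional"


def band(total: float) -> str:
    """Map a total score to its band label."""
    for upper, label in BANDS:
        if total < upper:
            return label
    return TOP_BAND


def band_shares(totals: list) -> dict:
    """Share of the field that lands in each band, as fractions of 1."""
    if not totals:
        raise ValueError("need at least one total")
    counts = Counter(band(t) for t in totals)
    labels = [label for _, label in BANDS] + [TOP_BAND]
    return {label: counts[label] / len(totals) for label in labels}
```

A model that drifts toward the 80s shows up immediately here as one band swallowing most of the field.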

Why is my own project not on the leaderboard?

Because it would be a conflict of interest. The whole purpose of WinSim AI is to simulate the judging of other entries; if I included myself in my own simulator, the result would be tainted on its face. So I deliberately excluded it from the field. That's called out in the homepage notice in big letters because it should be the first thing a judge sees.

What I'd do with another week.

Push the panel up from the Monte Carlo step into the score step. Run each project through three differently-prompted GPT-4o calls and average their scores. Right now "3 judges" lives in the variance layer; running it at the scoring layer would tighten calibration further.
Calibrate against a hand-scored sample. Have a real human score 30 projects, then compute correlation against the AI scores. Publish the inter-rater agreement number. Without that, "98th percentile" is a precise-sounding claim with no error bar — and that bothers me more than it should.
A live-update path. Right now the simulation only re-runs when projects.json changes. Hooking that to a cron job and Codex's own submission feed would make the leaderboard truly live.
A per-project chat that includes the full live-site fetch — not just the metadata summary. The judge model could then quote the actual page when explaining a ranking.
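
The calibration item is the one I'd start with, and the math is small. A minimal sketch of the two numbers I'd want to publish, assuming paired lists of human and AI totals for the same projects; Pearson correlation is one reasonable choice of agreement measure, not the only one, and both function names are hypothetical.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def mean_abs_gap(xs: list, ys: list) -> float:
    """Average absolute difference in points between paired scores."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty lists")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

Feeding in the 30 hand-scored projects alongside their AI totals would give a correlation and a typical gap in points, which is the error bar the percentile claims are currently missing.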

What this is not.

This is not a substitute for human judging. It's a tool to help the people who are doing the judging — by surfacing front-runners, flagging weak submissions, and answering questions in natural language about why a project ranks where it does. The model is wrong sometimes. The amber notice on the homepage isn't false modesty; it's the honest framing of what this thing actually is.

The bet.

If usefulness is 25% of the score, and the judges have to score 577 projects, then the maximum-usefulness submission isn't another product for users. It's a product for the judges themselves.

That was the whole insight. The rest was just wiring.