✅ Testing & Quality

Eval-driven dev, mutation testing, visual regression, proving quality.

Repo • openclaw

openclaw/crabbox

A CLI that leases cloud/self-hosted compute, syncs your local diff, and runs commands/tests remotely.

~950 stars; a Go CLI plus a TypeScript coordinator.
40+ infrastructure providers supported.
Core loop: warm a box, sync the diff, run the suite — coordinator deployable on Cloudflare Workers.

added by Adam Tomat • 24th Jun 2026

Blog • Decoding AI • Alejandro Aboy

A pre-merge evaluation gate for AI agent changes, using simulated inputs run through the real agent.

An offline gate answering "does it work / did anything regress".
Simulate inputs (drawn from real traces), not outputs.
Rejects always-on prod eval as too costly; runs targeted branch experiments with calibrated binary judges.
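The gate described above can be sketched roughly as follows. This is a minimal illustration in Python, not the article's implementation: `Case`, `run_gate`, `toy_agent`, `toy_judge`, and the sample inputs are all hypothetical names and data, and the baseline pass rate is an assumed value that a real setup would take from the main branch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    name: str
    simulated_input: str  # hypothetical input, imagined as derived from a real trace


def run_gate(
    agent: Callable[[str], str],
    cases: List[Case],
    judge: Callable[[str, str], bool],
    baseline_pass_rate: float,
) -> Dict[str, object]:
    """Run simulated inputs through the real agent and apply a binary judge."""
    verdicts = {c.name: judge(c.simulated_input, agent(c.simulated_input)) for c in cases}
    pass_rate = sum(verdicts.values()) / len(verdicts)
    failed = [name for name, ok in verdicts.items() if not ok]
    # The gate passes only if the branch does not fall below the baseline.
    return {"pass_rate": pass_rate, "failed": failed, "ok": pass_rate >= baseline_pass_rate}


def toy_agent(text: str) -> str:
    # Stand-in for the real agent under test (assumption for illustration).
    return "refund issued" if "refund" in text.lower() else "escalated to human"


def toy_judge(inp: str, out: str) -> bool:
    # Binary judge: a refund request must produce a refund, and nothing else may.
    wants_refund = "refund" in inp.lower()
    return wants_refund == (out == "refund issued")


CASES = [
    Case("refund_simple", "I want a refund for order 12"),
    Case("refund_shouting", "REFUND NOW"),
    Case("greeting", "hello there"),
]

if __name__ == "__main__":
    report = run_gate(toy_agent, CASES, toy_judge, baseline_pass_rate=1.0)
    print(report)
```

Only the inputs are simulated here; the outputs come from whatever agent function is passed in, so a regression in the agent shows up as a lower pass rate and a named list of failing cases.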

added by Adam Tomat • 23rd Jun 2026

Blog • Addy Osmani

Reframes the SDLC: agents are mostly harness, and verification moves to the centre.

added by Adam Tomat • 23rd Jun 2026

Blog • Garrett Lord

Argues that private evals built from workflow plus domain judgement become durable competitive IP.

Evals encode hard-won domain knowledge competitors can't easily copy.
They turn "is the AI good at our job?" into a measurable, owned asset.
The next era of AI advantage is defined by proprietary evaluation, not just models.

added by Adam Tomat • 22nd Jun 2026

Blog • Mujahid Abbas

A practical guide to building AI-output evals (distinct from unit tests) in a Laravel app.

Tests check "did it run"; evals check "is it good".
Golden datasets for deterministic outputs, an LLM-judge for open-ended ones.
Run on a schedule / before prompt changes, sampling prod traffic to catch model drift.
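The split between golden datasets and an LLM judge can be sketched as below. The guide itself targets Laravel; this sketch uses Python purely for illustration, and every name in it (`evaluate`, `stub_judge`, the `expected` and `rubric` keys, the 0.7 threshold) is an assumption rather than anything taken from the guide. The judge is a stub; a real one would call a model.

```python
from typing import Callable, Dict


def evaluate(
    case: Dict[str, str],
    output: str,
    llm_judge: Callable[[str, str], float],
    threshold: float = 0.7,  # assumed cut-off, not from the guide
) -> bool:
    """Score one output: exact match for golden cases, judge score otherwise."""
    if "expected" in case:
        # Deterministic output: compare against the golden answer.
        return output.strip() == case["expected"].strip()
    # Open-ended output: ask the judge to score it against a rubric.
    return llm_judge(case["rubric"], output) >= threshold


def stub_judge(rubric: str, output: str) -> float:
    # Hypothetical stand-in for an LLM call: rewards an apology, nothing more.
    return 0.9 if "sorry" in output.lower() else 0.2


if __name__ == "__main__":
    golden = {"expected": "42"}
    open_ended = {"rubric": "Reply should apologise for the delay."}
    print(evaluate(golden, " 42\n", stub_judge))
    print(evaluate(open_ended, "Sorry for the wait, your order ships today.", stub_judge))
    print(evaluate(open_ended, "Your order ships today.", stub_judge))
```

Keeping both paths behind one function means the same runner can be scheduled or triggered before a prompt change, regardless of which kind of case it is scoring.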

added by Adam Tomat • 18th Jun 2026

Repo • Paul Duvall