happyInference — inference management for agents

fabricated DOIs caught in a single 10-topic run — every one a silent failure, verified against Crossref

40%

measured held-out faithfulness of the weak agent before repair — the honest baseline, from run_history.jsonl

268

offline, deterministic tests across 42 files — all passing before every step landed

mock numbers in the console — every chart reads real run history, with honest empty states

The supervision loop

Watch → Catch → Diagnose → Repair → Prove

Observability platforms surface the problem and wait for a human to click approve. happyInference closes the loop autonomously — and proves the lift on data it never diagnosed on.

Watch

Every span of the supervised agent streams into Arize Phoenix via OpenInference auto-instrumentation.

sentinel/tracing.py

Catch

The redactor strikes unverifiable claims before output ships. Every DOI hits the Crossref oracle — fail-closed, so a plausible fake can’t pass.

sentinel/redactor.py

Diagnose

Root cause found by querying the agent’s own failing spans through the Phoenix MCP server — at runtime, not in a dashboard after the fact.

sentinel/diagnostician.py

Repair

GEPA-style reflective prompt optimization rewrites the failure away — no weight retraining — reinforced by Reflexion memory across runs.

sentinel/repairer.py

Prove

Faithfulness re-measured on a held-out set disjoint from the diagnosis batch. Regressing fixes are blocked — or auto-reverted over A2A.

sentinel/measure.py

Proof, not promises

The receipt from a real run

Recorded to data/run_history.jsonl on June 9, 2026 — a 10-topic supervision cycle against the deliberately weak research agent. Nothing below is invented; the judges can run the repo.

run 131b1f86e1f9 · worker · 10 topics 18 fabrications caught

✕10.1038/s41574-020-00353-5Intermittent fasting & metabolic healthCrossref 404 · fabricated_source
✕10.1145/3368015.3370139Remote work & developer productivityCrossref 404 · fabricated_source
✕10.1126/sciadv.adg7224Four-day work weeks & employee outputCrossref 404 · fabricated_source
✕10.1038/s41591-021-01645-0AI models detecting early-stage cancerCrossref 404 · fabricated_source
✕10.1038/s41558-021-01114-3Carbon-capture cost per tonCrossref 404 · fabricated_source

18 caught · all silent20 reward rows3 preference pairs40% held-out faithfulness before repair

Diagnosis — from the agent’s own failing spans

“The agent’s prompt instructs it to reconstruct plausible DOIs if the exact one cannot be recalled… This encourages the agent to invent DOIs rather than admit uncertainty.”

The shipped repair

reconstruct a plausible DOI if the exact
one cannot be recalled
If you are not highly confident that a DOI
is real and published, you MUST leave the
source field empty. Never guess, approximate,
or fabricate a DOI.

Gate verdict

decision · approved status · shipped regression gate · clear

Glass box, not black box

Watch it think,
in plain English

Live narration. Every step of the loop is narrated as it happens — what it’s doing to the inference, and why, with the real number.
Claims struck live. Fabricated citations appear struck-through in the glass box the moment the oracle returns a 404.
Human in the loop, optionally. An approval gate pauses every fix for Approve / Reject before it is taught over A2A — toggle it off and the loop runs fully autonomous.

happyInference console on a real run — the approval gate showing the exact policy being approved, and the raw SSE receipt JSON for the observe step

Bring your own agent

Supervises agents it doesn’t own

Two real seams into any production agent — whether it cooperates or not. The demo Worker is just the reproducible failure source; the verifier is pluggable.

A2A · cooperates

Point it at an Agent Card

Connect any standards-compliant A2A agent by URL. happyInference discovers its card, observes, verifies, repairs, teaches — and proves the lift on a disjoint proof set.

# discovery at /.well-known/agent-card.json
GET /api/collaborate?agent_url=http://localhost:8010

JSON-RPC message/send · bounded tasks/get polling · auto-revert advisories

Gateway · drop-in

Or route its base URL through the gateway

For agents whose internals you can’t touch. No SDK, no code change beyond one environment variable — every response verified, the latest adopted advisory injected.

export OPENAI_BASE_URL=\
  https://gateway.happyinference.ai/v1
export X_HAPPYINFERENCE_AGENT=acme-support

non-blocking · per-agent policy via X-Sentinel-Agent

Agents under supervision — the fleet view with live reliability per agent: Research Assistant at 100%, an A2A remote at 33% alerting, Agent X at 56%, academic_coordinator at 81% needs-watch

The fleet — every agent registered once over A2A, then supervised from the canvas: live reliability, catches, and drift alerts per agent.

Built on

The stack, used for real

Built for the Google Cloud Rapid Agent Hackathon, Arize track — observability data isn’t reviewed after the fact, it is the input to the repair.

Gemini

Brain for both agents — and the verification-aware reward model in the RL layer.

Google ADK

Worker + supervisor as LlmAgents; the Worker served over A2A with one to_a2a() call.

Arize Phoenix

Every span traced via OpenInference; faithfulness scored with Gemini LLM-as-a-judge evals.

Phoenix MCP

The channel for reading the supervised agent’s failing spans at runtime — the diagnosis input.

A2A protocol

Real Agent Card discovery, JSON-RPC messaging, and revert advisories across the network boundary.

Cloud Run

Hosts the FastAPI + SSE backend and the mission-control console.

Crossref

The ungameable external oracle — a fabricated DOI can’t sweet-talk an HTTP 404.

Prompt-space DPO

Textual-gradient preference optimization; reward rows and pairs exported per run.

Your agent fails silently.
happyInference doesn’t let it.

Watch → Catch → Diagnose → Repair → Prove

Watch

Catch

Diagnose

Repair

Prove

The receipt from a real run

Watch it think,
in plain English

Supervises agents it doesn’t own

Point it at an Agent Card

Or route its base URL through the gateway

The stack, used for real

Reliability as infrastructure,
not as a dashboard.

Watch → Catch → Diagnose → Repair → Prove

Watch

Catch

Diagnose

Repair

Prove

The receipt from a real run

Watch it think,in plain English

Supervises agents it doesn’t own

Point it at an Agent Card

Or route its base URL through the gateway

The stack, used for real

Reliability as infrastructure,not as a dashboard.

Watch it think,
in plain English

Reliability as infrastructure,
not as a dashboard.