The Evalunator Guide
A conversational tour through both the Design Model and the live Simulation. We'll start with the architecture — how the system is designed — then prove it works by running real scenarios. Takes about 15 minutes to read, longer if you explore as you go.
The Problem This Solves
Imagine a company where AI agents handle engineering, operations, finance, compliance, and product decisions. Not as assistants — as the primary executors. A handful of humans set strategy and hold accountability, but the day-to-day work is done by agents operating within codified policies.
Now ask: who watches them?
You can't ask the agents to audit themselves — that's marking your own homework. You can't rely on logs alone — a log tells you what happened, not whether the right thing would have happened under pressure. And you can't wait for something to go wrong — in a system that processes hundreds of decisions per day, a subtle drift in behaviour compounds silently.
Evalunator is the answer. It's an independent audit system that continuously verifies the whole operation by injecting synthetic transactions — like a doctor injecting dye to check for blockages — and watching whether policies hold, gates fire, and the system behaves as designed.
The Three-Layer Model
Let's start with the architecture. Open the Design Model and you'll see the Overview tab — a diagram showing three layers stacked on top of each other.
At the top: Human Principals. One to three people who own strategy, make judgment calls, and hold accountability. They don't do the day-to-day — they set direction.
In the middle: the Policy Layer. This is the key insight. Instead of middle managers interpreting rules on the fly, policies are codified as explicit if/then rules. "If a deploy hasn't passed tests, block it." "If an invoice exceeds £1,000, escalate to a human." The policies are the org design.
At the bottom: five Agent Functions — Engineering, Product & User Insight, Platform Operations, Compliance & Data Governance, and Finance & Admin. These aren't "agent employees." They're departments-in-a-box, each executing workflows within the constraints set by the policy layer.
And wrapping around the outside: Evalunator. Notice the access label — read-only plus dye inject. It can see everything and inject clearly-tagged synthetic transactions, but it cannot modify any system configuration, access raw customer data, or alter policies. It's a permanent, structurally independent observer.
"Why not just have humans do everything?"
Volume. An agent-led company might process hundreds of operational decisions per day — deploys, invoice approvals, compliance checks, user feedback triage. Humans set the rules and handle exceptions; agents execute the rules at scale. The policy layer is what makes this safe — it captures the "how we do things here" that middle managers would normally carry in their heads.
Inside the Functions
Switch to the Functions tab. You'll see five cards — one for each agent function. Each function contains multiple workflows, and each workflow is broken into discrete steps.
Let's click into Engineering. Expand the "Feature Development" workflow and look at the steps. Each step has a coloured badge showing who does the work:
- Agent — fully autonomous within policy bounds
- Automated — infrastructure-level (CI/CD, monitoring)
- Human Gate — agent prepares, human decides
- Agent + Human — collaborative steps
Notice step 6 has a POLICY annotation. This isn't a suggestion — it's a hard gate. The deploy cannot proceed unless tests pass and a human approves. The policy is a literal rule in the system, not a guideline someone might forget.
Scroll down and you'll see three more sections for each function: the key risks (what could go wrong), the Evalunator exposure surface (what Evalunator specifically tests for in this function), and the strangler migration path — the plan for moving from simulation to production one function at a time.
Now look at Finance & Admin. In the Billing workflow, there's a threshold policy: invoices over £1,000 require human approval. Remember this number — we'll probe this exact boundary in the Simulation later, sending invoices at £999, £1,000, and £1,001 to verify the gate is precise.
"How does the system know when to stop and ask a human?"
It's not intuition — it's literal if/then rules. "If invoice amount > £1,000, require human approval." "If deploy has no passing test suite, block." Each policy annotation in the model corresponds to a codified rule in the policy engine. Agents don't decide whether to escalate; the policy layer decides for them.
"What if the agent is wrong but the policy doesn't catch it?"
That's exactly the gap Evalunator fills. Policies catch known failure modes. Evalunator actively probes for unknown ones — injecting edge cases, adversarial inputs, and boundary conditions to find places where the design assumptions break down. We'll see how in section 4.
The Policy Layer
Switch to the Policies tab. This is one of the most important views in the model. It consolidates every human checkpoint across all five functions into a single view.
Think of this as the answer to "what do the humans actually need to do?" In a traditional company, this knowledge lives in people's heads and meeting cadences. Here, it's explicit. Every gate, every threshold, every escalation trigger — visible and auditable.
The policy layer replaces middle management. In a traditional company, managers interpret rules, handle exceptions, and decide when to escalate. Here, those decisions are codified as explicit rules. The policies aren't guidelines — they're the actual mechanism by which agent actions are constrained. This makes the system auditable in a way that human organisations rarely are.
"But who checks the policies are actually enforced?"
Exactly the right question. Having policies written down is necessary but not sufficient. You need something that actively verifies the gates actually gate — that a deploy really is blocked without tests, that an over-threshold invoice really does escalate. That's Evalunator's job.
Enter Evalunator
Now switch to the Evalunator tab. This is where the audit framework itself is documented. Let's walk through three key ideas: what it is, what it can access, and why it's separate.
The medical analogy is the clearest way to understand it: Evalunator works like injecting contrast dye into a patient and then scanning. The dye (synthetic transactions) flows through the system, and the scan (audit analysis) reveals blockages, leaks, and unexpected paths. The dye doesn't treat anything — it makes problems visible.
Scroll down to the access model. It's organised in three columns:
- CAN READ — audit logs, policy gate decision records, cost and billing calculation logs, data flow maps
- CAN WRITE (SYNTHETIC ONLY) — clearly-tagged test transactions via public APIs, synthetic user accounts (dye-tagged), test feedback submissions
- CANNOT DO — modify any N8R system configuration, access raw customer data or prompts, alter policies or agent behaviour
Evalunator runs on separate infrastructure — a different cloud account, potentially a different LLM provider. This isn't paranoia; it's avoiding correlated failures. If the main system's cloud provider has an outage, or the LLM develops a systematic bias, you need your auditor to be unaffected. Alerts go directly to human principals, never routed through agent functions.
"Why separate infrastructure? Isn't that overkill?"
Think about it this way: if your fire alarm runs on the same circuit as your lights, a power failure takes out both. Correlated failure is the enemy. By running Evalunator on a separate cloud account (and potentially a separate LLM provider), you ensure that the thing watching for problems can't be blinded by the same problem it's watching for.
The Four Test Categories
Still in the Evalunator tab, scroll down to the test categories. Each category tests something fundamentally different about the system's health.
Dye Injections are synthetic end-to-end transactions. Ghost Deploy pushes a fake feature through the entire engineering pipeline — does it hit the human gate? Does the audit log record every step? Invoice Round-Trip sends a synthetic invoice through Finance — does it get processed correctly and land in the right ledger?
Policy Compliance tests go further — they actively try to break the rules. Gate Crasher attempts a deploy without passing tests. Threshold Probe sends invoices at the exact boundary (£999, £1,000, £1,001). If these tests pass, it means the policies don't just exist — they actually enforce.
Chaos & Threat tests throw adversarial inputs at the system. Duplicate invoices, contradictory specs, astroturfing attacks (20 fake user requests), and even a self-injection attack where Evalunator tests whether the system properly rejects an attempt to modify its own configuration.
Drift Detection is the long game. The other categories test "does it work right now?" Drift detection compares behaviour over time — monthly re-baselines that catch silent changes. An LLM that starts interpreting a policy slightly differently, a threshold that's been quietly adjusted, a workflow step that takes twice as long as it used to.
"Why is Drift separate from Dye Injection?"
Dye injections test whether the system works right now. Drift detection compares now versus before. A dye injection test might pass today and pass next month, but if the response time doubled or the confidence score dropped, that's drift — something changed even though the outcome still looks correct. You need both.
Seeing It Work
Everything so far has been the design — how the system should work. Now let's prove it. Open the Simulation and you'll see a live, functional miniature of the entire architecture.
This isn't a mock-up. It has a real event bus routing events to function processors, a real policy engine with codified rules, and a full audit log recording every action. Let's start by injecting a manual event.
Try clicking the "Feature → Prod" button. Watch the Policy Gate Log — the gate fires and blocks the deploy. The system works exactly as the model described: no deploy without passing tests and human approval.
Now let's switch to the Evalunator view and run the full test suite.
Click "RUN ALL 13" and watch every scenario execute. But the pass/fail count isn't the interesting part — the details are. Each result expands to show the expected outcome, the actual outcome, and the full audit trace. Let's spotlight what these prove.
Ghost Deploy — a synthetic feature was pushed through Engineering. The deploy gate held correctly, with PENDING_HUMAN and BLOCKED visible in the audit trace. The result shows the expected outcome ("Production gate holds — deploy does not proceed without human") matched the actual ("Gate held correctly"). Policy isn't just written down — it fires in practice.
Invoice Round-Trip — a synthetic £500 invoice flowed through Finance and was auto-sent (under the £1,000 threshold). The audit trace shows it passed through Duplicate Detection and Invoice Threshold policy checks. When the Threshold Probe test sends invoices at £999, £1,000, and £1,001, it verifies the boundary is precise — not "around a thousand" but exactly at the threshold.
Feedback Echo — a synthetic support ticket with known characteristics was submitted and correctly classified as a "bug." The Chaos & Threat scenarios go further: Duplicate Invoice tests double-payment detection, Astroturfing floods Product with 20 fake requests to test coordinated campaign detection, and Self-Injection tests whether the system rejects attempts to modify its own configuration.
"What happens if a test fails in production?"
The alert goes directly to human principals — not routed through any agent function. This is critical: if an agent function is misbehaving, you can't rely on it to pass along the alert about its own misbehaviour. Evalunator's alert channel bypasses the entire agent layer and reaches the humans directly.
The Audit Log
Switch to the Audit view. Every action, policy evaluation, and outcome is recorded here. This is the ground truth of the system — the immutable record of what actually happened.
Notice the EVAL markers. Every synthetic transaction injected by Evalunator is clearly tagged. There's never ambiguity about whether an event was real or a test. This is essential for maintaining trust in the audit log — you need to know what's real and what's a probe.
Use the filter tabs to narrow by function — Engineering, Product, Operations, Compliance, Finance — or toggle "Eval Included" to see only Evalunator's synthetic activity. In a real system, this is what a compliance review looks like: not "trust me, it worked" but "here's the timestamped, machine-readable proof."
"Who sees the audit log in production?"
Human principals and Evalunator (read-only). Agent functions can write to the log but cannot read it in full — they only see their own function's events. This prevents any single agent from having a system-wide view that could be used to game the audit.
What This Means
Let's step back and connect the dots. There's a deliberate mirror between the N8R product and the Evalunator framework:
| N8R (PRODUCT) | EVALUNATOR (AUDIT) | |
|---|---|---|
| Watches | LLM interactions in customer pipelines | Agent operations in company workflows |
| Method | Inspects prompts, evaluates responses | Injects canaries, verifies outcomes |
| Position | External to the LLM, inline in pipeline | External to the company, observing via scoped access |
| Catches | Prompt injections, policy violations | Drift, gate failures, cascading errors |
N8R watches LLMs in customer pipelines; Evalunator watches agents in company workflows. The same pattern — an independent observer with scoped access, injecting probes and verifying outcomes — applied at two different scales. Being a security product means the company's own agent operations must be exemplary. We're eating our own cooking.
The Strangler Pattern. The simulation isn't a throwaway prototype. Its architecture maps directly to production. The plan is to agentize functions one at a time, starting with the most automatable (Engineering or Operations). Replace a simulated function processor with real agent logic; keep the policy engine and audit log identical. The simulation becomes the staging environment, then the production system, one function at a time.
This is the strangler pattern: the new system grows around the old one until nothing remains of the original. No big-bang migration, no "rewrite from scratch." Each step is testable, reversible, and independently verifiable by Evalunator.
"Isn't this marking your own homework?"
It would be, if Evalunator were part of the same system. But structural independence is the whole point: separate infrastructure, separate cloud account, potentially separate LLM provider, alerts that bypass agent functions entirely. Think of it as an external auditor with a standing invitation — not an employee checking their own work, but an independent party with continuous, scoped access and no ability to alter what they observe.
Explore Yourself
You've seen the architecture and the proof. Now it's your turn. Both the Design Model and the Simulation are fully interactive — click into any function, run individual scenarios, inject your own events, and explore the audit log. There's no wrong way to do it.