
AI Agent Testing Framework

Quick answer

AI agents reason about live data. That means testing requires more than checking whether a function returns the right value. This framework covers sandbox testing, permission validation, approval gate checks, and regression testing for agents deployed across your business stack.

8 min read · Published 20 March 2026 · Last updated 20 March 2026

Why AI agent testing is different from software testing

Traditional software tests check whether a function returns a deterministic output given a fixed input. AI agent tests need to check something more: whether the agent's reasoning about live, variable data produces safe, correct, and bounded proposals.

The same agent can produce different outputs on the same workflow if the data has changed between runs. Testing needs to cover the reasoning quality and the approval gates, not just the function return value.
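Because the same workflow can legitimately produce different proposals on different days, assertions should check properties of a proposal (in-scope fields, bounded values, a valid target) rather than exact equality with a golden output. A minimal sketch, assuming a hypothetical proposal shape; the field names and bounds are illustrative, not a real framework API:

```python
# Property-based checks on agent proposals: instead of asserting an exact
# output, assert that any proposal stays inside its declared bounds.
# The proposal dict shape here is a hypothetical example.

ALLOWED_FIELDS = {"status", "owner", "discount"}  # fields the agent may write
MAX_DISCOUNT = 0.20                               # example business bound

def check_proposal(proposal: dict) -> list[str]:
    """Return a list of violations; an empty list means the proposal is safe."""
    violations = []
    changes = proposal.get("changes", {})
    for field in changes:
        if field not in ALLOWED_FIELDS:
            violations.append(f"out-of-scope field: {field}")
    if "discount" in changes and changes["discount"] > MAX_DISCOUNT:
        violations.append(f"discount {changes['discount']} exceeds cap")
    if not proposal.get("target_id"):
        violations.append("missing target record")
    return violations

safe = {"target_id": "rec_123", "changes": {"status": "won", "discount": 0.10}}
unsafe = {"target_id": "", "changes": {"discount": 0.50, "internal_notes": "x"}}

assert check_proposal(safe) == []
assert len(check_proposal(unsafe)) == 3
```

The same checks can run in sandbox mode against every proposal the agent emits, regardless of how the live data has shifted between runs.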

Testing layers

| Layer | What to test | When to run |
| --- | --- | --- |
| Instruction testing | Verify the agent interprets its instructions correctly with sample inputs | Before connecting to any live stack |
| Permission testing | Verify the agent can read the data it needs and cannot access data outside its scope | After connecting the stack, before enabling writes |
| Approval gate testing | Verify proposed actions appear correctly in the approval queue and can be approved or rejected | After permission testing, before live runs |
| Write testing | Verify writes execute correctly and produce expected outcomes in the target stack | With a single test record before full production rollout |
| Edge case testing | Verify the agent handles missing data, unexpected input formats, and empty queues gracefully | Before full production rollout |
| Regression testing | Re-run the test suite after any change to instructions, permissions, or connected stacks | On every change to the agent or its stack connections |

Sandbox testing

Sandbox mode reads live data and runs the full reasoning pipeline, but disables writes. Every proposed action is visible in the approval queue without executing. This is the primary testing environment before production.

What to verify in sandbox mode:

  • Agent reads the correct objects and fields from the connected stack
  • Proposed actions match the intended workflow (correct fields, correct values, correct targets)
  • Agent handles empty queues and missing data without errors
  • Proposed actions appear in the approval queue with enough context for a reviewer to approve or reject
  • Rejected actions are handled gracefully (agent logs the rejection, does not retry silently)
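One way to sketch the "writes disabled" guarantee is to swap the write client for a recorder, so the reasoning pipeline runs normally while every proposed action lands in a queue instead of the stack. All interfaces below are hypothetical stand-ins, not a vendor API:

```python
# Minimal sandbox harness sketch: the write client is replaced with a
# recorder, so live reads and reasoning run as normal while every proposed
# write is captured for review instead of executed.

class SandboxWriter:
    """Records proposed actions instead of executing them."""
    def __init__(self):
        self.approval_queue = []
        self.writes_executed = 0  # stays at zero: write() only records

    def write(self, action: dict) -> None:
        self.approval_queue.append(action)  # visible for review, never applied

def run_agent(writer, records):
    """Stand-in for the agent loop: propose an update for each stale record."""
    for rec in records:
        if rec.get("status") == "stale":
            writer.write({"target_id": rec["id"],
                          "changes": {"status": "refreshed"}})

writer = SandboxWriter()
run_agent(writer, [{"id": "a1", "status": "stale"}, {"id": "a2", "status": "ok"}])

assert writer.writes_executed == 0      # nothing touched the live stack
assert len(writer.approval_queue) == 1  # one proposal awaiting review
assert writer.approval_queue[0]["target_id"] == "a1"
```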

Pre-production checklist

  • Instruction test passed: agent produces correct proposals on normal sample inputs
  • Permission test passed: agent can read required objects, cannot access out-of-scope data
  • Approval gate test passed: proposals appear in queue, approve and reject both work
  • Edge case test passed: missing data, empty queue, and malformed input handled without crash
  • Write test passed: single test record written correctly to target stack
  • Spend cap set and verified
  • Alert rules configured
  • Workflow owner briefed on approval process
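The checklist above can be enforced as a simple gate: every named check must pass before writes are enabled. A sketch with placeholder checks (the names mirror the checklist; the lambdas stand in for real test runs):

```python
# Pre-production gate sketch: all checks must pass before enabling writes.
# The check implementations here are hypothetical placeholders.

def run_preprod_gate(checks: dict) -> tuple[bool, list[str]]:
    """Run each check; return (ready, names of failed checks)."""
    failed = [name for name, check in checks.items() if not check()]
    return (len(failed) == 0, failed)

checks = {
    "instruction_test": lambda: True,
    "permission_test": lambda: True,
    "approval_gate_test": lambda: True,
    "edge_case_test": lambda: True,
    "write_test": lambda: True,
    "spend_cap_set": lambda: True,
    "alert_rules_configured": lambda: False,  # example: one item outstanding
}

ready, failed = run_preprod_gate(checks)
assert ready is False and failed == ["alert_rules_configured"]
```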

Regression testing

Run the full test suite again whenever any of the following change:

  • Agent instructions updated
  • Connected stack OAuth credentials rotated
  • Stack API version changed by the vendor
  • Approval gate configuration changed
  • New team member added as approver
  • Spend cap adjusted

The fastest regression test is a sandbox run followed by spot-checking the top 3 proposed actions against expected outputs. A full regression run reviews all test cases in the suite.
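The spot-check step can be sketched as a comparison of the top proposals against expected outputs; any mismatch flags the run for a full review. Shapes and names are illustrative assumptions:

```python
# Fast regression sketch: rerun sandbox, then compare the top 3 proposals
# against expected outputs and report mismatched positions.

def spot_check(proposals: list[dict], expected: list[dict],
               top_n: int = 3) -> list[int]:
    """Compare the first top_n proposals to expected outputs; return mismatches."""
    mismatches = []
    for i, (got, want) in enumerate(zip(proposals[:top_n], expected[:top_n])):
        if got != want:
            mismatches.append(i)
    return mismatches

proposals = [{"id": 1, "changes": {"status": "won"}},
             {"id": 2, "changes": {"status": "lost"}},
             {"id": 3, "changes": {"owner": "sam"}}]
expected  = [{"id": 1, "changes": {"status": "won"}},
             {"id": 2, "changes": {"status": "open"}},  # behaviour drifted
             {"id": 3, "changes": {"owner": "sam"}}]

assert spot_check(proposals, expected) == [1]  # second proposal needs review
```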

Frequently asked questions

What is sandbox mode and how does it work?

Sandbox mode runs the agent against live data but disables all writes. The agent reads your real systems, reasons about the data, and produces proposed actions exactly as it would in production. You can review every proposal without any risk of data being changed.

How many test cases do I need before going to production?

At minimum: one normal case (the agent's primary workflow functioning as expected), one edge case (missing or incomplete data), and one rejection case (an approver rejects a proposed action and the agent handles it correctly). For high-volume or compliance-sensitive workflows, expand the test suite to cover the top 10 input variations your team encounters.
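Those three minimum cases can be sketched as a small data-driven suite. The agent and rejection handler below are hypothetical stand-ins, not a real framework API:

```python
# Minimum pre-production suite: normal case, edge case, rejection case.

def agent(record):
    """Stand-in agent: propose a follow-up unless required data is missing."""
    if not record or "email" not in record:
        return {"action": "skip", "reason": "missing data"}
    return {"action": "propose_followup", "target": record["email"]}

def handle_rejection(proposal, log):
    """On rejection, log and stop; never retry silently."""
    log.append(f"rejected: {proposal['target']}")
    return None  # no retry

# Normal case: the primary workflow produces a proposal.
assert agent({"email": "a@example.com"})["action"] == "propose_followup"
# Edge case: missing data is handled without error.
assert agent({})["action"] == "skip"
# Rejection case: the rejection is logged and nothing is retried.
log = []
assert handle_rejection({"target": "a@example.com"}, log) is None
assert log == ["rejected: a@example.com"]
```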

Do I need to retest after changing the agent's instructions?

Yes. Any change to instructions, permissions, or connected stack configurations should trigger a regression test run. Even small instruction changes can produce different reasoning outputs on the same inputs.

What should I test when a connected stack updates its API?

Run permission tests first to confirm OAuth scopes still work. Then run your standard test suite to check read and write operations. Stack API changes often affect field names, pagination behaviour, or rate limits, so check for errors in the run trace after the test.
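The scope check can be sketched as a minimal read per required object, run before the full suite. The client and object names are hypothetical; the fake client simulates an API update that dropped one scope:

```python
# Post-update permission sketch: attempt one small read per required object
# and report which ones fail, before running the full test suite.

REQUIRED_READS = ["contacts", "deals", "activities"]

def check_scopes(client, objects):
    """Try a minimal read per object; return the ones that fail."""
    failures = []
    for obj in objects:
        try:
            client.read(obj, limit=1)
        except PermissionError:
            failures.append(obj)
    return failures

class FakeClient:
    """Simulates a stack whose API update dropped one read scope."""
    def read(self, obj, limit=1):
        if obj == "activities":
            raise PermissionError(obj)
        return []

assert check_scopes(FakeClient(), REQUIRED_READS) == ["activities"]
```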

Who should run agent tests?

The technical enabler who builds the agent runs the first test suite. Before going live, include the business owner of the workflow in the approval gate test so they can confirm the proposed actions match their expectations.