Why systematic troubleshooting matters
When an agent fails, the wrong next step is guessing. The right next step is to review exactly what the agent saw, what it tried to do, and where the breakdown happened.
A systematic troubleshooting process also protects trust. Teams can see whether the problem came from missing context, lost permissions, a changed external tool, or a task that should remain approval-first.
Troubleshooting process
1. Start with the last failed run
Review the activity history, proposed action, tool response, and any approval notes. You need the full sequence, not just the final error line.
2. Check whether the agent had the right context
Look for missing fields, stale records, conflicting instructions, or incomplete attachments. Many agent failures begin as data problems.
3. Check permissions and tool access
Confirm the agent can still reach the system it needs and that the account still has the right access. Revoked permissions and expired credentials are common after setup changes.
4. Separate bad decisions from bad tool responses
If the agent chose the wrong action, tighten the instructions or examples. If the action was right but the tool rejected it, fix the external connection, data, or destination rules.
5. Replay safely with approvals still on
Run the same task again with approval required before every action. Watch whether the failure repeats. If the replay succeeds, note what changed between the failed and successful runs.
6. Write down the fix and the new guardrail
Capture what broke, how you fixed it, and what should prevent it next time. That might mean better instructions, clearer approvals, tighter permissions, or stronger input checks.
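The checks in steps 2 through 4 can be sketched as a small triage function. This is a minimal illustration, not a product feature: the `FailedRun` record, its field names, and the use of HTTP-style status codes (401/403 for access problems) are all assumptions made for the example, and your activity history may expose this information differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailedRun:
    """Hypothetical record of one failed run; field names are illustrative."""
    inputs: dict                    # the data the agent saw
    required_fields: list           # fields the task needs to proceed
    proposed_action: str            # what the agent tried to do
    expected_action: Optional[str]  # what a reviewer says it should have done
    tool_status: Optional[int]      # assumed HTTP-style status from the tool


def missing_fields(run):
    """Return required fields that are absent or empty in the inputs."""
    return [f for f in run.required_fields
            if run.inputs.get(f) in (None, "", [])]


def triage(run):
    """Classify a failed run: context, then permissions, then decision, then tool."""
    gaps = missing_fields(run)
    if gaps:
        return "context: missing " + ", ".join(gaps)
    if run.tool_status in (401, 403):
        return "permissions: reconnect the tool or restore access"
    if run.expected_action is not None and run.proposed_action != run.expected_action:
        return "decision: tighten instructions or examples"
    if run.tool_status is not None and run.tool_status >= 400:
        return "tool: fix the connection, data, or destination rules"
    return "unclear: replay with approvals on"


# Example: the run had an empty required field, so context is checked first.
run = FailedRun(
    inputs={"email": "a@example.com", "amount": ""},
    required_fields=["email", "amount"],
    proposed_action="send_invoice",
    expected_action="send_invoice",
    tool_status=None,
)
print(triage(run))  # prints "context: missing amount"
```

The order of the checks mirrors the process: data problems are ruled out before access, and access before judging the agent's decision, so that one fix at a time can be tested.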
Common issues and fixes
| Issue | Fix |
|---|---|
| Missing or incomplete data | Ask for the required fields before approving the run. |
| Revoked permissions | Reconnect the tool or restore the right access before retrying. |
| Destination system rules changed | Update the task or field mapping to match the current rules. |
| Agent chooses the wrong action | Add better instructions and examples for that case. |
| External tool is slow or unavailable | Pause the run and retry only when the tool is healthy. |
| Team cannot tell why the agent failed | Make the plan, action, and error state easier to review. |
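For the slow or unavailable tool case, "retry only when the tool is healthy" can be sketched as a health-gated retry. The function name `retry_when_healthy` and the `is_healthy` check are hypothetical; how you actually probe a tool's health depends on the tool. The sketch assumes a simple linear backoff between checks.

```python
import time


def retry_when_healthy(action, is_healthy, max_attempts=3,
                       wait_seconds=30.0, sleep=time.sleep):
    """Run `action` only once `is_healthy()` reports True.

    Returns (True, result) on success, or (False, None) if the tool never
    became healthy within `max_attempts` checks. Exceptions raised by
    `action` itself are not caught here; they propagate to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        if is_healthy():
            return True, action()
        if attempt < max_attempts:
            # Linear backoff: wait longer after each unhealthy check.
            sleep(wait_seconds * attempt)
    return False, None
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. The action is never attempted against an unhealthy tool, so a retry cannot pile new failures on top of an outage.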
Best practices
- Start with the last failed run. Review what actually happened before you change anything.
- Keep approvals on while you investigate. Do not let a shaky task keep acting on its own.
- Change one thing at a time. Small, clear tests make the cause obvious faster.
- Fix the root cause, not just the symptom. A retry is not a fix if the same condition will happen again.
- Turn repeated failures into better guardrails. If a case keeps breaking, update instructions, approvals, or input checks.
Frequently asked questions
Where should we start when an agent fails?
Start with the last failed run. Review the plan, proposed action, tool response, input data, and error message together. Most failures come from missing context, revoked access, or a changed rule in the destination system.
How do we differentiate between agent errors and external system errors?
If the same action also fails when you perform it manually, outside the agent, it is usually a tool or system issue. If the tool works normally but the agent chooses the wrong action or lacks context, tighten the instructions or fix the data it is using.
What if we cannot reproduce the failure?
Look at the activity history and approval trail. Transient failures often come from timing, permissions, or incomplete records. If you cannot replay it, capture the exact inputs and keep approvals on until the pattern is clear.
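One way to "capture the exact inputs" is to store a deterministic snapshot with a fingerprint, then compare it against a later run. The sketch below is an assumption about how you might do this yourself with plain dictionaries; the function names `snapshot` and `changed_fields` are hypothetical, and only top-level fields are compared.

```python
import hashlib
import json


def snapshot(inputs):
    """Serialize inputs deterministically and fingerprint them."""
    # sort_keys makes the output independent of key order;
    # default=str covers values such as dates that JSON cannot encode.
    payload = json.dumps(inputs, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"payload": payload, "sha256": digest}


def changed_fields(failed_inputs, later_inputs):
    """Return top-level keys whose values differ between two runs.

    A key that is missing is treated the same as a key set to None.
    """
    keys = set(failed_inputs) | set(later_inputs)
    return sorted(k for k in keys
                  if failed_inputs.get(k) != later_inputs.get(k))


failed = {"status": "open", "owner": None, "amount": 10}
later = {"amount": 10, "status": "closed", "owner": None}
print(changed_fields(failed, later))  # prints ['status']
```

If two runs share the same fingerprint but only one failed, the inputs are unlikely to be the cause, which points the investigation toward timing or permissions instead.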
Should we fix failures immediately or batch them?
Fix anything that blocks a core task or creates risky suggestions right away. Lower-impact edge cases can be grouped into a review list once the main task is stable.