LLM test tools validate text output. But agents don't just generate text — they pick tools, handle failures, and process user data autonomously. One bad tool call → PII leak. One missed step → silent ...