Foundations
TensorOne Evals
Our internal evaluation framework, TensorOne Evals, is used to benchmark and stress-test agents, models, and chain-based systems in real-world, failure-prone scenarios.
In contrast to traditional evaluation techniques that only measure input-output accuracy, TensorOne Evals prioritizes robustness, reasoning quality, latency, and system resilience across dynamic, multi-agent environments.
Key Differentiators
Traditional evaluations fall short when applied to complex systems. TensorOne Evals introduces:
- Multi-hop reasoning traceability
- Logging of fallbacks, retries, and failure handling
- Fine-grained latency measurement across steps
- Structural, schema, and tone validation of outputs
- Prompt mutation for stress and edge-case simulation
Evaluation Modes
1. Scenario-Based Evaluation
Each test is defined as a structured scenario with:
- Task intent and context
- Complexity score
- Target traits for output (e.g., tone, reasoning depth)
```json
{
  "scenario": "Summarize a hostile customer review in neutral language",
  "expected_traits": ["accuracy", "politeness", "emotional neutrality"]
}
```
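For illustration, a scenario like this can be scored by mapping each expected trait to a checker function and running the checkers over a model output. The sketch below is a minimal example under that assumption; the keyword-level heuristics and function names are placeholders, not the framework's actual trait validators.

```python
from typing import Callable

# Placeholder trait checkers; TensorOne Evals' real validators are richer
# than these keyword heuristics.
TRAIT_CHECKS: dict[str, Callable[[str], bool]] = {
    "accuracy": lambda out: len(out.strip()) > 0,
    "politeness": lambda out: "!" not in out,
    "emotional neutrality": lambda out: not any(
        w in out.lower() for w in ("awful", "terrible", "hate")
    ),
}

def evaluate_scenario(output: str, expected_traits: list[str]) -> dict[str, bool]:
    """Return a per-trait pass/fail map for a single model output."""
    return {trait: TRAIT_CHECKS[trait](output) for trait in expected_traits}

# Example: score a candidate summary against the scenario above.
results = evaluate_scenario(
    "The customer reported repeated delays and asked for a refund.",
    ["accuracy", "politeness", "emotional neutrality"],
)
print(results)  # {'accuracy': True, 'politeness': True, 'emotional neutrality': True}
```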
2. Agent Behavior Evaluation
Track internal agent dynamics:
- Distribution of message types (inform, assert, escalate, etc.)
- Length and depth of reasoning chains
- Loop detection and resolution metrics
- Output consistency across repeated runs
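A minimal sketch of two of these signals, loop detection and cross-run output consistency, is shown below; the window-based repetition heuristic and exact-match consistency measure are assumptions for illustration, not the metrics' actual definitions.

```python
from collections import Counter

def detect_loops(messages: list[str], window: int = 2) -> int:
    """Count immediately repeated message windows in an agent chain (toy heuristic)."""
    loops = 0
    for i in range(len(messages) - 2 * window + 1):
        if messages[i:i + window] == messages[i + window:i + 2 * window]:
            loops += 1
    return loops

def output_consistency(outputs: list[str]) -> float:
    """Fraction of repeated runs that produced the modal output (exact string match)."""
    if not outputs:
        return 0.0
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)

# Example: a chain that repeats the same message twice in a row counts as one loop.
print(detect_loops(["plan", "ask", "ask", "answer"], window=1))  # 1
print(output_consistency(["summary A", "summary A", "summary B"]))  # ≈0.67
```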
3. Fallback and Redundancy Testing
Simulates failure scenarios:
- Endpoint unavailability
- Schema violations
- Timeout conditions
- Memory inconsistencies
Fallback handling is logged with routing data via MCP, showing which models responded and why.
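As a rough illustration of how failure injection can work in principle, the wrapper below forces a chosen failure mode on an agent step; it is a hedged sketch, not the framework's actual injection or MCP routing mechanism.

```python
import random

def with_failure_injection(step_fn, failure_type: str, rate: float = 1.0):
    """Wrap an agent step so it fails in a controlled way during an eval run."""
    def wrapped(payload):
        if random.random() < rate:
            if failure_type == "endpoint_unavailable":
                raise ConnectionError("injected: endpoint unavailable")
            if failure_type == "timeout":
                raise TimeoutError("injected: step exceeded its time budget")
            if failure_type == "schema_violation":
                # Return a payload the downstream schema check should reject.
                return {"unexpected_field": "injected malformed output"}
        return step_fn(payload)
    return wrapped

# Example: force a critic-style step to emit schema-violating output.
critic = with_failure_injection(lambda p: {"critique": "looks fine"}, "schema_violation")
```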
Metrics Captured
| Metric | Description |
|---|---|
| latency_ms | Time taken per step or for the full chain |
| retry_count | Number of failed steps that were retried |
| output_valid | Boolean indicating schema/type compliance |
| deviation_score | Difference from the expected output format |
| fallback_triggered | Whether fallback logic was activated |
| agent_loops | Count of repetitive patterns detected in agent chains |
All metrics are thread-scoped, timestamped, and agent-tagged for traceability.
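To make the scoping concrete, a single captured record might look like the hypothetical entry below; the scoping fields (thread_id, agent, timestamp) and the example values are assumptions, not the framework's actual log schema.

```python
from datetime import datetime, timezone

# Hypothetical shape of one captured record; scoping fields and values are
# illustrative, not the real log schema.
sample_record = {
    "thread_id": "thread-042",
    "agent": "summarizer",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "latency_ms": 850,
    "retry_count": 1,
    "output_valid": True,
    "deviation_score": 0.12,
    "fallback_triggered": False,
    "agent_loops": 0,
}
```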
Tooling and Infrastructure
- Eval Orchestrator – CLI/SDK tool to run scenarios against live endpoints
- Mutation Engine – Applies structural, semantic, and tone perturbations
- Eval Viewer – Internal UI for filtering and inspecting results
- LogSync – Syncs logs to analytics platforms (e.g., BigQuery, Supabase)
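To give a feel for what the Mutation Engine's perturbation categories mean in practice, here is a toy sketch of structural, semantic, and tone mutations; the transformations shown are deliberately simple placeholders, not the engine's actual rules.

```python
import random

def mutate_prompt(prompt: str, mode: str = "structural") -> str:
    """Toy perturbations in the spirit of the Mutation Engine (illustrative only)."""
    words = prompt.split()
    if mode == "structural" and len(words) > 2:
        # Swap two adjacent words to probe sensitivity to ordering.
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)
    if mode == "semantic":
        # Flip a key requirement to check that downstream validation catches it.
        return prompt.replace("neutral", "sarcastic")
    if mode == "tone":
        # Add pressure to the request to stress tone handling.
        return prompt + " Respond immediately, this is urgent."
    return prompt

# Example: stress the earlier summarization scenario.
print(mutate_prompt("Summarize a hostile customer review in neutral language", "tone"))
```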
Example: Chain Robustness Eval
Scenario: Condense a multi-agent conversation into a structured bullet list.
```yaml
agents:
  - planner
  - researcher
  - critic
  - summarizer
failure_injection:
  step: 'critic'
  type: 'schema_violation'
expected_behavior: 'Fallback or reroute to re-critique'
```
Outcome:
- Retry triggered successfully
- Total latency: 4200 ms
- Output validated and structured
- Route: gpt-4 → claude-3 → internal-fallback
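The outcome above can be asserted against the scenario's expectations in a post-run check. The snippet below is a hypothetical sketch; the result fields mirror the metrics table, and the latency budget is an arbitrary placeholder.

```python
# Hypothetical post-run assertions; field names follow the metrics table above,
# and the latency budget is an arbitrary placeholder, not a real threshold.
result = {
    "fallback_triggered": True,
    "retry_count": 1,
    "output_valid": True,
    "latency_ms": 4200,
    "route": ["gpt-4", "claude-3", "internal-fallback"],
}

assert result["fallback_triggered"], "schema violation should trigger fallback"
assert result["output_valid"], "final output should still satisfy the schema"
assert result["latency_ms"] < 10_000, "chain should recover within the latency budget"
```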
Why It Exists
- Standard benchmarks don’t measure agent coordination or failure handling
- Open-source tools lack support for chained reasoning workflows
- Real-world systems require observability under degraded conditions
TensorOne Evals closes the gap between development and deployment readiness - ensuring systems behave reliably not only when they work, but when they don’t.