Research
TensorOne Evals
TensorOne Evals is our in-house evaluation system built to benchmark and stress-test the behavior of models, agents, and chains—under structured, adversarial, and failure-prone conditions.
It’s not just about accuracy. It’s about resilience, consistency, reasoning depth, latency, and fallback success in high-complexity, multi-agent environments.
What Makes It Different?
Traditional evals focus on clean input → output comparisons. That’s not enough in dynamic agent systems. TensorOne Evals goes further by:
- Tracking multi-hop chains and reasoning paths
- Logging fallback logic and retries
- Measuring latency at every node
- Validating structure, type, and tone of model outputs
- Stress-testing chains with adversarial mutations
Evaluation Modes
1. Scenario-Based Evals
Each test is defined as a realistic scenario, not a static prompt. We include:
- User intent
- Task complexity score
- Expected outcome dimensions (e.g., reasoning steps, tone, schema)
{
"scenario": "Summarize a hostile customer review into a neutral tone",
"expected_traits": ["accuracy", "politeness", "emotional detachment"]
}
2. Agent Behavior Evaluation
We measure:
- Message type distribution (inform, assert, escalate, etc.)
- Argument chain length
- Loop detection and resolution
- Self-consistency across repeated runs
3. Fallback & Redundancy Testing
Each eval can trigger:
- Endpoint failure
- Schema violations
- Timeout simulations
- Memory inconsistency
MCP handles fallback routing, and eval logs show which models succeeded and why.
Key Metrics
TensorOne Evals captures and logs:
Metric | Description |
---|---|
latency_ms | Time to complete step or full chain |
retry_count | Number of failed → retried steps |
output_valid | Boolean if output passed type/schema checks |
deviation_score | Distance from expected answer format |
fallback_triggered | Whether a fallback model was used |
agent_loops | Loop patterns detected in agent replies |
All metrics are timestamped, agent-tagged, and stored per-thread.
Tooling and Infra
- Eval Orchestrator – Runs tests via CLI or SDK against live endpoints
- Mutation Engine – Applies stress changes to prompt structure, memory, or tone
- Eval Viewer – Internal UI to filter and inspect runs by model, agent, or scenario
- LogSync – Mirrors eval logs to external analytics systems (BigQuery, Supabase, etc.)
Examples
Example Eval: Chain Robustness
Goal: Evaluate whether a 4-agent reasoning chain can recover from a failed summarization step.
scenario: 'Condense multi-agent conversation into a bullet list'
agents: [planner, researcher, critic, summarizer]
failure_injection:
step: 'critic'
type: 'schema_violation'
expected_behavior: 'Fallback or reroute to re-critique'
Outcome:
- Retry triggered
- Latency: 4200ms
- Output: Valid
- Route: gpt-4 → claude-3 → internal-fallback
Why We Built This
- Static evals can't track real-time chain performance
- Open-source tools don't capture agent behaviors or state transitions
- We need confidence in fallback, retries, and degraded performance modes
TensorOne Evals bridges the gap between AI quality testing and runtime observability, helping us ship models and agents that are robust in production.
TensorOne Evals isn’t just about test coverage. It’s about trust.
Because when your agents are in the wild, you need more than accuracy—you need accountability.