TensorOne Evals

TensorOne Evals is our in-house evaluation system built to benchmark and stress-test the behavior of models, agents, and chains under structured, adversarial, and failure-prone conditions.

It’s not just about accuracy. It’s about resilience, consistency, reasoning depth, latency, and fallback success in high-complexity, multi-agent environments.


What Makes It Different?

Traditional evals focus on clean input → output comparisons. That’s not enough in dynamic agent systems. TensorOne Evals goes further by:

  • Tracking multi-hop chains and reasoning paths
  • Logging fallback logic and retries
  • Measuring latency at every node
  • Validating structure, type, and tone of model outputs
  • Stress-testing chains with adversarial mutations

Evaluation Modes

1. Scenario-Based Evals

Each test is defined as a realistic scenario, not a static prompt. We include:

  • User intent
  • Task complexity score
  • Expected outcome dimensions (e.g., reasoning steps, tone, schema)
{
  "scenario": "Summarize a hostile customer review into a neutral tone",
  "expected_traits": ["accuracy", "politeness", "emotional detachment"]
}
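
In code, a scenario record might look something like the sketch below; everything beyond scenario and expected_traits (user_intent, complexity, expected_schema) is an illustrative assumption rather than the actual eval schema.

# A hedged sketch of how a scenario record could be represented in code.
# Field names other than "scenario" and "expected_traits" are illustrative
# assumptions, not the actual TensorOne Evals schema.
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class Scenario:
    scenario: str                          # human-readable task description
    user_intent: str                       # what the user is trying to achieve
    complexity: int                        # task complexity score, e.g. 1-5
    expected_traits: list[str] = field(default_factory=list)  # tone/behavior dimensions
    expected_schema: dict | None = None    # structural expectations for the output

summarize_review = Scenario(
    scenario="Summarize a hostile customer review into a neutral tone",
    user_intent="Extract the factual complaint without the hostility",
    complexity=2,
    expected_traits=["accuracy", "politeness", "emotional detachment"],
)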

2. Agent Behavior Evaluation

We measure:

  • Message type distribution (inform, assert, escalate, etc.)
  • Argument chain length
  • Loop detection and resolution
  • Self-consistency across repeated runs
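
To make the loop-detection point concrete, here is a minimal sketch of how repeated agent messages could be flagged in a transcript; the transcript shape and the repetition threshold are assumptions, not the production detector.

# A minimal sketch of loop detection over an agent transcript.
# The transcript format (list of {"agent": ..., "content": ...} dicts) and the
# normalization/threshold are illustrative assumptions, not the real detector.
from collections import Counter

def detect_agent_loops(transcript: list[dict], threshold: int = 3) -> dict[str, int]:
    """Return agents whose (near-)identical messages repeat `threshold` or more times."""
    counts = Counter(
        (msg["agent"], " ".join(msg["content"].lower().split()))  # normalize case/whitespace
        for msg in transcript
    )
    loops: dict[str, int] = {}
    for (agent, _text), n in counts.items():
        if n >= threshold:
            loops[agent] = max(loops.get(agent, 0), n)
    return loops

transcript = [
    {"agent": "critic", "content": "Please revise the summary."},
    {"agent": "summarizer", "content": "Here is the revised summary."},
    {"agent": "critic", "content": "Please revise the summary."},
    {"agent": "summarizer", "content": "Here is the revised summary."},
    {"agent": "critic", "content": "Please revise the summary."},
]
print(detect_agent_loops(transcript))  # {'critic': 3}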

3. Fallback & Redundancy Testing

Each eval can trigger:

  • Endpoint failure
  • Schema violations
  • Timeout simulations
  • Memory inconsistency

MCP handles fallback routing, and eval logs show which models succeeded and why.
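
The sketch below shows the general shape of that flow, assuming a hypothetical call_model helper and made-up model names: an injected schema violation on the primary model triggers a logged fallback to the next model in the route. It illustrates the pattern, not the MCP implementation.

# A hedged sketch of fallback routing under an injected failure.
# `call_model` and the model names are hypothetical placeholders; the real
# routing lives in MCP and is driven by eval configuration, not this loop.
import time

class SchemaViolation(Exception):
    """Raised when a model's output fails type/schema validation."""

def call_model(model: str, prompt: str) -> str:
    # Placeholder: simulate the injected failure on the primary model.
    if model == "primary-model":
        raise SchemaViolation("output did not match the expected schema")
    return f"[{model}] valid, schema-conforming answer"

def run_with_fallback(prompt: str, route: list[str]) -> dict:
    log = {"fallback_triggered": False, "attempts": []}
    for model in route:
        start = time.monotonic()
        try:
            output = call_model(model, prompt)
            log["attempts"].append({"model": model, "ok": True,
                                    "latency_ms": int((time.monotonic() - start) * 1000)})
            log["output"] = output
            return log
        except (SchemaViolation, TimeoutError) as err:
            log["fallback_triggered"] = True
            log["attempts"].append({"model": model, "ok": False, "error": str(err)})
    raise RuntimeError("all models in the route failed")

print(run_with_fallback("Summarize the thread", ["primary-model", "backup-model"]))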


Key Metrics

TensorOne Evals captures and logs:

  • latency_ms – Time to complete a step or the full chain
  • retry_count – Number of failed → retried steps
  • output_valid – Whether the output passed type/schema checks
  • deviation_score – Distance from the expected answer format
  • fallback_triggered – Whether a fallback model was used
  • agent_loops – Loop patterns detected in agent replies

All metrics are timestamped, agent-tagged, and stored per-thread.
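
For illustration, a single per-step record could look roughly like the following; the values and identifiers are made up, not real log output.

# An illustrative (not real) metrics record: timestamped, agent-tagged, and
# keyed by thread, mirroring the metric names listed above.
from datetime import datetime, timezone

record = {
    "thread_id": "thread-000",            # hypothetical identifier
    "agent": "summarizer",                # agent that produced the step
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "latency_ms": 812,
    "retry_count": 1,
    "output_valid": True,
    "deviation_score": 0.07,
    "fallback_triggered": False,
    "agent_loops": 0,
}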


Tooling and Infra

  • Eval Orchestrator – Runs tests via CLI or SDK against live endpoints (an SDK-style sketch follows this list)
  • Mutation Engine – Applies adversarial mutations to prompt structure, memory, or tone
  • Eval Viewer – Internal UI to filter and inspect runs by model, agent, or scenario
  • LogSync – Mirrors eval logs to external analytics systems (BigQuery, Supabase, etc.)
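
In SDK terms, a run might look like the sketch below. The class and method names (EvalOrchestrator.run, MutationEngine.mutate) are stand-in stubs written for illustration, not the documented API.

# A hypothetical sketch of SDK-driven orchestration. EvalOrchestrator and
# MutationEngine here are stand-in stubs; the real classes, names, and
# signatures are not documented on this page.

class MutationEngine:
    def mutate(self, scenario: dict, kind: str) -> dict:
        # e.g. rewrite the prompt in a hostile tone, drop a memory entry, etc.
        return {**scenario, "mutation": kind}

class EvalOrchestrator:
    def run(self, scenario: dict, agents: list[str]) -> dict:
        # The real orchestrator executes against live endpoints and records
        # per-step metrics; this stub only shows the calling shape.
        return {"scenario": scenario["scenario"], "agents": agents,
                "metrics": {"latency_ms": 0, "output_valid": True}}

scenario = {
    "scenario": "Summarize a hostile customer review into a neutral tone",
    "expected_traits": ["accuracy", "politeness", "emotional detachment"],
}
result = EvalOrchestrator().run(MutationEngine().mutate(scenario, kind="tone_shift"),
                                agents=["planner", "summarizer"])
print(result["metrics"])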

Examples

Example Eval: Chain Robustness

Goal: Evaluate whether a 4-agent reasoning chain can recover from a failed summarization step.

scenario: 'Condense multi-agent conversation into a bullet list'
agents: [planner, researcher, critic, summarizer]
failure_injection:
  step: 'critic'
  type: 'schema_violation'
expected_behavior: 'Fallback or reroute to re-critique'

Outcome:

  • Retry triggered
  • Latency: 4200ms
  • Output: Valid
  • Route: gpt-4 → claude-3 → internal-fallback
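
An outcome like this can be checked mechanically. The sketch below mirrors the bullets above in a plain dict and asserts the expected recovery behavior; the field names and latency budget are illustrative assumptions.

# Illustrative check of the chain-robustness outcome above; the outcome dict
# mirrors the reported bullets and is not a real log record.
outcome = {
    "retry_triggered": True,
    "latency_ms": 4200,
    "output_valid": True,
    "route": ["gpt-4", "claude-3", "internal-fallback"],
}

def chain_recovered(outcome: dict, max_latency_ms: int = 10_000) -> bool:
    """Pass if the chain retried or rerouted, produced valid output, and stayed under budget."""
    rerouted = outcome["retry_triggered"] or len(outcome["route"]) > 1
    return rerouted and outcome["output_valid"] and outcome["latency_ms"] <= max_latency_ms

assert chain_recovered(outcome)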

Why We Built This

  • Static evals can't track real-time chain performance
  • Open-source tools don't capture agent behaviors or state transitions
  • We need confidence in fallback, retries, and degraded performance modes

TensorOne Evals bridges the gap between AI quality testing and runtime observability, helping us ship models and agents that are robust in production.


TensorOne Evals isn’t just about test coverage. It’s about trust.
Because when your agents are in the wild, you need more than accuracy—you need accountability.

