
TensorOne Evals

TensorOne Evals is our internal evaluation framework for benchmarking and stress-testing agents, models, and chain-based systems in real-world, failure-prone scenarios.

In contrast to traditional evaluation techniques that consider only input-output accuracy, TensorOne Evals prioritizes robustness, reasoning quality, latency, and system resilience across dynamic, multi-agent environments.


Key Differentiators

Traditional evaluations fall short when applied to complex systems. TensorOne Evals introduces:

  • Multi-hop reasoning traceability
  • Logging of fallbacks, retries, and failure handling
  • Fine-grained latency measurement across steps
  • Structural, schema, and tone validation of outputs
  • Prompt mutation for stress and edge-case simulation

Evaluation Modes

1. Scenario-Based Evaluation

Each test is defined as a structured scenario with:

  • Task intent and context
  • Complexity score
  • Target traits for output (e.g., tone, reasoning depth)

Example scenario definition:

{
  "scenario": "Summarize a hostile customer review in neutral language",
  "expected_traits": ["accuracy", "politeness", "emotional neutrality"]
}
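
As a minimal sketch, a scenario like the one above could be scored against a model output with simple placeholder trait checkers. The checkers below are illustrative stand-ins, not the trait validators actually used by TensorOne Evals:

import json

# Placeholder trait checkers for illustration only; real trait validation is
# more involved than these heuristics.
TRAIT_CHECKS = {
    "accuracy": lambda text: len(text.strip()) > 0,
    "politeness": lambda text: not any(w in text.lower() for w in ("stupid", "useless")),
    "emotional neutrality": lambda text: "!" not in text,
}

def evaluate_scenario(scenario_json: str, model_output: str) -> dict:
    """Score a model output against the expected traits of a scenario."""
    scenario = json.loads(scenario_json)
    return {
        trait: TRAIT_CHECKS[trait](model_output) if trait in TRAIT_CHECKS else None
        for trait in scenario["expected_traits"]
    }

scenario_def = '{"scenario": "Summarize a hostile customer review in neutral language", "expected_traits": ["accuracy", "politeness", "emotional neutrality"]}'
print(evaluate_scenario(scenario_def, "The customer reported repeated delivery delays."))
# {'accuracy': True, 'politeness': True, 'emotional neutrality': True}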

2. Agent Behavior Evaluation

Tracks internal agent dynamics:

  • Distribution of message types (inform, assert, escalate, etc.)
  • Length and depth of reasoning chains
  • Loop detection and resolution metrics
  • Output consistency across repeated runs
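
As a rough sketch, two of these signals (message-type distribution and loop detection) could be computed from an agent trace as shown below. The trace format here is illustrative, not the framework's actual log schema:

from collections import Counter

def message_type_distribution(trace):
    """Count how often each message type (inform, assert, escalate, ...) appears."""
    return Counter(msg["type"] for msg in trace)

def count_repeated_windows(trace, window=2):
    """Count immediately repeated (agent, type) windows, a crude proxy for loops."""
    steps = [(msg["agent"], msg["type"]) for msg in trace]
    return sum(
        steps[i:i + window] == steps[i + window:i + 2 * window]
        for i in range(len(steps) - 2 * window + 1)
    )

trace = [
    {"agent": "planner", "type": "inform"},
    {"agent": "critic", "type": "escalate"},
    {"agent": "planner", "type": "inform"},
    {"agent": "critic", "type": "escalate"},
]
print(message_type_distribution(trace))   # Counter({'inform': 2, 'escalate': 2})
print(count_repeated_windows(trace))      # 1 -> possible loop between planner and critic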

3. Fallback and Redundancy Testing

Simulates failure scenarios:

  • Endpoint unavailability
  • Schema violations
  • Timeout conditions
  • Memory inconsistencies

Fallback handling is logged with routing data via MCP, showing which models responded and why.
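
A simplified sketch of this kind of failure injection, assuming a hypothetical handler chain. The routing metadata mirrors in spirit what gets logged via MCP, but it is not the actual log format:

def call_with_fallback(primary, fallbacks, payload, inject_failure=None):
    """Try the primary handler; on an injected or real failure, walk the fallback chain."""
    route = []
    for handler in [primary, *fallbacks]:
        route.append(handler.__name__)
        try:
            if inject_failure is not None and handler is primary:
                raise inject_failure  # simulate endpoint unavailability, timeout, etc.
            return {"output": handler(payload), "route": route,
                    "fallback_triggered": len(route) > 1}
        except Exception:
            continue
    return {"output": None, "route": route, "fallback_triggered": True}

def primary_model(payload):
    return f"summary of {payload}"

def backup_model(payload):
    return f"fallback summary of {payload}"

result = call_with_fallback(primary_model, [backup_model], "review #42",
                            inject_failure=TimeoutError("simulated timeout"))
print(result["route"], result["fallback_triggered"])
# ['primary_model', 'backup_model'] True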


Metrics Captured

Metric               Description
latency_ms           Time taken per step or full chain
retry_count          Number of failed → retried steps
output_valid         Boolean indicating schema/type compliance
deviation_score      Difference from expected output format
fallback_triggered   Indicates if fallback logic was activated
agent_loops          Repetitive patterns in agent chains

All metrics are thread-scoped, timestamped, and agent-tagged for traceability.
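
For illustration, a single metric record with this scoping might look like the following. The fields beyond the table above (thread_id, agent, timestamp) are assumptions about how thread-scoping, timestamping, and agent-tagging are expressed, not the exact schema:

from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class EvalMetric:
    thread_id: str            # thread-scoped
    agent: str                # agent-tagged
    timestamp: str            # timestamped (UTC ISO-8601)
    latency_ms: int
    retry_count: int
    output_valid: bool
    deviation_score: float
    fallback_triggered: bool
    agent_loops: int

record = EvalMetric(
    thread_id="thread-8f2c",
    agent="summarizer",
    timestamp=datetime.now(timezone.utc).isoformat(),
    latency_ms=812,
    retry_count=1,
    output_valid=True,
    deviation_score=0.04,
    fallback_triggered=False,
    agent_loops=0,
)
print(asdict(record))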


Tooling and Infrastructure

  • Eval Orchestrator – CLI/SDK tool to run scenarios against live endpoints
  • Mutation Engine – Applies structural, semantic, and tone perturbations (sketched after this list)
  • Eval Viewer – Internal UI for filtering and inspecting results
  • LogSync – Syncs logs to analytics platforms (e.g., BigQuery, Supabase)
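
A toy sketch of the kinds of perturbations the Mutation Engine applies. These three functions are illustrative stand-ins, not the engine's actual transforms:

import random

def structural_mutation(prompt: str) -> str:
    """Shuffle sentence order to stress positional assumptions."""
    sentences = [s for s in prompt.split(". ") if s]
    random.shuffle(sentences)
    return ". ".join(sentences)

def semantic_mutation(prompt: str) -> str:
    """Swap in near-synonyms to test robustness to paraphrase."""
    return prompt.replace("summarize", "condense").replace("review", "feedback")

def tone_mutation(prompt: str) -> str:
    """Push the wording toward a more demanding tone."""
    return prompt.upper() + " RIGHT NOW."

prompt = "summarize the customer review. keep the tone neutral"
for mutate in (structural_mutation, semantic_mutation, tone_mutation):
    print(mutate.__name__, "->", mutate(prompt))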

Example: Chain Robustness Eval

Scenario: Condense a multi-agent conversation into a structured bullet list.

agents:
  - planner
  - researcher
  - critic
  - summarizer
failure_injection:
  step: 'critic'
  type: 'schema_violation'
expected_behavior: 'Fallback or reroute to re-critique'

Outcome:

  • Retry triggered successfully
  • Total latency: 4200ms
  • Output validated and structured
  • Route: gpt-4 → claude-3 → internal-fallback
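
As a minimal sketch, the structural validation behind "output validated and structured" for this scenario could look like the check below; the real validators used by TensorOne Evals are more thorough:

def is_structured_bullet_list(output: str, min_items: int = 2) -> bool:
    """Accept only outputs where every non-empty line is a bullet item."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    bullets = [line for line in lines if line.startswith(("-", "*", "•"))]
    return len(bullets) >= min_items and len(bullets) == len(lines)

summary = """- planner proposed a three-step outline
- researcher supplied two supporting sources
- critic flagged a schema issue, resolved after the retry
- summarizer produced the final condensed list"""

print(is_structured_bullet_list(summary))  # True -> output_valid recorded as True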

Why It Exists

  • Standard benchmarks don’t measure agent coordination or failure handling
  • Open-source tools lack support for chained reasoning workflows
  • Real-world systems require observability under degraded conditions

TensorOne Evals closes the gap between development and deployment readiness, ensuring systems behave reliably not only when they work, but also when they don’t.
