TensorOne Evals

TensorOne Evals is our in-house evaluation system built to benchmark and stress-test the behavior of models, agents, and chains under structured, adversarial, and failure-prone conditions.

It’s not just about accuracy. It’s about resilience, consistency, reasoning depth, latency, and fallback success in high-complexity, multi-agent environments.


What Makes It Different?

Traditional evals focus on clean input → output comparisons. That’s not enough in dynamic agent systems. TensorOne Evals goes further by:

  • Tracking multi-hop chains and reasoning paths
  • Logging fallback logic and retries
  • Measuring latency at every node
  • Validating structure, type, and tone of model outputs
  • Stress-testing chains with adversarial mutations

Evaluation Modes

1. Scenario-Based Evals

Each test is defined as a realistic scenario, not a static prompt. We include:

  • User intent
  • Task complexity score
  • Expected outcome dimensions (e.g., reasoning steps, tone, schema)
{
  "scenario": "Summarize a hostile customer review into a neutral tone",
  "expected_traits": ["accuracy", "politeness", "emotional detachment"]
}
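
In code, a scenario record might look something like the sketch below; everything beyond scenario and expected_traits (user_intent, complexity, expected_schema) is an illustrative assumption rather than the actual eval schema.

# A hedged sketch of how a scenario record could be represented in code.
# Field names other than "scenario" and "expected_traits" are illustrative
# assumptions, not the actual TensorOne Evals schema.
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class Scenario:
    scenario: str                          # human-readable task description
    user_intent: str                       # what the user is trying to achieve
    complexity: int                        # task complexity score, e.g. 1-5
    expected_traits: list[str] = field(default_factory=list)  # tone/behavior dimensions
    expected_schema: dict | None = None    # structural expectations for the output

summarize_review = Scenario(
    scenario="Summarize a hostile customer review into a neutral tone",
    user_intent="Extract the factual complaint without the hostility",
    complexity=2,
    expected_traits=["accuracy", "politeness", "emotional detachment"],
)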

2. Agent Behavior Evaluation

We measure:

  • Message type distribution (inform, assert, escalate, etc.)
  • Argument chain length
  • Loop detection and resolution
  • Self-consistency across repeated runs
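
To make the loop-detection point concrete, here is a minimal sketch of how repeated agent messages could be flagged in a transcript; the transcript shape and the repetition threshold are assumptions, not the production detector.

# A minimal sketch of loop detection over an agent transcript.
# The transcript format (list of {"agent": ..., "content": ...} dicts) and the
# normalization/threshold are illustrative assumptions, not the real detector.
from collections import Counter

def detect_agent_loops(transcript: list[dict], threshold: int = 3) -> dict[str, int]:
    """Return agents whose (near-)identical messages repeat `threshold` or more times."""
    counts = Counter(
        (msg["agent"], " ".join(msg["content"].lower().split()))  # normalize case/whitespace
        for msg in transcript
    )
    loops: dict[str, int] = {}
    for (agent, _text), n in counts.items():
        if n >= threshold:
            loops[agent] = max(loops.get(agent, 0), n)
    return loops

transcript = [
    {"agent": "critic", "content": "Please revise the summary."},
    {"agent": "summarizer", "content": "Here is the revised summary."},
    {"agent": "critic", "content": "Please revise the summary."},
    {"agent": "summarizer", "content": "Here is the revised summary."},
    {"agent": "critic", "content": "Please revise the summary."},
]
print(detect_agent_loops(transcript))  # {'critic': 3}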

3. Fallback & Redundancy Testing

Each eval can trigger:

  • Endpoint failure
  • Schema violations
  • Timeout simulations
  • Memory inconsistency

MCP handles fallback routing, and eval logs show which models succeeded and why.
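
The sketch below shows the general shape of that flow, assuming a hypothetical call_model helper and made-up model names: an injected schema violation on the primary model triggers a logged fallback to the next model in the route. It illustrates the pattern, not the MCP implementation.

# A hedged sketch of fallback routing under an injected failure.
# `call_model` and the model names are hypothetical placeholders; the real
# routing lives in MCP and is driven by eval configuration, not this loop.
import time

class SchemaViolation(Exception):
    """Raised when a model's output fails type/schema validation."""

def call_model(model: str, prompt: str) -> str:
    # Placeholder: simulate the injected failure on the primary model.
    if model == "primary-model":
        raise SchemaViolation("output did not match the expected schema")
    return f"[{model}] valid, schema-conforming answer"

def run_with_fallback(prompt: str, route: list[str]) -> dict:
    log = {"fallback_triggered": False, "attempts": []}
    for model in route:
        start = time.monotonic()
        try:
            output = call_model(model, prompt)
            log["attempts"].append({"model": model, "ok": True,
                                    "latency_ms": int((time.monotonic() - start) * 1000)})
            log["output"] = output
            return log
        except (SchemaViolation, TimeoutError) as err:
            log["fallback_triggered"] = True
            log["attempts"].append({"model": model, "ok": False, "error": str(err)})
    raise RuntimeError("all models in the route failed")

print(run_with_fallback("Summarize the thread", ["primary-model", "backup-model"]))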


Key Metrics

TensorOne Evals captures and logs:

  • latency_ms – Time to complete a step or the full chain
  • retry_count – Number of failed → retried steps
  • output_valid – Whether the output passed type/schema checks
  • deviation_score – Distance from the expected answer format
  • fallback_triggered – Whether a fallback model was used
  • agent_loops – Loop patterns detected in agent replies

All metrics are timestamped, agent-tagged, and stored per-thread.
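
For illustration, a single per-step record could look roughly like the following; the values and identifiers are made up, not real log output.

# An illustrative (not real) metrics record: timestamped, agent-tagged, and
# keyed by thread, mirroring the metric names listed above.
from datetime import datetime, timezone

record = {
    "thread_id": "thread-000",            # hypothetical identifier
    "agent": "summarizer",                # agent that produced the step
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "latency_ms": 812,
    "retry_count": 1,
    "output_valid": True,
    "deviation_score": 0.07,
    "fallback_triggered": False,
    "agent_loops": 0,
}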


Tooling and Infra

  • Eval Orchestrator – Runs tests via CLI or SDK against live endpoints (an SDK-style sketch follows this list)
  • Mutation Engine – Applies adversarial mutations to prompt structure, memory, or tone
  • Eval Viewer – Internal UI to filter and inspect runs by model, agent, or scenario
  • LogSync – Mirrors eval logs to external analytics systems (BigQuery, Supabase, etc.)
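
In SDK terms, a run might look like the sketch below. The class and method names (EvalOrchestrator.run, MutationEngine.mutate) are stand-in stubs written for illustration, not the documented API.

# A hypothetical sketch of SDK-driven orchestration. EvalOrchestrator and
# MutationEngine here are stand-in stubs; the real classes, names, and
# signatures are not documented on this page.

class MutationEngine:
    def mutate(self, scenario: dict, kind: str) -> dict:
        # e.g. rewrite the prompt in a hostile tone, drop a memory entry, etc.
        return {**scenario, "mutation": kind}

class EvalOrchestrator:
    def run(self, scenario: dict, agents: list[str]) -> dict:
        # The real orchestrator executes against live endpoints and records
        # per-step metrics; this stub only shows the calling shape.
        return {"scenario": scenario["scenario"], "agents": agents,
                "metrics": {"latency_ms": 0, "output_valid": True}}

scenario = {
    "scenario": "Summarize a hostile customer review into a neutral tone",
    "expected_traits": ["accuracy", "politeness", "emotional detachment"],
}
result = EvalOrchestrator().run(MutationEngine().mutate(scenario, kind="tone_shift"),
                                agents=["planner", "summarizer"])
print(result["metrics"])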

Examples

Example Eval: Chain Robustness

Goal: Evaluate whether a 4-agent reasoning chain can recover from a failed summarization step.

scenario: 'Condense multi-agent conversation into a bullet list'
agents: [planner, researcher, critic, summarizer]
failure_injection:
  step: 'critic'
  type: 'schema_violation'
expected_behavior: 'Fallback or reroute to re-critique'

Outcome:

  • Retry triggered
  • Latency: 4200ms
  • Output: Valid
  • Route: gpt-4 → claude-3 → internal-fallback
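
An outcome like this can be checked mechanically. The sketch below mirrors the bullets above in a plain dict and asserts the expected recovery behavior; the field names and latency budget are illustrative assumptions.

# Illustrative check of the chain-robustness outcome above; the outcome dict
# mirrors the reported bullets and is not a real log record.
outcome = {
    "retry_triggered": True,
    "latency_ms": 4200,
    "output_valid": True,
    "route": ["gpt-4", "claude-3", "internal-fallback"],
}

def chain_recovered(outcome: dict, max_latency_ms: int = 10_000) -> bool:
    """Pass if the chain retried or rerouted, produced valid output, and stayed under budget."""
    rerouted = outcome["retry_triggered"] or len(outcome["route"]) > 1
    return rerouted and outcome["output_valid"] and outcome["latency_ms"] <= max_latency_ms

assert chain_recovered(outcome)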

Why We Built This

  • Static evals can't track real-time chain performance
  • Open-source tools don't capture agent behaviors or state transitions
  • We need confidence in fallback, retries, and degraded performance modes

TensorOne Evals bridges the gap between AI quality testing and runtime observability, helping us ship models and agents that are robust in production.


TensorOne Evals isn’t just about test coverage. It’s about trust.
Because when your agents are in the wild, you need more than accuracy—you need accountability.

