Path Parameters
Unique identifier of the training job to evaluate
Request Body
Specific checkpoint to evaluate (defaults to the best checkpoint if not specified)
Human-readable name for this evaluation run
Test dataset configuration
List of metrics to compute during evaluation
Additional evaluation configuration
Optional baselines to compare against
Response
Unique identifier for this evaluation run
Evaluation status: one of queued, running, completed, or failed
Estimated time to complete evaluation
Real-time evaluation progress
Example
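Below is a minimal Python sketch of this request using the requests library. The base URL, the endpoint path, and every field name are assumptions inferred from the parameter descriptions above rather than a confirmed API surface; substitute the values for your own account.

```python
import os
import requests

# Hypothetical base URL and endpoint path; adjust to the real API reference.
API_BASE = "https://api.example.com/v1"
JOB_ID = "job_abc123"  # placeholder training job ID (path parameter)

payload = {
    # Field names below are illustrative guesses that mirror the request body
    # descriptions above; they are not a confirmed schema.
    "checkpoint": "best",                       # specific checkpoint to evaluate
    "name": "post-training-eval-v1",            # human-readable run name
    "test_dataset": {"id": "dataset_test_001"}, # test dataset configuration
    "metrics": ["accuracy", "precision", "recall", "f1"],
    "config": {"batch_size": 32},               # additional evaluation configuration
    "baselines": ["model_baseline_v0"],         # optional baselines to compare against
}

response = requests.post(
    f"{API_BASE}/training-jobs/{JOB_ID}/evaluations",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    timeout=30,
)
response.raise_for_status()

evaluation = response.json()
# The response is described above as containing an ID, a status, an estimated
# completion time, and real-time progress.
print(evaluation["id"], evaluation["status"])
```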
Get Evaluation Results
Retrieve detailed results from a completed evaluation:
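As a rough Python sketch, you could poll the evaluation until it finishes and then fetch its results. The endpoint paths, polling interval, and result fields are assumptions; only the status values come from the response description above.

```python
import os
import time
import requests

API_BASE = "https://api.example.com/v1"   # hypothetical base URL
EVALUATION_ID = "eval_xyz789"             # placeholder evaluation ID
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Poll until the evaluation reaches a terminal status (completed or failed).
while True:
    evaluation = requests.get(
        f"{API_BASE}/evaluations/{EVALUATION_ID}", headers=HEADERS, timeout=30
    ).json()
    if evaluation["status"] in ("completed", "failed"):
        break
    time.sleep(30)

# Fetch the detailed results once the run has completed.
if evaluation["status"] == "completed":
    results = requests.get(
        f"{API_BASE}/evaluations/{EVALUATION_ID}/results", headers=HEADERS, timeout=30
    ).json()
    print(results)
```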
Evaluation Types
Standard Evaluation
Basic model performance assessment on test data:
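A minimal sketch of what a standard evaluation request body might look like, reusing the assumed field names from the example above; none of these keys are a confirmed schema.

```python
# Illustrative request body for a standard evaluation (assumed field names).
standard_evaluation = {
    "name": "standard-eval",
    "test_dataset": {"id": "dataset_test_001"},
    "metrics": ["accuracy", "f1"],  # a small, general-purpose metric set
}
```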
Comprehensive Evaluation
Detailed analysis with multiple metrics and baselines:
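A sketch of a more detailed configuration with additional metrics and baseline comparisons; the nested options are assumptions meant only to illustrate the idea.

```python
# Illustrative request body for a comprehensive evaluation (assumed field names).
comprehensive_evaluation = {
    "name": "comprehensive-eval",
    "test_dataset": {"id": "dataset_test_001"},
    "metrics": ["accuracy", "precision", "recall", "f1", "roc_auc"],
    "baselines": ["model_prod_current", "model_baseline_v0"],
    "config": {
        "confidence_intervals": True,   # assumed option: report CIs per metric
        "per_class_breakdown": True,    # assumed option: per-class metric table
    },
}
```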
A/B Testing Evaluation
Compare multiple model versions:
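One way to sketch this is to submit the same test set and metrics for two different checkpoints and compare the resulting runs; the variant names and checkpoint IDs below are placeholders.

```python
# Illustrative A/B setup: two request bodies that differ only in the checkpoint,
# evaluated on the same test set so the comparison is fair (assumed field names).
ab_test_evaluations = [
    {
        "name": f"ab-test-variant-{variant}",
        "checkpoint": checkpoint,
        "test_dataset": {"id": "dataset_test_001"},
        "metrics": ["accuracy", "latency_ms"],
    }
    for variant, checkpoint in [("a", "checkpoint_100"), ("b", "checkpoint_200")]
]
```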
Domain-Specific Evaluation
Custom metrics for specialized domains:
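A sketch of a domain-specific configuration, using a hypothetical clinical-text use case; the custom metric names and the pointer to external metric definitions are assumptions.

```python
# Illustrative request body with custom, domain-specific metrics (assumed names).
domain_evaluation = {
    "name": "clinical-notes-eval",
    "test_dataset": {"id": "dataset_clinical_test"},
    "metrics": ["accuracy", "entity_f1", "negation_accuracy"],  # hypothetical custom metrics
    "config": {"custom_metric_definitions": "s3://my-bucket/metrics/clinical.py"},
}
```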
Advanced Evaluation Features
Error Analysis
Automatically analyze common failure patterns:
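A sketch of how such an option might be expressed in the evaluation configuration; every key here is an assumption used to illustrate the feature.

```python
# Illustrative error-analysis options inside the evaluation config (assumed keys).
error_analysis_config = {
    "config": {
        "error_analysis": {
            "enabled": True,
            "max_examples_per_pattern": 25,  # sample failures to include per pattern
            "cluster_failures": True,        # group similar failures into patterns
        }
    }
}
```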
Fairness Assessment
Evaluate model fairness across different groups:
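A sketch of a fairness-assessment block, assuming the test set carries group labels; the grouping columns and fairness metric names are assumptions.

```python
# Illustrative fairness-assessment options inside the evaluation config (assumed keys).
fairness_config = {
    "config": {
        "fairness": {
            "group_by": ["age_bucket", "region"],                # grouping columns in the test set
            "metrics": ["demographic_parity", "equalized_odds"],
        }
    }
}
```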
Robustness Testing
Test model performance under various conditions:
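A sketch of a robustness-testing block that perturbs inputs at several severity levels; the perturbation names and levels are assumptions.

```python
# Illustrative robustness-testing options inside the evaluation config (assumed keys).
robustness_config = {
    "config": {
        "robustness": {
            "perturbations": ["typos", "paraphrase", "noise_injection"],
            "severity_levels": [0.1, 0.3, 0.5],
        }
    }
}
```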
Evaluation Reports
Detailed evaluation reports include:
- Executive Summary: High-level performance overview
- Metric Analysis: Detailed breakdown of all computed metrics
- Confusion Matrix: Visual representation of classification results
- Error Analysis: Common failure patterns and examples
- Baseline Comparison: Performance vs baseline models
- Recommendations: Suggestions for model improvement
Best Practices
Test Dataset Preparation
- Use representative test data that mirrors production distribution
- Ensure test data is completely separate from training data
- Include edge cases and challenging examples
- Balance dataset if needed for fair evaluation
Metric Selection
- Choose metrics appropriate for your use case
- Include both aggregate and per-class metrics
- Consider business-relevant metrics beyond accuracy
- Use multiple metrics to get a comprehensive view
Evaluation Monitoring
Evaluation results are stored for 90 days and can be accessed anytime during this period. Download important reports for long-term storage.
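If the API exposes report downloads over HTTP, a sketch like the following could archive a report before the retention window closes; the endpoint path and file format are assumptions.

```python
import os
import requests

API_BASE = "https://api.example.com/v1"   # hypothetical base URL
EVALUATION_ID = "eval_xyz789"             # placeholder evaluation ID

# Download the full report and write it to disk for long-term storage.
report = requests.get(
    f"{API_BASE}/evaluations/{EVALUATION_ID}/report",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    timeout=60,
)
report.raise_for_status()
with open(f"{EVALUATION_ID}_report.pdf", "wb") as f:
    f.write(report.content)
```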
Use evaluation results to make data-driven decisions about model deployment, hyperparameter tuning, and data collection strategies.