Request Parameters
Resource type to monitor:
- all: All resources and services
- clusters: GPU clusters only
- endpoints: Serverless endpoints only
- training: Training jobs and services
- ai-services: AI generation services
- storage: Storage systems
- network: Network infrastructure
Time range for metrics:
- 5m: Last 5 minutes
- 15m: Last 15 minutes
- 1h: Last hour
- 6h: Last 6 hours
- 24h: Last 24 hours
- 7d: Last 7 days
- 30d: Last 30 days
Data point granularity:
- 10s: 10-second intervals
- 1m: 1-minute intervals
- 5m: 5-minute intervals
- 15m: 15-minute intervals
- 1h: 1-hour intervals
- 1d: Daily aggregation
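The time range and granularity together determine how many data points each metric series contains. As a rough sketch, assuming one data point per interval (which this page does not guarantee), the count can be estimated before sending a request. `estimate_points` is a hypothetical helper, not part of the API:

```python
# Durations in seconds for the documented time-range and granularity values.
TIME_RANGE_SECONDS = {
    "5m": 300, "15m": 900, "1h": 3600, "6h": 21600,
    "24h": 86400, "7d": 604800, "30d": 2592000,
}
GRANULARITY_SECONDS = {
    "10s": 10, "1m": 60, "5m": 300, "15m": 900, "1h": 3600, "1d": 86400,
}


def estimate_points(time_range: str, granularity: str) -> int:
    """Estimate data points per series, assuming one point per interval."""
    span = TIME_RANGE_SECONDS[time_range]
    step = GRANULARITY_SECONDS[granularity]
    if step > span:
        raise ValueError(
            f"granularity {granularity!r} is coarser than time range {time_range!r}"
        )
    return span // step
```

For instance, `estimate_points("30d", "10s")` comes to 259,200 points per series, which suggests pairing long time ranges with coarser granularity.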
Specific metrics to include:
- cpu_utilization: CPU usage percentage
- memory_utilization: Memory usage percentage
- gpu_utilization: GPU usage percentage
- disk_io: Disk I/O operations and throughput
- network_io: Network I/O operations and throughput
- response_time: API response times
- throughput: Request throughput
- error_rate: Error rates and failure counts
- queue_depth: Job queue depths
Specific resource IDs to monitor (cluster IDs, endpoint IDs, etc.)
Specific regions to include:
- us-east-1: US East (Virginia)
- us-west-2: US West (Oregon)
- eu-west-1: Europe (Ireland)
- ap-southeast-1: Asia Pacific (Singapore)
Aggregation method for data points:
- avg: Average values
- max: Maximum values
- min: Minimum values
- sum: Sum of values
- p95: 95th percentile
- p99: 99th percentile
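To illustrate what each aggregation method computes, here is a minimal local sketch over a list of raw samples. The percentile definition is an assumption: this page does not say which method the service uses, so the nearest-rank method is used here. `percentile` and `aggregate` are hypothetical helper names.

```python
from statistics import fmean


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition) for an integer pct."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Maps the documented aggregation names to local implementations.
AGGREGATORS = {
    "avg": fmean,
    "max": max,
    "min": min,
    "sum": sum,
    "p95": lambda v: percentile(v, 95),
    "p99": lambda v: percentile(v, 99),
}


def aggregate(values, method):
    """Reduce raw samples to one value using a documented aggregation name."""
    if not values:
        raise ValueError("no samples to aggregate")
    return AGGREGATORS[method](values)
```

Percentiles such as `p95` are usually more informative than `avg` for latency-style metrics like `response_time`, because a few slow requests barely move the average.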
Response
Time range information
Overall platform performance summary
Performance metrics by resource type
Platform-wide aggregated metrics
Performance-related alerts and anomalies
Example
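The following Python sketch assembles a query string from the documented parameter values. This page does not show the parameter names or the endpoint URL, so the names used here (`resource_type`, `time_range`, `granularity`, `aggregation`, `metrics`, `regions`, `resource_ids`), the default values, and the comma-separated list encoding are all assumptions. Check them against the actual API reference before use.

```python
from urllib.parse import urlencode

# Allowed values copied from the parameter lists; the key names are hypothetical.
ALLOWED = {
    "resource_type": {"all", "clusters", "endpoints", "training",
                      "ai-services", "storage", "network"},
    "time_range": {"5m", "15m", "1h", "6h", "24h", "7d", "30d"},
    "granularity": {"10s", "1m", "5m", "15m", "1h", "1d"},
    "aggregation": {"avg", "max", "min", "sum", "p95", "p99"},
}


def build_query(resource_type="all", time_range="1h", granularity="1m",
                aggregation="avg", metrics=None, regions=None,
                resource_ids=None):
    """Validate single-valued parameters and return an encoded query string."""
    params = {
        "resource_type": resource_type,
        "time_range": time_range,
        "granularity": granularity,
        "aggregation": aggregation,
    }
    for name, value in params.items():
        if value not in ALLOWED[name]:
            raise ValueError(f"unsupported {name}: {value!r}")
    # Assumption: list-valued parameters are sent as comma-separated strings.
    for name, items in (("metrics", metrics), ("regions", regions),
                        ("resource_ids", resource_ids)):
        if items:
            params[name] = ",".join(items)
    return urlencode(params)
```

Validating values on the client side catches typos such as `"1hr"` before a request is sent.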
Advanced Monitoring
Real-time Performance Dashboard
Create a real-time monitoring dashboard:
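A minimal text-dashboard sketch follows. It assumes you supply a `fetch` callable that returns the latest metric values as a mapping of metric name to percentage. That callable, the row layout, and the helper names are hypothetical. The 30-second default interval matches the update frequency stated on this page.

```python
import time
from typing import Callable, Dict


def render_row(name: str, value: float, width: int = 20) -> str:
    """Render one percentage metric as a fixed-width text bar."""
    clamped = max(0.0, min(100.0, value))
    filled = round(clamped / 100 * width)
    bar = "#" * filled + "." * (width - filled)
    return f"{name:<20} [{bar}] {value:5.1f}%"


def run_dashboard(fetch: Callable[[], Dict[str, float]],
                  refreshes: int,
                  interval_seconds: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep,
                  out: Callable[[str], None] = print) -> None:
    """Poll `fetch` a fixed number of times and print one frame per poll."""
    for i in range(refreshes):
        snapshot = fetch()  # hypothetical: wraps the metrics request
        frame = "\n".join(render_row(n, v) for n, v in sorted(snapshot.items()))
        out(frame)
        if i < refreshes - 1:
            sleep(interval_seconds)
```

Passing `sleep` and `out` as arguments keeps the loop testable without waiting or printing. Polling more often than the metrics update would only return repeated values.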
Performance Anomaly Detection
Detect unusual performance patterns:
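One simple way to flag unusual values locally is a z-score check against a baseline window. This is an illustrative stand-in, not the platform's own anomaly detection, whose method this page does not describe. The function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import fmean, pstdev


def find_anomalies(baseline, recent, z_threshold=3.0):
    """Return (index, value) pairs from `recent` that deviate from the baseline.

    A value is flagged when it lies more than `z_threshold` standard
    deviations from the baseline mean.
    """
    mean = fmean(baseline)
    spread = pstdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mean]
    return [(i, v) for i, v in enumerate(recent)
            if abs(v - mean) / spread > z_threshold]
```

For example, with a baseline of GPU utilization samples hovering around 50%, a reading of 90% would be flagged while 49% or 51% would not.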
Performance Optimization Recommendations
Get AI-powered optimization suggestions:
Best Practices
Monitoring Strategy
- Granularity: Use appropriate time granularity for your monitoring needs
- Baseline Establishment: Establish performance baselines for anomaly detection
- Alert Thresholds: Set meaningful thresholds based on historical data
- Resource-Specific Monitoring: Monitor different resource types with appropriate metrics
Performance Optimization
- Regular Analysis: Review performance metrics regularly for optimization opportunities
- Bottleneck Identification: Focus on the most constrained resources first
- Capacity Planning: Use trends to predict future resource needs
- Cost Optimization: Balance performance with cost considerations
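As a sketch of the capacity-planning idea above, a least-squares line fitted to daily utilization values can estimate when a threshold will be crossed. The helper name and the assumption of a roughly linear trend are illustrative only, and real workloads often need a seasonal or more robust model.

```python
def days_until_threshold(daily_values, threshold):
    """Estimate days from the last sample until the trend reaches `threshold`.

    Fits a least-squares line to evenly spaced daily samples. Returns None
    when the trend is flat or decreasing, and 0.0 if already at the threshold.
    """
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_values))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))
```

With daily storage utilization of 50, 52, 54, 56, and 58 percent and a threshold of 70 percent, the fitted trend rises 2 points per day and reaches the threshold 6 days after the last sample.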
Data Retention
- Historical Data: Keep sufficient historical data for trend analysis
- Aggregation: Use appropriate aggregation for long-term storage
- Archive Strategy: Archive old detailed metrics while keeping summaries
- Compliance: Ensure data retention meets compliance requirements
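The aggregation and archive points above can be sketched as a downsampling step that rolls detailed samples into coarser summary buckets before long-term storage. The record layout (`start`, `avg`, `max`, `count`) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Summarize (timestamp_seconds, value) samples into fixed-size buckets.

    Each summary keeps the average, the maximum, and the sample count so
    that peaks remain visible after the detailed data is archived.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        {
            "start": start,
            "avg": sum(vals) / len(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    ]
```

Keeping `max` next to `avg` matters because averaging alone hides short spikes, and those spikes are often what a later investigation needs.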
Performance metrics are updated every 30 seconds for real-time monitoring. Historical data is available for up to 90 days at full granularity.
Use performance baselines and anomaly detection to proactively identify issues before they impact users. Set up automated alerts for critical performance thresholds.