Overview
The Cluster Metrics endpoints provide detailed performance monitoring and analytics for GPU clusters, including GPU utilization, memory usage, temperature, power consumption, network traffic, and custom application metrics. These metrics are essential for performance optimization, cost management, and capacity planning.
Endpoints
Get Current Metrics
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics
Get Historical Metrics
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/historical
Get Aggregated Metrics
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/aggregated
Create Custom Metric
POST https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/custom
Get Current Metrics
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| metrics | array | No | Specific metrics to retrieve (default: all) |
| gpu_id | string | No | Specific GPU ID (default: all GPUs) |
| include_processes | boolean | No | Include running process information (default: false) |
| include_temperature | boolean | No | Include temperature sensors (default: true) |
| include_power | boolean | No | Include power consumption data (default: true) |
| include_custom | boolean | No | Include custom application metrics (default: true) |
Available Metrics
| Category | Metrics | Description |
|---|---|---|
| gpu | utilization, memory_used, memory_total, temperature, power | GPU hardware metrics |
| cpu | utilization, load_avg, memory_used, memory_total | CPU and system memory |
| storage | disk_used, disk_total, disk_io_read, disk_io_write | Storage and I/O metrics |
| network | bytes_sent, bytes_received, packets_sent, packets_received | Network traffic |
| application | custom_metrics, process_metrics | Application-specific metrics |
Request Examples
# Get all current metrics
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get GPU metrics only with process information
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics?metrics=gpu&include_processes=true" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get metrics for specific GPU
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics?gpu_id=0" \
  -H "Authorization: Bearer YOUR_API_KEY"
Get Historical Metrics
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| start_time | string | Yes | Start time (ISO 8601 format) |
| end_time | string | Yes | End time (ISO 8601 format) |
| interval | string | No | Data point interval: 1m, 5m, 15m, 1h, 6h, 1d (default: 5m) |
| metrics | array | No | Specific metrics to retrieve |
| aggregation | string | No | Aggregation method: avg, min, max, sum (default: avg) |
| gpu_id | string | No | Specific GPU ID |
# Get last 24 hours of GPU metrics
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics/historical?start_time=2024-01-14T00:00:00Z&end_time=2024-01-15T00:00:00Z&metrics=gpu&interval=1h" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response Schema (Current Metrics)
{
  "success": true,
  "data": {
    "cluster_id": "cluster_abc123",
    "timestamp": "2024-01-15T16:30:00Z",
    "gpu": {
      "utilization": {
        "0": 87.5,
        "1": 89.2,
        "2": 85.8,
        "3": 88.1
      },
      "memory": {
        "0": {
          "used": 68719476736,
          "total": 85899345920,
          "utilization_percent": 80.0
        },
        "1": {
          "used": 70866960384,
          "total": 85899345920,
          "utilization_percent": 82.5
        }
      },
      "temperature": {
        "0": 76.0,
        "1": 78.5,
        "2": 75.2,
        "3": 77.8
      },
      "power": {
        "0": 285.0,
        "1": 298.5,
        "2": 280.2,
        "3": 292.1
      },
      "clock_speeds": {
        "0": {
          "graphics_clock": 1755,
          "memory_clock": 877
        }
      }
    },
    "cpu": {
      "utilization": 45.2,
      "load_avg": [2.45, 2.12, 1.98],
      "memory_used": 137438953472,
      "memory_total": 274877906944,
      "cores": 32,
      "threads": 64
    },
    "storage": {
      "disk_used": 429496729600,
      "disk_total": 1099511627776,
      "utilization_percent": 39.1,
      "io_stats": {
        "read_bytes_per_sec": 52428800,
        "write_bytes_per_sec": 31457280,
        "read_ops_per_sec": 128,
        "write_ops_per_sec": 76
      }
    },
    "network": {
      "bytes_sent": 2684354560,
      "bytes_received": 1073741824,
      "packets_sent": 2048000,
      "packets_received": 1024000,
      "interfaces": [
        {
          "name": "eth0",
          "bytes_sent": 2684354560,
          "bytes_received": 1073741824,
          "speed_mbps": 1000,
          "status": "up"
        }
      ]
    },
    "processes": [
      {
        "pid": 1234,
        "name": "python",
        "command": "python train.py --model gpt2",
        "cpu_percent": 15.2,
        "memory_bytes": 8589934592,
        "gpu_memory_usage": 6442450944,
        "gpu_utilization": 85.0,
        "user": "ml_user",
        "runtime_seconds": 14400
      }
    ],
    "custom_metrics": {
      "training_loss": 0.234,
      "learning_rate": 0.001,
      "batch_size": 32,
      "epoch": 15,
      "samples_per_second": 1250.5
    }
  },
  "meta": {
    "collection_time_ms": 45,
    "metrics_version": "2.1.0"
  }
}
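To make the schema above concrete, here is a small Python sketch that flags GPUs running hot or near memory capacity. It reads only fields shown in the response above and reuses the hypothetical get_cluster_metrics helper sketched earlier; the thresholds are placeholders, not documented limits.

def find_pressure_points(data, temp_limit=80.0, mem_limit_percent=90.0):
    """Return GPU IDs above the given temperature or memory-utilization thresholds."""
    hot = [
        gpu_id for gpu_id, temp in data["gpu"]["temperature"].items()
        if temp >= temp_limit
    ]
    full = [
        gpu_id for gpu_id, mem in data["gpu"]["memory"].items()
        if mem["utilization_percent"] >= mem_limit_percent
    ]
    return {"hot_gpus": hot, "memory_pressure_gpus": full}

result = get_cluster_metrics("cluster_abc123")  # helper sketched earlier
if result.get("success"):
    # With the example response above, GPU "1" (78.5°C, 82.5% memory) trips both checks
    print(find_pressure_points(result["data"], temp_limit=78.0, mem_limit_percent=82.0))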
Custom Metrics
Create and track application-specific metrics for your workloads.
# Create custom training metric
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics/custom" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training_accuracy",
    "description": "Model training accuracy percentage",
    "type": "gauge",
    "unit": "percent",
    "tags": ["training", "ml", "accuracy"],
    "value": 94.5,
    "timestamp": "2024-01-15T16:30:00Z"
  }'
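The TrainingMonitor example under Use Cases below calls a create_custom_metric helper that is not defined there. A minimal sketch of such a helper, assuming the requests library and a TENSORONE_API_KEY environment variable, could look like this; the exact fields you send should follow the request body documented above.

import os
from datetime import datetime, timezone

import requests

def create_custom_metric(cluster_id, name, value, unit=None, tags=None, metric_type="gauge"):
    """Publish a single custom metric data point for a cluster."""
    payload = {
        "name": name,
        "type": metric_type,
        "value": value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if unit:
        payload["unit"] = unit
    if tags:
        payload["tags"] = tags
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/custom",
        headers={
            "Authorization": f"Bearer {os.environ['TENSORONE_API_KEY']}",
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=30,
    )
    return response.json()

# Example: report current training accuracy
create_custom_metric("cluster_abc123", "training_accuracy", 94.5,
                     unit="percent", tags=["training", "ml", "accuracy"])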
Use Cases
Training Performance Monitoring
Monitor ML training jobs with comprehensive metrics.
import time

# Assumes the thin API helpers sketched earlier on this page:
# get_cluster_metrics() and create_custom_metric().

class TrainingMonitor:
    def __init__(self, cluster_id, experiment_name):
        self.cluster_id = cluster_id
        self.experiment_name = experiment_name
        self.metrics_history = []

    def log_training_step(self, step, loss, accuracy, learning_rate):
        """Log metrics for a training step"""
        # Get current system metrics
        system_metrics = get_cluster_metrics(self.cluster_id, ["gpu", "cpu"])

        # Create custom training metrics
        training_metrics = {
            "training_loss": loss,
            "training_accuracy": accuracy,
            "learning_rate": learning_rate,
            "training_step": step
        }

        # Log custom metrics
        for name, value in training_metrics.items():
            create_custom_metric(
                self.cluster_id,
                f"{self.experiment_name}_{name}",
                value,
                tags=["training", self.experiment_name]
            )

        # Store for local analysis
        self.metrics_history.append({
            "step": step,
            "timestamp": time.time(),
            "training": training_metrics,
            "system": system_metrics.get("data", {}) if system_metrics.get("success") else {}
        })

        # Check for performance issues
        self.check_performance_alerts(system_metrics.get("data", {}))

    def check_performance_alerts(self, system_metrics):
        """Check for performance issues and alert"""
        alerts = []

        if "gpu" in system_metrics:
            # Check GPU utilization
            gpu_utils = system_metrics["gpu"].get("utilization", {})
            avg_util = sum(gpu_utils.values()) / len(gpu_utils) if gpu_utils else 0
            if avg_util < 70:
                alerts.append(f"Low GPU utilization: {avg_util:.1f}%")

            # Check GPU temperature
            gpu_temps = system_metrics["gpu"].get("temperature", {})
            max_temp = max(gpu_temps.values()) if gpu_temps else 0
            if max_temp > 83:
                alerts.append(f"High GPU temperature: {max_temp}°C")

        # Check memory usage
        if "cpu" in system_metrics:
            memory_used = system_metrics["cpu"].get("memory_used", 0)
            memory_total = system_metrics["cpu"].get("memory_total", 1)
            memory_percent = (memory_used / memory_total) * 100
            if memory_percent > 90:
                alerts.append(f"High memory usage: {memory_percent:.1f}%")

        if alerts:
            print(f"⚠️ Performance alerts for {self.experiment_name}:")
            for alert in alerts:
                print(f"  • {alert}")

    def get_training_summary(self):
        """Get training performance summary"""
        if not self.metrics_history:
            return {"error": "No metrics history available"}

        steps = [m["step"] for m in self.metrics_history]
        losses = [m["training"]["training_loss"] for m in self.metrics_history]
        accuracies = [m["training"]["training_accuracy"] for m in self.metrics_history]

        gpu_utils = []
        for m in self.metrics_history:
            if "gpu" in m["system"] and "utilization" in m["system"]["gpu"]:
                utils = m["system"]["gpu"]["utilization"].values()
                avg_util = sum(utils) / len(utils) if utils else 0
                gpu_utils.append(avg_util)

        return {
            "experiment": self.experiment_name,
            "total_steps": max(steps) if steps else 0,
            "final_loss": losses[-1] if losses else None,
            "final_accuracy": accuracies[-1] if accuracies else None,
            "loss_improvement": losses[0] - losses[-1] if len(losses) > 1 else 0,
            "avg_gpu_utilization": sum(gpu_utils) / len(gpu_utils) if gpu_utils else 0,
            "training_efficiency": self.calculate_efficiency()
        }

    def calculate_efficiency(self):
        """Calculate training efficiency score"""
        if not self.metrics_history:
            return 0

        # Efficiency based on GPU utilization and loss improvement
        gpu_utils = []
        for m in self.metrics_history:
            if "gpu" in m["system"] and "utilization" in m["system"]["gpu"]:
                utils = m["system"]["gpu"]["utilization"].values()
                avg_util = sum(utils) / len(utils) if utils else 0
                gpu_utils.append(avg_util)
        avg_gpu_util = sum(gpu_utils) / len(gpu_utils) if gpu_utils else 0

        losses = [m["training"]["training_loss"] for m in self.metrics_history]
        loss_improvement = (losses[0] - losses[-1]) / losses[0] if len(losses) > 1 and losses[0] > 0 else 0

        # Efficiency score (0-100)
        efficiency = (avg_gpu_util * 0.6 + loss_improvement * 100 * 0.4)
        return min(100, max(0, efficiency))

# Usage
monitor = TrainingMonitor("cluster_training_001", "gpt2_finetuning")

# During training loop
for step in range(1000):
    # ... training code ...
    loss = train_step()       # Your training function
    accuracy = evaluate()     # Your evaluation function
    lr = optimizer.param_groups[0]['lr']

    # Log metrics every 10 steps
    if step % 10 == 0:
        monitor.log_training_step(step, loss, accuracy, lr)

# Get final summary
summary = monitor.get_training_summary()
print(f"Training completed with {summary['training_efficiency']:.1f}% efficiency")
Cost Optimization Analytics
Analyze metrics to optimize cluster costs and performance.
class CostOptimizationAnalyzer {
  constructor(clusterId) {
    this.clusterId = clusterId;
  }

  async analyzeWeeklyUsage() {
    const endTime = new Date();
    const startTime = new Date(endTime.getTime() - 7 * 24 * 60 * 60 * 1000); // 7 days ago

    const metrics = await this.getHistoricalMetrics(startTime, endTime, '1h');
    if (!metrics.success) {
      throw new Error('Failed to get metrics for analysis');
    }

    const analysis = {
      period: {
        start: startTime.toISOString(),
        end: endTime.toISOString(),
        duration_hours: 168 // 7 days
      },
      utilization_analysis: this.analyzeUtilization(metrics.data.data_points),
      cost_efficiency: this.calculateCostEfficiency(metrics.data.data_points),
      optimization_recommendations: []
    };

    // Generate recommendations
    analysis.optimization_recommendations = this.generateRecommendations(analysis);
    return analysis;
  }

  analyzeUtilization(dataPoints) {
    const gpuUtils = [];
    const cpuUtils = [];
    const lowUtilizationHours = [];

    dataPoints.forEach((point, index) => {
      if (point.gpu?.utilization) {
        const avgGpuUtil = Object.values(point.gpu.utilization)
          .reduce((sum, util) => sum + util, 0) / Object.keys(point.gpu.utilization).length;
        gpuUtils.push(avgGpuUtil);

        if (avgGpuUtil < 20) {
          lowUtilizationHours.push({
            timestamp: point.timestamp,
            gpu_utilization: avgGpuUtil,
            hour_of_week: new Date(point.timestamp).getHours()
          });
        }
      }

      if (point.cpu?.utilization) {
        cpuUtils.push(point.cpu.utilization);
      }
    });

    return {
      gpu: {
        average: gpuUtils.reduce((sum, util) => sum + util, 0) / gpuUtils.length,
        min: Math.min(...gpuUtils),
        max: Math.max(...gpuUtils),
        low_utilization_hours: lowUtilizationHours.length,
        efficiency_score: this.calculateEfficiencyScore(gpuUtils)
      },
      cpu: {
        average: cpuUtils.reduce((sum, util) => sum + util, 0) / cpuUtils.length,
        min: Math.min(...cpuUtils),
        max: Math.max(...cpuUtils)
      }
    };
  }

  calculateEfficiencyScore(utilizations) {
    // Efficiency score based on how much time is spent at good utilization levels
    const goodUtilization = utilizations.filter(util => util >= 70 && util <= 95);
    return (goodUtilization.length / utilizations.length) * 100;
  }

  calculateCostEfficiency(dataPoints) {
    // Assume hourly rate is available in cluster info
    const hourlyRate = 8.50; // Would get this from cluster info
    const totalHours = dataPoints.length;
    const productiveHours = dataPoints.filter(point => {
      if (!point.gpu?.utilization) return false;
      const avgUtil = Object.values(point.gpu.utilization)
        .reduce((sum, util) => sum + util, 0) / Object.keys(point.gpu.utilization).length;
      return avgUtil >= 50; // Consider 50%+ utilization as productive
    }).length;

    const totalCost = totalHours * hourlyRate;
    const productiveCost = productiveHours * hourlyRate;
    const wastedCost = totalCost - productiveCost;

    return {
      total_cost: totalCost,
      productive_cost: productiveCost,
      wasted_cost: wastedCost,
      cost_efficiency_percent: (productiveCost / totalCost) * 100,
      potential_monthly_savings: (wastedCost / 7) * 30 // Weekly to monthly
    };
  }

  generateRecommendations(analysis) {
    const recommendations = [];

    // Low utilization recommendation
    if (analysis.utilization_analysis.gpu.average < 60) {
      recommendations.push({
        type: 'utilization',
        priority: 'high',
        title: 'Low GPU Utilization Detected',
        description: `Average GPU utilization is ${analysis.utilization_analysis.gpu.average.toFixed(1)}%`,
        suggestion: 'Consider using a smaller GPU type or implementing auto-scaling',
        potential_savings: analysis.cost_efficiency.potential_monthly_savings
      });
    }

    // Idle time recommendation
    if (analysis.utilization_analysis.gpu.low_utilization_hours > 24) {
      recommendations.push({
        type: 'scheduling',
        priority: 'medium',
        title: 'Extended Idle Periods',
        description: `${analysis.utilization_analysis.gpu.low_utilization_hours} hours of low utilization detected`,
        suggestion: 'Implement auto-termination or scheduled shutdown during idle periods',
        potential_savings: (analysis.utilization_analysis.gpu.low_utilization_hours * 8.50) * 4 // Weekly to monthly
      });
    }

    // Efficiency recommendation
    if (analysis.utilization_analysis.gpu.efficiency_score < 70) {
      recommendations.push({
        type: 'optimization',
        priority: 'medium',
        title: 'Low Efficiency Score',
        description: `GPU efficiency score is ${analysis.utilization_analysis.gpu.efficiency_score.toFixed(1)}%`,
        suggestion: 'Review workload distribution and consider batch size optimization',
        potential_improvement: '20-30% efficiency gain possible'
      });
    }

    return recommendations;
  }

  async getHistoricalMetrics(startTime, endTime, interval) {
    const params = new URLSearchParams({
      start_time: startTime.toISOString(),
      end_time: endTime.toISOString(),
      interval: interval,
      metrics: 'gpu,cpu'
    });

    const response = await fetch(
      `https://api.tensorone.ai/v1/clusters/${this.clusterId}/metrics/historical?${params}`,
      {
        headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
      }
    );
    return await response.json();
  }
}

// Usage
const analyzer = new CostOptimizationAnalyzer('cluster_abc123');
const analysis = await analyzer.analyzeWeeklyUsage();

console.log('Weekly Cost Analysis:');
console.log(`Total Cost: $${analysis.cost_efficiency.total_cost.toFixed(2)}`);
console.log(`Wasted Cost: $${analysis.cost_efficiency.wasted_cost.toFixed(2)}`);
console.log(`Efficiency: ${analysis.cost_efficiency.cost_efficiency_percent.toFixed(1)}%`);

analysis.optimization_recommendations.forEach((rec, index) => {
  console.log(`\nRecommendation ${index + 1}: ${rec.title}`);
  console.log(`Priority: ${rec.priority}`);
  console.log(`Description: ${rec.description}`);
  console.log(`Suggestion: ${rec.suggestion}`);
  if (rec.potential_savings) {
    console.log(`Potential Monthly Savings: $${rec.potential_savings.toFixed(2)}`);
  }
});
Error Handling
{
  "success": false,
  "error": {
    "code": "METRICS_NOT_AVAILABLE",
    "message": "Metrics collection is not available for this cluster",
    "details": {
      "cluster_status": "stopped",
      "reason": "Metrics are only available for running clusters",
      "suggestion": "Start the cluster to begin metrics collection"
    }
  }
}
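Client code can branch on the success flag and the error code. The Python sketch below, which reuses the hypothetical get_cluster_metrics helper from earlier, treats METRICS_NOT_AVAILABLE as a non-fatal condition (the cluster is simply not running) and raises on anything else.

def handle_metrics_response(result):
    """Return metric data, or None when collection is unavailable for the cluster."""
    if result.get("success"):
        return result["data"]
    error = result.get("error", {})
    if error.get("code") == "METRICS_NOT_AVAILABLE":
        # Cluster is not running; skip this collection cycle instead of failing
        print(f"Skipping metrics collection: {error.get('message')}")
        return None
    raise RuntimeError(f"{error.get('code')}: {error.get('message')}")

data = handle_metrics_response(get_cluster_metrics("cluster_abc123"))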
Best Practices
- Regular Monitoring: Set up automated monitoring for critical metrics (a minimal polling sketch follows this list)
- Alert Thresholds: Configure appropriate alert thresholds for your workloads
- Historical Analysis: Use historical data for capacity planning and optimization
- Custom Metrics: Track application-specific metrics for better insights
- Cost Optimization: Regular analysis of utilization metrics for cost savings
- Performance Tuning: Use metrics to identify and resolve performance bottlenecks
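As a concrete starting point for the first two practices above, here is a minimal Python polling sketch. It reuses the hypothetical get_cluster_metrics helper from earlier; the thresholds and the print-based alerting are placeholders to replace with your own values and alerting channel.

import time

# Example thresholds only; tune these to your workload
THRESHOLDS = {"gpu_temperature_c": 83.0, "gpu_utilization_min": 70.0}

def poll_and_alert(cluster_id, interval_seconds=60):
    """Poll current metrics on a fixed interval and print alerts when thresholds are crossed."""
    while True:
        result = get_cluster_metrics(cluster_id, metrics=["gpu"])  # helper sketched earlier
        if result.get("success"):
            gpu = result["data"]["gpu"]
            for gpu_id, temp in gpu["temperature"].items():
                if temp > THRESHOLDS["gpu_temperature_c"]:
                    print(f"ALERT: GPU {gpu_id} temperature {temp}°C")
            utils = gpu["utilization"]
            avg_util = sum(utils.values()) / len(utils)
            if avg_util < THRESHOLDS["gpu_utilization_min"]:
                print(f"ALERT: average GPU utilization {avg_util:.1f}%")
        time.sleep(interval_seconds)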
Authorizations
API key authentication. Use the 'Bearer YOUR_API_KEY' format in the Authorization header.
Path Parameters
cluster_id (string, required): ID of the cluster to retrieve metrics for.
Response
200 - application/json
Cluster metrics. The response is of type object.