Get Cluster Metrics

Overview

Cluster Metrics provide detailed performance monitoring and analytics for GPU clusters, including GPU utilization, memory usage, temperature, power consumption, network traffic, and custom application metrics. They are essential for performance optimization, cost management, and capacity planning.

Endpoints

Get Current Metrics

GET https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics

Get Historical Metrics

GET https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/historical

Get Aggregated Metrics

GET https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/aggregated

Create Custom Metric

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/metrics/custom

Get Current Metrics

Query Parameters

Parameter | Type | Required | Description
metrics | array | No | Specific metrics to retrieve (default: all)
gpu_id | string | No | Specific GPU ID (default: all GPUs)
include_processes | boolean | No | Include running process information (default: false)
include_temperature | boolean | No | Include temperature sensors (default: true)
include_power | boolean | No | Include power consumption data (default: true)
include_custom | boolean | No | Include custom application metrics (default: true)

Available Metrics

Category | Metrics | Description
gpu | utilization, memory_used, memory_total, temperature, power | GPU hardware metrics
cpu | utilization, load_avg, memory_used, memory_total | CPU and system memory
storage | disk_used, disk_total, disk_io_read, disk_io_write | Storage and I/O metrics
network | bytes_sent, bytes_received, packets_sent, packets_received | Network traffic
application | custom_metrics, process_metrics | Application-specific metrics

Request Examples

# Get all current metrics
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get GPU metrics only with process information
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics?metrics=gpu&include_processes=true" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get metrics for specific GPU
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics?gpu_id=0" \
  -H "Authorization: Bearer YOUR_API_KEY"

Historical Metrics

Query Parameters

Parameter | Type | Required | Description
start_time | string | Yes | Start time (ISO 8601 format)
end_time | string | Yes | End time (ISO 8601 format)
interval | string | No | Data point interval: 1m, 5m, 15m, 1h, 6h, 1d (default: 5m)
metrics | array | No | Specific metrics to retrieve
aggregation | string | No | Aggregation method: avg, min, max, sum (default: avg)
gpu_id | string | No | Specific GPU ID

# Get last 24 hours of GPU metrics
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics/historical?start_time=2024-01-14T00:00:00Z&end_time=2024-01-15T00:00:00Z&metrics=gpu&interval=1h" \
  -H "Authorization: Bearer YOUR_API_KEY"

Current Metrics Response Schema

{
  "success": true,
  "data": {
    "cluster_id": "cluster_abc123",
    "timestamp": "2024-01-15T16:30:00Z",
    "gpu": {
      "utilization": {
        "0": 87.5,
        "1": 89.2,
        "2": 85.8,
        "3": 88.1
      },
      "memory": {
        "0": {
          "used": 68719476736,
          "total": 85899345920,
          "utilization_percent": 80.0
        },
        "1": {
          "used": 70866960384,
          "total": 85899345920,
          "utilization_percent": 82.5
        }
      },
      "temperature": {
        "0": 76.0,
        "1": 78.5,
        "2": 75.2,
        "3": 77.8
      },
      "power": {
        "0": 285.0,
        "1": 298.5,
        "2": 280.2,
        "3": 292.1
      },
      "clock_speeds": {
        "0": {
          "graphics_clock": 1755,
          "memory_clock": 877
        }
      }
    },
    "cpu": {
      "utilization": 45.2,
      "load_avg": [2.45, 2.12, 1.98],
      "memory_used": 137438953472,
      "memory_total": 274877906944,
      "cores": 32,
      "threads": 64
    },
    "storage": {
      "disk_used": 429496729600,
      "disk_total": 1099511627776,
      "utilization_percent": 39.1,
      "io_stats": {
        "read_bytes_per_sec": 52428800,
        "write_bytes_per_sec": 31457280,
        "read_ops_per_sec": 128,
        "write_ops_per_sec": 76
      }
    },
    "network": {
      "bytes_sent": 2684354560,
      "bytes_received": 1073741824,
      "packets_sent": 2048000,
      "packets_received": 1024000,
      "interfaces": [
        {
          "name": "eth0",
          "bytes_sent": 2684354560,
          "bytes_received": 1073741824,
          "speed_mbps": 1000,
          "status": "up"
        }
      ]
    },
    "processes": [
      {
        "pid": 1234,
        "name": "python",
        "command": "python train.py --model gpt2",
        "cpu_percent": 15.2,
        "memory_bytes": 8589934592,
        "gpu_memory_usage": 6442450944,
        "gpu_utilization": 85.0,
        "user": "ml_user",
        "runtime_seconds": 14400
      }
    ],
    "custom_metrics": {
      "training_loss": 0.234,
      "learning_rate": 0.001,
      "batch_size": 32,
      "epoch": 15,
      "samples_per_second": 1250.5
    }
  },
  "meta": {
    "collection_time_ms": 45,
    "metrics_version": "2.1.0"
  }
}
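
The snapshot can be consumed directly. The sketch below walks the gpu.temperature and gpu.utilization maps from the schema above and flags hot or idle GPUs; the 80°C and 50% thresholds are illustrative assumptions, not documented limits.
# Sketch: inspect a current-metrics snapshot and flag hot or idle GPUs.
# Thresholds (80°C, 50%) are illustrative assumptions, not documented limits.
def flag_gpu_issues(metrics_response):
    data = metrics_response.get("data", {})
    gpu = data.get("gpu", {})
    temps = gpu.get("temperature", {})
    utils = gpu.get("utilization", {})

    issues = []
    for gpu_id, temp in temps.items():
        if temp > 80.0:
            issues.append(f"GPU {gpu_id}: high temperature {temp:.1f}°C")
    for gpu_id, util in utils.items():
        if util < 50.0:
            issues.append(f"GPU {gpu_id}: low utilization {util:.1f}%")
    return issues

# Usage with a decoded JSON response:
# for issue in flag_gpu_issues(response_json):
#     print(issue)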

Custom Metrics

Create and track application-specific metrics for your workloads.
# Create custom training metric
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/metrics/custom" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training_accuracy",
    "description": "Model training accuracy percentage",
    "type": "gauge",
    "unit": "percent",
    "tags": ["training", "ml", "accuracy"],
    "value": 94.5,
    "timestamp": "2024-01-15T16:30:00Z"
  }'

Use Cases

Training Performance Monitoring

Monitor ML training jobs with comprehensive metrics. The class below assumes two small helper functions, get_cluster_metrics and create_custom_metric, which are sketched first.
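
A minimal sketch of those helpers, assuming the requests package and a TENSORONE_API_KEY environment variable; they only wrap the current-metrics and custom-metric endpoints documented above and are not part of an official SDK.
# Sketch of the helpers used by TrainingMonitor below. Assumes `requests` and a
# TENSORONE_API_KEY environment variable; not part of an official SDK.
import os
import time
import requests

API_BASE = "https://api.tensorone.ai/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['TENSORONE_API_KEY']}"}

def get_cluster_metrics(cluster_id, metrics=None):
    """Fetch current metrics, optionally limited to specific categories."""
    params = {"metrics": ",".join(metrics)} if metrics else {}
    resp = requests.get(f"{API_BASE}/clusters/{cluster_id}/metrics",
                        headers=HEADERS, params=params)
    return resp.json()

def create_custom_metric(cluster_id, name, value, tags=None):
    """Record a custom gauge metric against the cluster."""
    body = {"name": name, "type": "gauge", "value": value, "tags": tags or []}
    resp = requests.post(f"{API_BASE}/clusters/{cluster_id}/metrics/custom",
                         headers=HEADERS, json=body)
    return resp.json()

With those in place, the monitoring class itself: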
class TrainingMonitor:
    def __init__(self, cluster_id, experiment_name):
        self.cluster_id = cluster_id
        self.experiment_name = experiment_name
        self.metrics_history = []
        
    def log_training_step(self, step, loss, accuracy, learning_rate):
        """Log metrics for a training step"""
        
        # Get current system metrics
        system_metrics = get_cluster_metrics(self.cluster_id, ["gpu", "cpu"])
        
        # Create custom training metrics
        training_metrics = {
            "training_loss": loss,
            "training_accuracy": accuracy,
            "learning_rate": learning_rate,
            "training_step": step
        }
        
        # Log custom metrics
        for name, value in training_metrics.items():
            create_custom_metric(
                self.cluster_id,
                f"{self.experiment_name}_{name}",
                value,
                tags=["training", self.experiment_name]
            )
        
        # Store for local analysis
        self.metrics_history.append({
            "step": step,
            "timestamp": time.time(),
            "training": training_metrics,
            "system": system_metrics.get("data", {}) if system_metrics.get("success") else {}
        })
        
        # Check for performance issues
        self.check_performance_alerts(system_metrics.get("data", {}))
    
    def check_performance_alerts(self, system_metrics):
        """Check for performance issues and alert"""
        
        alerts = []
        
        # Check GPU utilization
        if "gpu" in system_metrics:
            gpu_utils = system_metrics["gpu"].get("utilization", {})
            avg_util = sum(gpu_utils.values()) / len(gpu_utils) if gpu_utils else 0
            
            if avg_util < 70:
                alerts.append(f"Low GPU utilization: {avg_util:.1f}%")
            
            # Check GPU temperature
            gpu_temps = system_metrics["gpu"].get("temperature", {})
            max_temp = max(gpu_temps.values()) if gpu_temps else 0
            
            if max_temp > 83:
                alerts.append(f"High GPU temperature: {max_temp}°C")
        
        # Check memory usage
        if "cpu" in system_metrics:
            memory_used = system_metrics["cpu"].get("memory_used", 0)
            memory_total = system_metrics["cpu"].get("memory_total", 1)
            memory_percent = (memory_used / memory_total) * 100
            
            if memory_percent > 90:
                alerts.append(f"High memory usage: {memory_percent:.1f}%")
        
        if alerts:
            print(f"⚠️  Performance alerts for {self.experiment_name}:")
            for alert in alerts:
                print(f"   • {alert}")
    
    def get_training_summary(self):
        """Get training performance summary"""
        
        if not self.metrics_history:
            return {"error": "No metrics history available"}
        
        steps = [m["step"] for m in self.metrics_history]
        losses = [m["training"]["training_loss"] for m in self.metrics_history]
        accuracies = [m["training"]["training_accuracy"] for m in self.metrics_history]
        
        gpu_utils = []
        for m in self.metrics_history:
            if "gpu" in m["system"] and "utilization" in m["system"]["gpu"]:
                utils = m["system"]["gpu"]["utilization"].values()
                avg_util = sum(utils) / len(utils) if utils else 0
                gpu_utils.append(avg_util)
        
        return {
            "experiment": self.experiment_name,
            "total_steps": max(steps) if steps else 0,
            "final_loss": losses[-1] if losses else None,
            "final_accuracy": accuracies[-1] if accuracies else None,
            "loss_improvement": losses[0] - losses[-1] if len(losses) > 1 else 0,
            "avg_gpu_utilization": sum(gpu_utils) / len(gpu_utils) if gpu_utils else 0,
            "training_efficiency": self.calculate_efficiency()
        }
    
    def calculate_efficiency(self):
        """Calculate training efficiency score"""
        
        if not self.metrics_history:
            return 0
        
        # Efficiency based on GPU utilization and loss improvement
        gpu_utils = []
        for m in self.metrics_history:
            if "gpu" in m["system"] and "utilization" in m["system"]["gpu"]:
                utils = m["system"]["gpu"]["utilization"].values()
                avg_util = sum(utils) / len(utils) if utils else 0
                gpu_utils.append(avg_util)
        
        avg_gpu_util = sum(gpu_utils) / len(gpu_utils) if gpu_utils else 0
        
        losses = [m["training"]["training_loss"] for m in self.metrics_history]
        loss_improvement = (losses[0] - losses[-1]) / losses[0] if len(losses) > 1 and losses[0] > 0 else 0
        
        # Efficiency score (0-100)
        efficiency = (avg_gpu_util * 0.6 + loss_improvement * 100 * 0.4)
        return min(100, max(0, efficiency))

# Usage
monitor = TrainingMonitor("cluster_training_001", "gpt2_finetuning")

# During training loop
for step in range(1000):
    # ... training code ...
    loss = train_step()  # Your training function
    accuracy = evaluate()  # Your evaluation function
    lr = optimizer.param_groups[0]['lr']
    
    # Log metrics every 10 steps
    if step % 10 == 0:
        monitor.log_training_step(step, loss, accuracy, lr)

# Get final summary
summary = monitor.get_training_summary()
print(f"Training completed with {summary['training_efficiency']:.1f}% efficiency")

Cost Optimization Analytics

Analyze metrics to optimize cluster costs and performance.
class CostOptimizationAnalyzer {
  constructor(clusterId) {
    this.clusterId = clusterId;
  }
  
  async analyzeWeeklyUsage() {
    const endTime = new Date();
    const startTime = new Date(endTime.getTime() - 7 * 24 * 60 * 60 * 1000); // 7 days ago
    
    const metrics = await this.getHistoricalMetrics(startTime, endTime, '1h');
    
    if (!metrics.success) {
      throw new Error('Failed to get metrics for analysis');
    }
    
    const analysis = {
      period: {
        start: startTime.toISOString(),
        end: endTime.toISOString(),
        duration_hours: 168 // 7 days
      },
      utilization_analysis: this.analyzeUtilization(metrics.data.data_points),
      cost_efficiency: this.calculateCostEfficiency(metrics.data.data_points),
      optimization_recommendations: []
    };
    
    // Generate recommendations
    analysis.optimization_recommendations = this.generateRecommendations(analysis);
    
    return analysis;
  }
  
  analyzeUtilization(dataPoints) {
    const gpuUtils = [];
    const cpuUtils = [];
    const lowUtilizationHours = [];
    
    dataPoints.forEach((point, index) => {
      if (point.gpu?.utilization) {
        const avgGpuUtil = Object.values(point.gpu.utilization)
          .reduce((sum, util) => sum + util, 0) / Object.keys(point.gpu.utilization).length;
        gpuUtils.push(avgGpuUtil);
        
        if (avgGpuUtil < 20) {
          lowUtilizationHours.push({
            timestamp: point.timestamp,
            gpu_utilization: avgGpuUtil,
            hour_of_day: new Date(point.timestamp).getHours()
          });
        }
      }
      
      if (point.cpu?.utilization) {
        cpuUtils.push(point.cpu.utilization);
      }
    });
    
    return {
      gpu: {
        average: gpuUtils.reduce((sum, util) => sum + util, 0) / gpuUtils.length,
        min: Math.min(...gpuUtils),
        max: Math.max(...gpuUtils),
        low_utilization_hours: lowUtilizationHours.length,
        efficiency_score: this.calculateEfficiencyScore(gpuUtils)
      },
      cpu: {
        average: cpuUtils.reduce((sum, util) => sum + util, 0) / cpuUtils.length,
        min: Math.min(...cpuUtils),
        max: Math.max(...cpuUtils)
      }
    };
  }
  
  calculateEfficiencyScore(utilizations) {
    // Efficiency score based on how much time is spent at good utilization levels
    const goodUtilization = utilizations.filter(util => util >= 70 && util <= 95);
    return (goodUtilization.length / utilizations.length) * 100;
  }
  
  calculateCostEfficiency(dataPoints) {
    // Assume hourly rate is available in cluster info
    const hourlyRate = 8.50; // Would get this from cluster info
    
    const totalHours = dataPoints.length;
    const productiveHours = dataPoints.filter(point => {
      if (!point.gpu?.utilization) return false;
      const avgUtil = Object.values(point.gpu.utilization)
        .reduce((sum, util) => sum + util, 0) / Object.keys(point.gpu.utilization).length;
      return avgUtil >= 50; // Consider 50%+ utilization as productive
    }).length;
    
    const totalCost = totalHours * hourlyRate;
    const productiveCost = productiveHours * hourlyRate;
    const wastedCost = totalCost - productiveCost;
    
    return {
      total_cost: totalCost,
      productive_cost: productiveCost,
      wasted_cost: wastedCost,
      cost_efficiency_percent: (productiveCost / totalCost) * 100,
      potential_monthly_savings: (wastedCost / 7) * 30 // Weekly to monthly
    };
  }
  
  generateRecommendations(analysis) {
    const recommendations = [];
    
    // Low utilization recommendation
    if (analysis.utilization_analysis.gpu.average < 60) {
      recommendations.push({
        type: 'utilization',
        priority: 'high',
        title: 'Low GPU Utilization Detected',
        description: `Average GPU utilization is ${analysis.utilization_analysis.gpu.average.toFixed(1)}%`,
        suggestion: 'Consider using a smaller GPU type or implementing auto-scaling',
        potential_savings: analysis.cost_efficiency.potential_monthly_savings
      });
    }
    
    // Idle time recommendation
    if (analysis.utilization_analysis.gpu.low_utilization_hours > 24) {
      recommendations.push({
        type: 'scheduling',
        priority: 'medium',
        title: 'Extended Idle Periods',
        description: `${analysis.utilization_analysis.gpu.low_utilization_hours} hours of low utilization detected`,
        suggestion: 'Implement auto-termination or scheduled shutdown during idle periods',
        potential_savings: (analysis.utilization_analysis.gpu.low_utilization_hours * 8.50) * 4 // Weekly to monthly
      });
    }
    
    // Efficiency recommendation
    if (analysis.utilization_analysis.gpu.efficiency_score < 70) {
      recommendations.push({
        type: 'optimization',
        priority: 'medium',
        title: 'Low Efficiency Score',
        description: `GPU efficiency score is ${analysis.utilization_analysis.gpu.efficiency_score.toFixed(1)}%`,
        suggestion: 'Review workload distribution and consider batch size optimization',
        potential_improvement: '20-30% efficiency gain possible'
      });
    }
    
    return recommendations;
  }
  
  async getHistoricalMetrics(startTime, endTime, interval) {
    const params = new URLSearchParams({
      start_time: startTime.toISOString(),
      end_time: endTime.toISOString(),
      interval: interval,
      metrics: 'gpu,cpu'
    });
    
    const response = await fetch(
      `https://api.tensorone.ai/v1/clusters/${this.clusterId}/metrics/historical?${params}`,
      {
        headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
      }
    );
    
    return await response.json();
  }
}

// Usage
const analyzer = new CostOptimizationAnalyzer('cluster_abc123');
const analysis = await analyzer.analyzeWeeklyUsage();

console.log('Weekly Cost Analysis:');
console.log(`Total Cost: $${analysis.cost_efficiency.total_cost.toFixed(2)}`);
console.log(`Wasted Cost: $${analysis.cost_efficiency.wasted_cost.toFixed(2)}`);
console.log(`Efficiency: ${analysis.cost_efficiency.cost_efficiency_percent.toFixed(1)}%`);

analysis.optimization_recommendations.forEach((rec, index) => {
  console.log(`\nRecommendation ${index + 1}: ${rec.title}`);
  console.log(`Priority: ${rec.priority}`);
  console.log(`Description: ${rec.description}`);
  console.log(`Suggestion: ${rec.suggestion}`);
  if (rec.potential_savings) {
    console.log(`Potential Monthly Savings: $${rec.potential_savings.toFixed(2)}`);
  }
});

Error Handling

Requests made while a cluster is not running return a structured error, for example:
{
  "success": false,
  "error": {
    "code": "METRICS_NOT_AVAILABLE",
    "message": "Metrics collection is not available for this cluster",
    "details": {
      "cluster_status": "stopped",
      "reason": "Metrics are only available for running clusters",
      "suggestion": "Start the cluster to begin metrics collection"
    }
  }
}
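
A small sketch of handling this error shape in Python; it only covers the METRICS_NOT_AVAILABLE code shown above and assumes the same success/error envelope for other failures.
# Sketch: handle a metrics error response. Field names follow the error payload above.
def handle_metrics_response(payload):
    if payload.get("success"):
        return payload["data"]

    error = payload.get("error", {})
    if error.get("code") == "METRICS_NOT_AVAILABLE":
        details = error.get("details", {})
        print(f"Metrics unavailable (cluster status: {details.get('cluster_status')})")
        print(f"Suggestion: {details.get('suggestion')}")
    else:
        print(f"Metrics request failed: {error.get('code')}: {error.get('message')}")
    return None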

Best Practices

  1. Regular Monitoring: Set up automated monitoring for critical metrics
  2. Alert Thresholds: Configure appropriate alert thresholds for your workloads (see the sketch after this list)
  3. Historical Analysis: Use historical data for capacity planning and optimization
  4. Custom Metrics: Track application-specific metrics for better insights
  5. Cost Optimization: Regular analysis of utilization metrics for cost savings
  6. Performance Tuning: Use metrics to identify and resolve performance bottlenecks
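
For items 1 and 2, a minimal polling sketch is shown below. It reuses the get_cluster_metrics and flag_gpu_issues helpers sketched earlier on this page; the 60-second interval is an arbitrary assumption.
# Sketch: periodic monitoring with simple alert thresholds. Reuses the helpers
# sketched above (get_cluster_metrics, flag_gpu_issues); the interval is arbitrary.
import time

def monitor_cluster(cluster_id, interval_seconds=60):
    while True:
        snapshot = get_cluster_metrics(cluster_id, ["gpu", "cpu"])
        if snapshot.get("success"):
            for issue in flag_gpu_issues(snapshot):
                print(f"ALERT [{cluster_id}]: {issue}")
        time.sleep(interval_seconds)

# monitor_cluster("cluster_abc123")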

Authorizations

Authorization (string, header, required): API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id (string, required): Unique identifier of the cluster to query.

Response

200 - application/json: Cluster metrics. The response is of type object.