Get Cluster Details
curl --request GET \
  --url https://api.tensorone.ai/v1/clusters/{cluster_id} \
  --header 'Authorization: Bearer <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Get Cluster Details endpoint returns comprehensive information about a specific GPU cluster, including real-time status, performance metrics, network configuration, cost information, and connection details. It is essential for monitoring and managing individual clusters.

Endpoint

GET https://api.tensorone.ai/v1/clusters/{cluster_id}

Path Parameters

Parameter | Type | Required | Description
cluster_id | string | Yes | Unique cluster identifier

Query Parameters

Parameter | Type | Required | Description
include_metrics | boolean | No | Include real-time performance metrics (default: true)
include_logs | boolean | No | Include recent log entries (default: false)
include_cost_breakdown | boolean | No | Include detailed cost breakdown (default: true)
metrics_window | string | No | Metrics time window: 1h, 6h, 24h, 7d (default: 1h)

Request Examples

# Get basic cluster information
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

# Get cluster with detailed metrics and cost breakdown
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123?include_metrics=true&include_cost_breakdown=true&metrics_window=24h" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

# Get cluster with recent logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123?include_logs=true" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "description": "High-performance cluster for LLM training",
    "status": "running",
    "status_details": {
      "message": "Cluster is running normally",
      "last_status_change": "2024-01-15T14:35:00Z",
      "health_checks": {
        "gpu_health": "healthy",
        "storage_health": "healthy",
        "network_health": "healthy",
        "docker_health": "healthy"
      }
    },
    "configuration": {
      "gpu_type": "A100",
      "gpu_count": 4,
      "cpu_cores": 32,
      "memory_gb": 256,
      "storage_gb": 1000,
      "storage_type": "nvme",
      "region": "us-west-2",
      "availability_zone": "us-west-2a"
    },
    "project_info": {
      "project_id": "proj_456",
      "project_name": "ML Research Team",
      "owner_id": "user_789",
      "owner_email": "researcher@company.com"
    },
    "template_info": {
      "template_id": "tmpl_pytorch_latest",
      "template_name": "PyTorch 2.1 with CUDA 11.8",
      "template_version": "v2.1.0",
      "docker_image": "tensorone/pytorch:2.1-cuda11.8"
    },
    "network": {
      "private_ip": "10.0.1.15",
      "public_ip": "203.0.113.42",
      "proxy_url": "https://cluster-abc123.tensorone.ai",
      "ssh_connection": {
        "host": "ssh-abc123.tensorone.ai",
        "port": 22,
        "username": "root",
        "status": "connected",
        "connection_string": "ssh root@ssh-abc123.tensorone.ai",
        "last_connection": "2024-01-15T15:42:00Z"
      },
      "port_mappings": [
        {
          "internal_port": 8080,
          "external_port": 32001,
          "protocol": "tcp",
          "description": "Web Application",
          "url": "https://cluster-abc123.tensorone.ai:32001",
          "status": "active"
        },
        {
          "internal_port": 6006,
          "external_port": 32002,
          "protocol": "tcp",
          "description": "TensorBoard",
          "url": "https://cluster-abc123.tensorone.ai:32002",
          "status": "active"
        }
      ],
      "security_groups": ["sg_ml_training", "sg_ssh_access"],
      "firewall_rules": [
        {
          "direction": "inbound",
          "protocol": "tcp",
          "port_range": "22",
          "source": "0.0.0.0/0",
          "description": "SSH Access"
        }
      ]
    },
    "metrics": {
      "current": {
        "timestamp": "2024-01-15T15:45:00Z",
        "gpu_utilization": 87.3,
        "gpu_memory_utilization": 94.2,
        "cpu_utilization": 52.1,
        "memory_utilization": 68.4,
        "storage_utilization": 45.2,
        "network_rx_mbps": 125.3,
        "network_tx_mbps": 89.7,
        "temperature_celsius": 72.5,
        "power_usage_watts": 1250
      },
      "historical": {
        "window": "24h",
        "gpu_utilization": {
          "avg": 82.1,
          "min": 15.3,
          "max": 98.7,
          "trend": "increasing"
        },
        "memory_utilization": {
          "avg": 65.4,
          "min": 12.1,
          "max": 94.2,
          "trend": "stable"
        },
        "cost_efficiency": {
          "utilization_score": 85.2,
          "cost_per_compute_hour": 8.95
        }
      },
      "alerts": [
        {
          "type": "high_gpu_utilization",
          "severity": "info",
          "message": "GPU utilization consistently above 85%",
          "triggered_at": "2024-01-15T15:30:00Z"
        }
      ]
    },
    "cost": {
      "current_hourly_rate": 8.50,
      "currency": "USD",
      "session_cost": 68.25,
      "total_lifetime_cost": 284.75,
      "cost_breakdown": {
        "gpu_cost": 6.80,
        "cpu_cost": 0.85,
        "memory_cost": 0.45,
        "storage_cost": 0.25,
        "network_cost": 0.15
      },
      "billing_period": {
        "start": "2024-01-15T14:35:00Z",
        "current": "2024-01-15T15:45:00Z",
        "duration_hours": 1.17
      },
      "cost_projections": {
        "daily_estimate": 204.00,
        "weekly_estimate": 1428.00,
        "monthly_estimate": 6120.00
      }
    },
    "storage": {
      "volumes": [
        {
          "name": "root",
          "size_gb": 100,
          "used_gb": 45,
          "mount_path": "/",
          "type": "nvme",
          "encrypted": true
        },
        {
          "name": "data",
          "size_gb": 900,
          "used_gb": 230,
          "mount_path": "/data",
          "type": "nvme",
          "encrypted": true
        }
      ],
      "snapshots": [
        {
          "id": "snap_123",
          "name": "pre_training_snapshot",
          "size_gb": 45,
          "created_at": "2024-01-15T14:00:00Z"
        }
      ]
    },
    "environment": {
      "variables": {
        "CUDA_VISIBLE_DEVICES": "0,1,2,3",
        "NCCL_SOCKET_IFNAME": "eth0",
        "PYTHONPATH": "/workspace"
      },
      "secrets": ["WANDB_API_KEY", "HUGGINGFACE_TOKEN"],
      "runtime_info": {
        "python_version": "3.9.18",
        "cuda_version": "11.8",
        "driver_version": "520.61.05",
        "docker_version": "24.0.7"
      }
    },
    "auto_terminate": {
      "enabled": true,
      "idle_minutes": 60,
      "max_runtime_hours": 24,
      "cost_limit_usd": 500.0,
      "estimated_termination": "2024-01-16T14:35:00Z",
      "current_idle_minutes": 5
    },
    "uptime_seconds": 4200,
    "created_at": "2024-01-15T14:35:00Z",
    "updated_at": "2024-01-15T15:45:00Z",
    "last_accessed": "2024-01-15T15:42:00Z",
    "tags": {
      "team": "ml-research",
      "project": "llm-training",
      "environment": "production"
    }
  },
  "meta": {
    "request_id": "req_get_456",
    "response_time_ms": 89,
    "cache_hit": false,
    "data_freshness_seconds": 15
  }
}

Response Fields

Status Information

  • status: Current cluster state (starting, running, stopping, stopped, error)
  • status_details: Detailed status information and health checks
  • health_checks: Individual component health status
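Because a cluster moves through these states as it starts up, callers typically poll the endpoint until the status settles. A minimal sketch (`get_cluster` is a hypothetical helper that returns the parsed `data` object from this endpoint):

```python
import time

# States from which a cluster will not reach "running" on its own
TERMINAL_STATES = {"stopped", "error"}

def wait_until_running(get_cluster, cluster_id, timeout=600, interval=10):
    """Poll cluster status until it is 'running', a terminal state, or timeout.

    `get_cluster(cluster_id)` is assumed to return the parsed `data` object
    from GET /v1/clusters/{cluster_id}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_cluster(cluster_id)["status"]
        if status == "running":
            return True
        if status in TERMINAL_STATES:
            raise RuntimeError(f"Cluster {cluster_id} entered state '{status}'")
        time.sleep(interval)
    raise TimeoutError(f"Cluster {cluster_id} not running after {timeout}s")
```

Choose a polling interval that respects your rate limits; cluster startup usually takes on the order of minutes.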

Configuration Details

  • configuration: Hardware and software configuration
  • template_info: Information about the template used
  • environment: Environment variables and runtime information

Network and Connectivity

  • network: Complete networking configuration
  • ssh_connection: SSH access details and status
  • port_mappings: Exposed ports and their URLs
  • proxy_url: Main cluster access URL

Performance Metrics

  • metrics.current: Real-time performance data
  • metrics.historical: Historical performance trends
  • metrics.alerts: Active performance alerts

Cost Information

  • cost.current_hourly_rate: Current billing rate
  • cost.cost_breakdown: Detailed cost components
  • cost.cost_projections: Future cost estimates
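The projection fields are derived directly from the current hourly rate, so they can be cross-checked client-side. A sketch (field names mirror the response schema above; the monthly figure assumes a 30-day month, which matches the example response):

```python
def project_costs(hourly_rate):
    """Derive daily/weekly/monthly estimates from the current hourly rate,
    mirroring the cost_projections fields in the response."""
    daily = hourly_rate * 24
    return {
        "daily_estimate": round(daily, 2),
        "weekly_estimate": round(daily * 7, 2),
        "monthly_estimate": round(daily * 30, 2),  # 30-day month assumption
    }
```

For the example rate of $8.50/hour this reproduces the schema's 204.00 / 1428.00 / 6120.00 figures.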

Use Cases

Cluster Health Monitoring

Monitor cluster health and performance for proactive management.
import requests

API_KEY = "YOUR_API_KEY"

def check_cluster_health(cluster_id):
    response = requests.get(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"include_metrics": "true"}
    )
    response.raise_for_status()
    
    cluster = response.json()["data"]
    
    health_status = {
        "cluster_id": cluster_id,
        "status": cluster["status"],
        "healthy": True,
        "issues": []
    }
    
    # Check GPU utilization
    gpu_util = cluster["metrics"]["current"]["gpu_utilization"]
    if gpu_util < 10:
        health_status["issues"].append("Low GPU utilization - potential waste")
    elif gpu_util > 95:
        health_status["issues"].append("Very high GPU utilization - potential bottleneck")
    
    # Check temperature
    temp = cluster["metrics"]["current"]["temperature_celsius"]
    if temp > 85:
        health_status["issues"].append(f"High temperature: {temp}°C")
        health_status["healthy"] = False
    
    # Check cost efficiency
    hourly_cost = cluster["cost"]["current_hourly_rate"]
    if hourly_cost > 100:
        health_status["issues"].append(f"High hourly cost: ${hourly_cost}")
    
    return health_status

Connection Information Retrieval

Get connection details for accessing cluster services.
async function getClusterConnectionInfo(clusterId) {
  const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}`, {
    headers: {
      'Authorization': 'Bearer ' + API_KEY,
      'Content-Type': 'application/json'
    }
  });
  
  const cluster = await response.json();
  
  if (!cluster.success) {
    throw new Error(`Failed to get cluster info: ${cluster.error.message}`);
  }
  
  const data = cluster.data;
  
  return {
    cluster_id: clusterId,
    name: data.name,
    status: data.status,
    ssh: {
      host: data.network.ssh_connection.host,
      port: data.network.ssh_connection.port,
      username: data.network.ssh_connection.username,
      command: data.network.ssh_connection.connection_string
    },
    web_services: data.network.port_mappings.map(mapping => ({
      name: mapping.description,
      url: mapping.url,
      port: mapping.external_port,
      status: mapping.status
    })),
    proxy_url: data.network.proxy_url
  };
}

Cost Analysis and Optimization

Analyze cluster costs and identify optimization opportunities.
import requests

API_KEY = "YOUR_API_KEY"

def analyze_cluster_costs(cluster_id):
    response = requests.get(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "include_cost_breakdown": "true",
            "include_metrics": "true",
            "metrics_window": "7d"
        }
    )
    response.raise_for_status()
    
    cluster = response.json()["data"]
    
    analysis = {
        "cluster_id": cluster_id,
        "current_hourly_cost": cluster["cost"]["current_hourly_rate"],
        "utilization_efficiency": {},
        "cost_optimization_suggestions": []
    }
    
    # Calculate cost efficiency
    gpu_util = cluster["metrics"]["historical"]["gpu_utilization"]["avg"]
    cost_per_gpu_hour = cluster["cost"]["cost_breakdown"]["gpu_cost"]
    
    analysis["utilization_efficiency"] = {
        "gpu_utilization_avg": gpu_util,
        # Guard against division by zero when utilization is 0%
        "cost_per_useful_gpu_hour": cost_per_gpu_hour / max(gpu_util / 100, 0.01),
        "efficiency_score": min(gpu_util / 80 * 100, 100)  # 80% is target
    }
    
    # Generate optimization suggestions
    if gpu_util < 30:
        analysis["cost_optimization_suggestions"].append({
            "type": "downgrade_gpu",
            "message": f"Low GPU utilization ({gpu_util}%). Consider smaller GPU type.",
            "potential_savings_percent": 40
        })
    
    if cluster["uptime_seconds"] > 86400 and gpu_util < 20:  # 24 hours
        analysis["cost_optimization_suggestions"].append({
            "type": "auto_terminate",
            "message": "Long runtime with low utilization. Enable auto-termination.",
            "potential_savings_percent": 60
        })
    
    return analysis

Performance Monitoring Dashboard

Create real-time performance monitoring for multiple clusters.
class ClusterMonitor {
  constructor(clusterIds, apiKey) {
    this.clusterIds = clusterIds;
    this.apiKey = apiKey;
    this.metrics = new Map();
  }
  
  async updateMetrics() {
    const promises = this.clusterIds.map(async (clusterId) => {
      try {
        const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}?include_metrics=true`, {
          headers: {
            'Authorization': 'Bearer ' + this.apiKey,
            'Content-Type': 'application/json'
          }
        });
        
        const data = await response.json();
        
        if (data.success) {
          this.metrics.set(clusterId, {
            name: data.data.name,
            status: data.data.status,
            gpu_utilization: data.data.metrics.current.gpu_utilization,
            memory_utilization: data.data.metrics.current.memory_utilization,
            temperature: data.data.metrics.current.temperature_celsius,
            cost_per_hour: data.data.cost.current_hourly_rate,
            last_updated: new Date()
          });
        }
      } catch (error) {
        console.error(`Failed to update metrics for ${clusterId}:`, error);
      }
    });
    
    await Promise.all(promises);
    return this.metrics;
  }
  
  getAggregatedMetrics() {
    const clusters = Array.from(this.metrics.values());
    
    // Guard against division by zero before any metrics have been collected
    if (clusters.length === 0) {
      return { total_clusters: 0, running_clusters: 0, average_gpu_utilization: 0, total_hourly_cost: 0, high_temperature_count: 0 };
    }
    
    return {
      total_clusters: clusters.length,
      running_clusters: clusters.filter(c => c.status === 'running').length,
      average_gpu_utilization: clusters.reduce((sum, c) => sum + c.gpu_utilization, 0) / clusters.length,
      total_hourly_cost: clusters.reduce((sum, c) => sum + c.cost_per_hour, 0),
      high_temperature_count: clusters.filter(c => c.temperature > 80).length
    };
  }
}

Error Handling

{
  "success": false,
  "error": {
    "code": "CLUSTER_NOT_FOUND",
    "message": "Cluster with ID 'cluster_invalid' not found",
    "details": {
      "cluster_id": "cluster_invalid",
      "suggestion": "Verify the cluster ID and ensure you have access to this cluster"
    }
  }
}
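In client code, it is worth distinguishing a missing cluster from other failures when unpacking this envelope. A minimal sketch (`parse_cluster_response` and `ClusterNotFound` are hypothetical helpers; the envelope shape follows the example above):

```python
class ClusterNotFound(Exception):
    """Raised when the API reports the CLUSTER_NOT_FOUND error code."""

def parse_cluster_response(body):
    """Return the `data` object from a successful response envelope.

    Raises ClusterNotFound for the CLUSTER_NOT_FOUND error code and
    RuntimeError for any other error envelope.
    """
    if body.get("success"):
        return body["data"]
    error = body.get("error", {})
    if error.get("code") == "CLUSTER_NOT_FOUND":
        raise ClusterNotFound(error.get("message", "cluster not found"))
    raise RuntimeError(error.get("message", "request failed"))
```

Catching the not-found case separately lets callers prompt for a corrected cluster ID rather than retrying a request that will never succeed.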

Security Considerations

  • Access Control: Ensure proper permissions for cluster access
  • Sensitive Data: Cluster details may contain sensitive configuration information
  • API Key Security: Use secure storage for API keys with appropriate scopes
  • Network Security: Monitor exposed ports and access patterns

Best Practices

  1. Regular Monitoring: Check cluster health and metrics regularly
  2. Cost Awareness: Monitor costs and set up alerts for unexpected charges
  3. Performance Optimization: Use metrics to optimize cluster configurations
  4. Security Compliance: Regularly review access logs and security settings
  5. Resource Planning: Use historical data for future capacity planning

Authorizations

Authorization
string
header
required

API key authentication. Use 'Bearer YOUR_API_KEY' format.

Response

Cluster details

The response is of type object.