Restart Cluster
curl --request POST \
  --url https://api.tensorone.ai/v1/clusters/{cluster_id}/restart \
  --header 'Authorization: Bearer <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Restart Cluster endpoint allows you to restart running or stopped GPU clusters, optionally applying configuration updates during the restart process. This is useful for applying system updates, changing configurations, or recovering from errors while preserving data and work state.
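
For example, a minimal Python sketch of a basic restart call (the API key and cluster ID are placeholders; the endpoint and parameters are documented below):
import requests

API_KEY = "YOUR_API_KEY"          # placeholder
CLUSTER_ID = "cluster_abc123"     # placeholder

# Initiate a graceful restart and ask the API to wait until the cluster is ready.
response = requests.post(
    f"https://api.tensorone.ai/v1/clusters/{CLUSTER_ID}/restart",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"wait_for_ready": True, "restart_reason": "Routine maintenance"},
)
print(response.json())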

Endpoint

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/restart

Path Parameters

Parameter     Type      Required   Description
cluster_id    string    Yes        Unique cluster identifier

Request Body

Parameter                Type      Required   Description
force                    boolean   No         Force restart without graceful shutdown (default: false)
grace_period_minutes     integer   No         Grace period for graceful shutdown (default: 5, max: 30)
wait_for_ready           boolean   No         Wait for cluster to be fully ready after restart (default: false)
timeout_minutes          integer   No         Maximum wait time for completion (default: 15, max: 60)
preserve_state           boolean   No         Preserve running processes and state (default: false)
update_configuration     object    No         Configuration updates to apply during restart
environment_updates      object    No         Environment variable updates
port_mapping_updates     array     No         Port mapping changes
restart_reason           string    No         Reason for restart (for audit logs)
restore_from_snapshot    string    No         Snapshot ID to restore from during restart
update_docker_image      string    No         New Docker image to use after restart
apply_system_updates     boolean   No         Apply pending system updates (default: false)

Configuration Updates

{
  "update_configuration": {
    "cpu_cores": 64,              // Update CPU allocation
    "memory_gb": 512,             // Update memory allocation  
    "storage_gb": 2000,           // Expand storage (cannot shrink)
    "gpu_count": 8,               // Update GPU count (if available)
    "region": "us-west-2"         // Change region (requires data migration)
  }
}
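
As a rough illustration, a small pre-flight check in Python can catch the constraints noted in the comments above before the request is sent; the helper below is a sketch, not part of the API:
def validate_config_update(current, update):
    """Illustrative pre-flight check for an update_configuration payload."""
    errors = []
    # Storage can only be expanded, never shrunk.
    if "storage_gb" in update and update["storage_gb"] < current.get("storage_gb", 0):
        errors.append("storage_gb cannot be reduced")
    # A region change triggers a data migration, so flag it for explicit confirmation.
    if "region" in update and update["region"] != current.get("region"):
        errors.append("region change requires data migration")
    return errors

# Example: expanding storage is fine, but the region change is flagged.
print(validate_config_update(
    {"storage_gb": 1000, "region": "us-east-1"},
    {"storage_gb": 2000, "region": "us-west-2"},
))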

Request Examples

# Basic cluster restart
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/restart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "timeout_minutes": 20,
    "restart_reason": "Apply system updates"
  }'

# Restart with configuration updates
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/restart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "update_configuration": {
      "memory_gb": 512,
      "cpu_cores": 64
    },
    "update_docker_image": "tensorone/pytorch:2.2-cuda12.1",
    "environment_updates": {
      "MODEL_VERSION": "v2.0",
      "BATCH_SIZE": "128"
    },
    "apply_system_updates": true,
    "restart_reason": "Scale up for large model training"
  }'

# Force restart with snapshot restore
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/restart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "force": true,
    "wait_for_ready": true,
    "restore_from_snapshot": "snap_xyz789",
    "port_mapping_updates": [
      {
        "internal_port": 8080,
        "external_port": 0,
        "protocol": "tcp",
        "description": "Web Interface"
      }
    ],
    "restart_reason": "Restore from backup after error"
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "status": "restarting",
    "restart_initiated_at": "2024-01-15T17:00:00Z",
    "estimated_completion_time": "2024-01-15T17:08:00Z",
    "restart_type": "graceful",
    "restart_progress": {
      "phase": "stopping_processes",
      "percentage": 25,
      "current_step": "Gracefully stopping running processes",
      "steps_completed": 2,
      "total_steps": 8
    },
    "configuration_updates": {
      "pending": [
        "memory_gb: 256 -> 512",
        "docker_image: pytorch:2.1 -> pytorch:2.2",
        "cpu_cores: 32 -> 64"
      ],
      "applied": []
    },
    "cost_impact": {
      "old_hourly_rate": 8.50,
      "new_hourly_rate": 12.50,
      "rate_change_reason": "Memory and CPU upgrade"
    }
  },
  "meta": {
    "request_id": "req_restart_789",
    "wait_for_ready": false
  }
}
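
The cost_impact block makes it easy to estimate the billing effect of a configuration change; a small sketch, assuming the field names shown above:
def monthly_cost_delta(cost_impact, hours_per_month=730):
    """Estimate the monthly cost change implied by a cost_impact block."""
    return (cost_impact["new_hourly_rate"] - cost_impact["old_hourly_rate"]) * hours_per_month

# Using the sample response above: (12.50 - 8.50) * 730 = 2920.0 per month
print(monthly_cost_delta({"old_hourly_rate": 8.50, "new_hourly_rate": 12.50}))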

Restart Progress Phases

Phase                     Description                                           Typical Duration
stopping_processes        Gracefully stopping running processes                 1-10 minutes
creating_checkpoint       Creating state checkpoint (if preserve_state=true)    30s-5 minutes
updating_configuration    Applying hardware/software updates                    1-3 minutes
starting_services         Starting system services and Docker containers        30s-2 minutes
health_checks             Running health checks and validation                  30s-1 minute
ready                     Cluster is fully operational                          -
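
A minimal polling sketch for tracking these phases, assuming the GET /v1/clusters/{cluster_id} endpoint returns the same restart_progress object shown in the response schema above:
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

def wait_for_restart(cluster_id, poll_seconds=15, timeout_seconds=1800):
    """Poll the cluster until restart_progress reports the 'ready' phase."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        cluster = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        ).json()
        progress = cluster.get("data", {}).get("restart_progress", {})
        phase = progress.get("phase", "unknown")
        print(f"{phase}: {progress.get('percentage', 0)}%")
        if phase == "ready":
            return cluster["data"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} did not become ready within {timeout_seconds}s")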

Use Cases

System Updates and Maintenance

Apply system updates and security patches with minimal downtime.
import requests

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def apply_system_updates(cluster_ids, maintenance_window):
    """Apply system updates during maintenance window"""
    
    results = []
    
    for cluster_id in cluster_ids:
        # Get current cluster configuration
        cluster_info = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        ).json()
        
        if not cluster_info["success"]:
            results.append({"cluster_id": cluster_id, "success": False, "error": "Failed to get cluster info"})
            continue
        
        cluster = cluster_info["data"]
        
        # Determine restart strategy based on cluster type
        restart_config = {
            "wait_for_ready": True,
            "timeout_minutes": 20,
            "apply_system_updates": True,
            "restart_reason": f"Maintenance window: {maintenance_window['id']}"
        }
        
        # Production clusters get more graceful treatment
        if "prod" in cluster["name"].lower():
            restart_config["grace_period_minutes"] = 15
            restart_config["preserve_state"] = True
        else:
            restart_config["grace_period_minutes"] = 5
        
        # Apply any pending Docker image updates
        if maintenance_window.get("docker_updates", {}).get(cluster_id):
            restart_config["update_docker_image"] = maintenance_window["docker_updates"][cluster_id]
        
        try:
            result = requests.post(
                f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=restart_config
            ).json()
            
            results.append({
                "cluster_id": cluster_id,
                "name": cluster["name"],
                "success": result["success"],
                "downtime_seconds": result.get("data", {}).get("downtime_seconds"),
                "updates_applied": result.get("data", {}).get("configuration_changes", [])
            })
            
        except Exception as e:
            results.append({
                "cluster_id": cluster_id,
                "success": False,
                "error": str(e)
            })
    
    return {
        "maintenance_window": maintenance_window["id"],
        "total_clusters": len(cluster_ids),
        "successful_updates": len([r for r in results if r["success"]]),
        "results": results
    }
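
A hypothetical invocation; the maintenance_window structure mirrors the keys the function reads (id and docker_updates) and is not a fixed API shape:
maintenance_window = {
    "id": "mw_2024_01",                                      # window identifier for audit logs
    "docker_updates": {
        "cluster_abc123": "tensorone/pytorch:2.2-cuda12.1",  # per-cluster image updates
    },
}

summary = apply_system_updates(["cluster_abc123", "cluster_def456"], maintenance_window)
print(f"{summary['successful_updates']}/{summary['total_clusters']} clusters updated")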

Model Version Deployment

Deploy new model versions with configuration updates.
async function deployModelVersion(clusterId, modelDeployment) {
  const deploymentConfig = {
    wait_for_ready: true,
    timeout_minutes: 30,
    update_docker_image: modelDeployment.dockerImage,
    environment_updates: {
      MODEL_VERSION: modelDeployment.version,
      MODEL_PATH: modelDeployment.modelPath,
      CONFIG_PATH: modelDeployment.configPath,
      DEPLOYMENT_ID: modelDeployment.deploymentId
    },
    restart_reason: `Model deployment: ${modelDeployment.name} v${modelDeployment.version}`
  };
  
  // Add hardware scaling if specified
  if (modelDeployment.scaling) {
    deploymentConfig.update_configuration = {
      gpu_count: modelDeployment.scaling.gpuCount,
      memory_gb: modelDeployment.scaling.memoryGb,
      cpu_cores: modelDeployment.scaling.cpuCores
    };
  }
  
  // Add port mappings for new services
  if (modelDeployment.services) {
    deploymentConfig.port_mapping_updates = modelDeployment.services.map(service => ({
      internal_port: service.port,
      external_port: 0, // Auto-assign
      protocol: 'tcp',
      description: service.description
    }));
  }
  
  try {
    console.log(`Deploying ${modelDeployment.name} v${modelDeployment.version} to cluster ${clusterId}`);
    
    const result = await restartWithUpdates(clusterId, deploymentConfig);
    
    // Validate deployment
    const healthCheck = await validateModelDeployment(clusterId, modelDeployment);
    
    if (!healthCheck.success) {
      throw new Error(`Model deployment validation failed: ${healthCheck.error}`);
    }
    
    return {
      success: true,
      clusterId: clusterId,
      deploymentId: modelDeployment.deploymentId,
      modelVersion: modelDeployment.version,
      proxyUrl: result.proxyUrl,
      services: result.portMappings || [],
      deploymentTime: result.restartDuration,
      healthCheck: healthCheck
    };
    
  } catch (error) {
    console.error(`Model deployment failed for cluster ${clusterId}:`, error);
    
    // Attempt rollback if previous version info is available
    if (modelDeployment.previousVersion) {
      console.log('Attempting rollback to previous version...');
      try {
        await deployModelVersion(clusterId, {
          ...modelDeployment,
          version: modelDeployment.previousVersion.version,
          dockerImage: modelDeployment.previousVersion.dockerImage,
          deploymentId: `rollback_${modelDeployment.deploymentId}`
        });
      } catch (rollbackError) {
        console.error('Rollback failed:', rollbackError);
      }
    }
    
    throw error;
  }
}

async function validateModelDeployment(clusterId, modelDeployment) {
  // Wait for services to be ready
  await new Promise(resolve => setTimeout(resolve, 30000));
  
  try {
    // Get cluster info to check services
    const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}`, {
      headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
    });
    
    const cluster = await response.json();
    
    if (!cluster.success) {
      return { success: false, error: 'Failed to get cluster info' };
    }
    
    // Check if all expected services are running
    const runningServices = cluster.data.network.port_mappings || [];
    const expectedServices = modelDeployment.services || [];
    
    for (const expectedService of expectedServices) {
      const runningService = runningServices.find(s => s.internal_port === expectedService.port);
      
      if (!runningService || runningService.status !== 'active') {
        return { 
          success: false, 
          error: `Service ${expectedService.description} is not active` 
        };
      }
      
      // Test service endpoint if health check URL provided
      if (expectedService.healthCheckPath) {
        try {
          const serviceUrl = `${runningService.url}${expectedService.healthCheckPath}`;
          const healthResponse = await fetch(serviceUrl, { signal: AbortSignal.timeout(10000) });
          
          if (!healthResponse.ok) {
            return {
              success: false,
              error: `Service ${expectedService.description} health check failed`
            };
          }
        } catch (error) {
          return {
            success: false,
            error: `Service ${expectedService.description} is not responding`
          };
        }
      }
    }
    
    return {
      success: true,
      servicesValidated: expectedServices.length,
      allServicesHealthy: true
    };
    
  } catch (error) {
    return {
      success: false,
      error: `Validation failed: ${error.message}`
    };
  }
}

Recovery from Errors

Restart clusters to recover from errors with optional state restoration.
import requests
from datetime import datetime

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def recover_cluster_from_error(cluster_id, recovery_strategy="restart"):
    """Recover cluster from error state using specified strategy"""
    
    # Get current cluster state
    cluster_info = requests.get(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {API_KEY}"}
    ).json()
    
    if not cluster_info["success"]:
        return {"success": False, "error": "Failed to get cluster info"}
    
    cluster = cluster_info["data"]
    
    if cluster["status"] != "error":
        return {"success": False, "error": f"Cluster is not in error state (current: {cluster['status']})"}
    
    recovery_actions = []
    
    if recovery_strategy == "restart":
        # Simple restart with force
        restart_config = {
            "force": True,
            "wait_for_ready": True,
            "apply_system_updates": True,
            "restart_reason": "Error recovery - forced restart"
        }
        
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=restart_config
        ).json()
        
        recovery_actions.append("Forced restart")
        
    elif recovery_strategy == "restore_snapshot":
        # Find latest snapshot and restore
        snapshots = cluster.get("storage", {}).get("snapshots", [])
        
        if not snapshots:
            return {"success": False, "error": "No snapshots available for restore"}
        
        latest_snapshot = max(snapshots, key=lambda s: s["created_at"])
        
        restart_config = {
            "force": True,
            "wait_for_ready": True,
            "restore_from_snapshot": latest_snapshot["id"],
            "restart_reason": f"Error recovery - restore from snapshot {latest_snapshot['id']}"
        }
        
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=restart_config
        ).json()
        
        recovery_actions.append(f"Restored from snapshot: {latest_snapshot['name']}")
        
    elif recovery_strategy == "rebuild":
        # Restart with fresh image and reset configuration
        original_template = cluster.get("template_info", {})
        
        restart_config = {
            "force": True,
            "wait_for_ready": True,
            "update_docker_image": original_template.get("docker_image"),
            "environment_updates": {
                "RECOVERY_MODE": "true",
                "RECOVERY_TIMESTAMP": datetime.now().isoformat()
            },
            "restart_reason": "Error recovery - rebuild from template"
        }
        
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=restart_config
        ).json()
        
        recovery_actions.append("Rebuilt from original template")
    
    else:
        return {"success": False, "error": f"Unknown recovery strategy: {recovery_strategy}"}

    if result["success"]:
        return {
            "success": True,
            "cluster_id": cluster_id,
            "recovery_strategy": recovery_strategy,
            "recovery_actions": recovery_actions,
            "restart_duration": result.get("data", {}).get("restart_duration_seconds"),
            "new_status": result.get("data", {}).get("status"),
            "recovery_completed_at": result.get("data", {}).get("restart_completed_at")
        }
    
    return result
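
A hypothetical usage of the recovery helper, trying a plain restart first and falling back to the latest snapshot if that fails:
recovery = recover_cluster_from_error("cluster_abc123", recovery_strategy="restart")

if not recovery["success"]:
    recovery = recover_cluster_from_error("cluster_abc123", recovery_strategy="restore_snapshot")

print(recovery)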

Error Handling

{
  "success": false,
  "error": {
    "code": "INVALID_STATE",
    "message": "Cluster cannot be restarted in current state",
    "details": {
      "current_status": "starting",
      "allowed_states": ["running", "stopped", "error"],
      "suggestion": "Wait for cluster to finish current operation"
    }
  }
}
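
One way to handle this error is to wait for the in-flight operation to finish and retry; the sketch below assumes the error shape shown above and uses an illustrative retry policy:
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

def restart_with_retry(cluster_id, payload, max_attempts=5, wait_seconds=60):
    """Retry the restart while the cluster is mid-operation (INVALID_STATE)."""
    for attempt in range(max_attempts):
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
        ).json()
        if result.get("success"):
            return result
        if result.get("error", {}).get("code") != "INVALID_STATE":
            raise RuntimeError(result["error"]["message"])
        # The cluster is still starting/stopping; give the current operation time to finish.
        time.sleep(wait_seconds)
    raise TimeoutError(f"Cluster {cluster_id} never reached a restartable state")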

Security Considerations

  • State Preservation: Be cautious when preserving state during security updates
  • Configuration Validation: Validate all configuration changes before restart
  • Access Control: Ensure proper permissions for configuration modifications
  • Audit Logging: Log restart reasons and configuration changes for compliance

Best Practices

  1. Graceful Restarts: Use graceful restarts unless emergency recovery is needed
  2. Configuration Testing: Test configuration changes in development first
  3. State Management: Use snapshots for critical workloads before major changes
  4. Monitoring: Monitor restart progress and validate successful completion
  5. Rollback Planning: Have rollback procedures ready for failed deployments
  6. Resource Planning: Consider resource availability during restart operations

Authorizations

Authorization
string
header
required

API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id
string
required

Response

200 - application/json

Cluster restart initiated

The response is of type object.