Restart Cluster
curl --request POST \
  --url https://api.tensorone.ai/v1/clusters/{cluster_id}/restart \
  --header 'Authorization: Bearer <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Restart Cluster endpoint allows you to restart running or stopped GPU clusters, optionally applying configuration updates during the restart process. This is useful for applying system updates, changing configurations, or recovering from errors while preserving data and work state.
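
For example, a minimal Python sketch of a basic restart call (the API key and cluster ID are placeholders; the endpoint and parameters are documented below):
import requests

API_KEY = "YOUR_API_KEY"          # placeholder
CLUSTER_ID = "cluster_abc123"     # placeholder

# Initiate a graceful restart and ask the API to wait until the cluster is ready.
response = requests.post(
    f"https://api.tensorone.ai/v1/clusters/{CLUSTER_ID}/restart",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"wait_for_ready": True, "restart_reason": "Routine maintenance"},
)
print(response.json())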

Endpoint

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/restart

Path Parameters

Parameter     Type      Required   Description
cluster_id    string    Yes        Unique cluster identifier

Request Body

Parameter                Type      Required   Description
force                    boolean   No         Force restart without graceful shutdown (default: false)
grace_period_minutes     integer   No         Grace period for graceful shutdown (default: 5, max: 30)
wait_for_ready           boolean   No         Wait for cluster to be fully ready after restart (default: false)
timeout_minutes          integer   No         Maximum wait time for completion (default: 15, max: 60)
preserve_state           boolean   No         Preserve running processes and state (default: false)
update_configuration     object    No         Configuration updates to apply during restart
environment_updates      object    No         Environment variable updates
port_mapping_updates     array     No         Port mapping changes
restart_reason           string    No         Reason for restart (for audit logs)
restore_from_snapshot    string    No         Snapshot ID to restore from during restart
update_docker_image      string    No         New Docker image to use after restart
apply_system_updates     boolean   No         Apply pending system updates (default: false)

Configuration Updates

{
  "update_configuration": {
    "cpu_cores": 64,              // Update CPU allocation
    "memory_gb": 512,             // Update memory allocation  
    "storage_gb": 2000,           // Expand storage (cannot shrink)
    "gpu_count": 8,               // Update GPU count (if available)
    "region": "us-west-2"         // Change region (requires data migration)
  }
}
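
As a rough illustration, a small pre-flight check in Python can catch the constraints noted in the comments above before the request is sent; the helper below is a sketch, not part of the API:
def validate_config_update(current, update):
    """Illustrative pre-flight check for an update_configuration payload."""
    errors = []
    # Storage can only be expanded, never shrunk.
    if "storage_gb" in update and update["storage_gb"] < current.get("storage_gb", 0):
        errors.append("storage_gb cannot be reduced")
    # A region change triggers a data migration, so flag it for explicit confirmation.
    if "region" in update and update["region"] != current.get("region"):
        errors.append("region change requires data migration")
    return errors

# Example: expanding storage is fine, but the region change is flagged.
print(validate_config_update(
    {"storage_gb": 1000, "region": "us-east-1"},
    {"storage_gb": 2000, "region": "us-west-2"},
))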

Request Examples

# Basic cluster restart
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/restart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "timeout_minutes": 20,
    "restart_reason": "Apply system updates"
  }'

# Restart with configuration updates
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/restart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "update_configuration": {
      "memory_gb": 512,
      "cpu_cores": 64
    },
    "update_docker_image": "tensorone/pytorch:2.2-cuda12.1",
    "environment_updates": {
      "MODEL_VERSION": "v2.0",
      "BATCH_SIZE": "128"
    },
    "apply_system_updates": true,
    "restart_reason": "Scale up for large model training"
  }'

# Force restart with snapshot restore
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/restart" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "force": true,
    "wait_for_ready": true,
    "restore_from_snapshot": "snap_xyz789",
    "port_mapping_updates": [
      {
        "internal_port": 8080,
        "external_port": 0,
        "protocol": "tcp",
        "description": "Web Interface"
      }
    ],
    "restart_reason": "Restore from backup after error"
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "status": "restarting",
    "restart_initiated_at": "2024-01-15T17:00:00Z",
    "estimated_completion_time": "2024-01-15T17:08:00Z",
    "restart_type": "graceful",
    "restart_progress": {
      "phase": "stopping_processes",
      "percentage": 25,
      "current_step": "Gracefully stopping running processes",
      "steps_completed": 2,
      "total_steps": 8
    },
    "configuration_updates": {
      "pending": [
        "memory_gb: 256 -> 512",
        "docker_image: pytorch:2.1 -> pytorch:2.2",
        "cpu_cores: 32 -> 64"
      ],
      "applied": []
    },
    "cost_impact": {
      "old_hourly_rate": 8.50,
      "new_hourly_rate": 12.50,
      "rate_change_reason": "Memory and CPU upgrade"
    }
  },
  "meta": {
    "request_id": "req_restart_789",
    "wait_for_ready": false
  }
}
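
The cost_impact block makes it easy to estimate the billing effect of a configuration change; a small sketch, assuming the field names shown above:
def monthly_cost_delta(cost_impact, hours_per_month=730):
    """Estimate the monthly cost change implied by a cost_impact block."""
    return (cost_impact["new_hourly_rate"] - cost_impact["old_hourly_rate"]) * hours_per_month

# Using the sample response above: (12.50 - 8.50) * 730 = 2920.0 per month
print(monthly_cost_delta({"old_hourly_rate": 8.50, "new_hourly_rate": 12.50}))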

Restart Progress Phases

Phase                     Description                                           Typical Duration
stopping_processes        Gracefully stopping running processes                 1-10 minutes
creating_checkpoint       Creating state checkpoint (if preserve_state=true)    30s-5 minutes
updating_configuration    Applying hardware/software updates                    1-3 minutes
starting_services         Starting system services and Docker containers        30s-2 minutes
health_checks             Running health checks and validation                  30s-1 minute
ready                     Cluster is fully operational                          -
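
A minimal polling sketch for tracking these phases, assuming the GET /v1/clusters/{cluster_id} endpoint returns the same restart_progress object shown in the response schema above:
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

def wait_for_restart(cluster_id, poll_seconds=15, timeout_seconds=1800):
    """Poll the cluster until restart_progress reports the 'ready' phase."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        cluster = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        ).json()
        progress = cluster.get("data", {}).get("restart_progress", {})
        phase = progress.get("phase", "unknown")
        print(f"{phase}: {progress.get('percentage', 0)}%")
        if phase == "ready":
            return cluster["data"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} did not become ready within {timeout_seconds}s")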

Use Cases

System Updates and Maintenance

Apply system updates and security patches with minimal downtime.
import requests

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def apply_system_updates(cluster_ids, maintenance_window):
    """Apply system updates during maintenance window"""
    
    results = []
    
    for cluster_id in cluster_ids:
        # Get current cluster configuration
        cluster_info = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        ).json()
        
        if not cluster_info["success"]:
            results.append({"cluster_id": cluster_id, "success": False, "error": "Failed to get cluster info"})
            continue
        
        cluster = cluster_info["data"]
        
        # Determine restart strategy based on cluster type
        restart_config = {
            "wait_for_ready": True,
            "timeout_minutes": 20,
            "apply_system_updates": True,
            "restart_reason": f"Maintenance window: {maintenance_window['id']}"
        }
        
        # Production clusters get more graceful treatment
        if "prod" in cluster["name"].lower():
            restart_config["grace_period_minutes"] = 15
            restart_config["preserve_state"] = True
        else:
            restart_config["grace_period_minutes"] = 5
        
        # Apply any pending Docker image updates
        if maintenance_window.get("docker_updates", {}).get(cluster_id):
            restart_config["update_docker_image"] = maintenance_window["docker_updates"][cluster_id]
        
        try:
            result = requests.post(
                f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=restart_config
            ).json()
            
            results.append({
                "cluster_id": cluster_id,
                "name": cluster["name"],
                "success": result["success"],
                "downtime_seconds": result.get("data", {}).get("downtime_seconds"),
                "updates_applied": result.get("data", {}).get("configuration_changes", [])
            })
            
        except Exception as e:
            results.append({
                "cluster_id": cluster_id,
                "success": False,
                "error": str(e)
            })
    
    return {
        "maintenance_window": maintenance_window["id"],
        "total_clusters": len(cluster_ids),
        "successful_updates": len([r for r in results if r["success"]]),
        "results": results
    }
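
A hypothetical invocation; the maintenance_window structure mirrors the keys the function reads (id and docker_updates) and is not a fixed API shape:
maintenance_window = {
    "id": "mw_2024_01",                                      # window identifier for audit logs
    "docker_updates": {
        "cluster_abc123": "tensorone/pytorch:2.2-cuda12.1",  # per-cluster image updates
    },
}

summary = apply_system_updates(["cluster_abc123", "cluster_def456"], maintenance_window)
print(f"{summary['successful_updates']}/{summary['total_clusters']} clusters updated")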

Model Version Deployment

Deploy new model versions with configuration updates.
async function deployModelVersion(clusterId, modelDeployment) {
  const deploymentConfig = {
    wait_for_ready: true,
    timeout_minutes: 30,
    update_docker_image: modelDeployment.dockerImage,
    environment_updates: {
      MODEL_VERSION: modelDeployment.version,
      MODEL_PATH: modelDeployment.modelPath,
      CONFIG_PATH: modelDeployment.configPath,
      DEPLOYMENT_ID: modelDeployment.deploymentId
    },
    restart_reason: `Model deployment: ${modelDeployment.name} v${modelDeployment.version}`
  };
  
  // Add hardware scaling if specified
  if (modelDeployment.scaling) {
    deploymentConfig.update_configuration = {
      gpu_count: modelDeployment.scaling.gpuCount,
      memory_gb: modelDeployment.scaling.memoryGb,
      cpu_cores: modelDeployment.scaling.cpuCores
    };
  }
  
  // Add port mappings for new services
  if (modelDeployment.services) {
    deploymentConfig.port_mapping_updates = modelDeployment.services.map(service => ({
      internal_port: service.port,
      external_port: 0, // Auto-assign
      protocol: 'tcp',
      description: service.description
    }));
  }
  
  try {
    console.log(`Deploying ${modelDeployment.name} v${modelDeployment.version} to cluster ${clusterId}`);
    
    const result = await restartWithUpdates(clusterId, deploymentConfig);
    
    // Validate deployment
    const healthCheck = await validateModelDeployment(clusterId, modelDeployment);
    
    if (!healthCheck.success) {
      throw new Error(`Model deployment validation failed: ${healthCheck.error}`);
    }
    
    return {
      success: true,
      clusterId: clusterId,
      deploymentId: modelDeployment.deploymentId,
      modelVersion: modelDeployment.version,
      proxyUrl: result.proxyUrl,
      services: result.portMappings || [],
      deploymentTime: result.restartDuration,
      healthCheck: healthCheck
    };
    
  } catch (error) {
    console.error(`Model deployment failed for cluster ${clusterId}:`, error);
    
    // Attempt rollback if previous version info is available
    if (modelDeployment.previousVersion) {
      console.log('Attempting rollback to previous version...');
      try {
        await deployModelVersion(clusterId, {
          ...modelDeployment,
          version: modelDeployment.previousVersion.version,
          dockerImage: modelDeployment.previousVersion.dockerImage,
          deploymentId: `rollback_${modelDeployment.deploymentId}`
        });
      } catch (rollbackError) {
        console.error('Rollback failed:', rollbackError);
      }
    }
    
    throw error;
  }
}

async function validateModelDeployment(clusterId, modelDeployment) {
  // Wait for services to be ready
  await new Promise(resolve => setTimeout(resolve, 30000));
  
  try {
    // Get cluster info to check services
    const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}`, {
      headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
    });
    
    const cluster = await response.json();
    
    if (!cluster.success) {
      return { success: false, error: 'Failed to get cluster info' };
    }
    
    // Check if all expected services are running
    const runningServices = cluster.data.network.port_mappings || [];
    const expectedServices = modelDeployment.services || [];
    
    for (const expectedService of expectedServices) {
      const runningService = runningServices.find(s => s.internal_port === expectedService.port);
      
      if (!runningService || runningService.status !== 'active') {
        return { 
          success: false, 
          error: `Service ${expectedService.description} is not active` 
        };
      }
      
      // Test service endpoint if health check URL provided
      if (expectedService.healthCheckPath) {
        try {
          const serviceUrl = `${runningService.url}${expectedService.healthCheckPath}`;
          const healthResponse = await fetch(serviceUrl, { signal: AbortSignal.timeout(10000) });
          
          if (!healthResponse.ok) {
            return {
              success: false,
              error: `Service ${expectedService.description} health check failed`
            };
          }
        } catch (error) {
          return {
            success: false,
            error: `Service ${expectedService.description} is not responding`
          };
        }
      }
    }
    
    return {
      success: true,
      servicesValidated: expectedServices.length,
      allServicesHealthy: true
    };
    
  } catch (error) {
    return {
      success: false,
      error: `Validation failed: ${error.message}`
    };
  }
}

Recovery from Errors

Restart clusters to recover from errors with optional state restoration.
import requests
from datetime import datetime

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def recover_cluster_from_error(cluster_id, recovery_strategy="restart"):
    """Recover cluster from error state using specified strategy"""
    
    # Get current cluster state
    cluster_info = requests.get(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {API_KEY}"}
    ).json()
    
    if not cluster_info["success"]:
        return {"success": False, "error": "Failed to get cluster info"}
    
    cluster = cluster_info["data"]
    
    if cluster["status"] != "error":
        return {"success": False, "error": f"Cluster is not in error state (current: {cluster['status']})"}
    
    recovery_actions = []
    
    if recovery_strategy == "restart":
        # Simple restart with force
        restart_config = {
            "force": True,
            "wait_for_ready": True,
            "apply_system_updates": True,
            "restart_reason": "Error recovery - forced restart"
        }
        
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=restart_config
        ).json()
        
        recovery_actions.append("Forced restart")
        
    elif recovery_strategy == "restore_snapshot":
        # Find latest snapshot and restore
        snapshots = cluster.get("storage", {}).get("snapshots", [])
        
        if not snapshots:
            return {"success": False, "error": "No snapshots available for restore"}
        
        latest_snapshot = max(snapshots, key=lambda s: s["created_at"])
        
        restart_config = {
            "force": True,
            "wait_for_ready": True,
            "restore_from_snapshot": latest_snapshot["id"],
            "restart_reason": f"Error recovery - restore from snapshot {latest_snapshot['id']}"
        }
        
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=restart_config
        ).json()
        
        recovery_actions.append(f"Restored from snapshot: {latest_snapshot['name']}")
        
    elif recovery_strategy == "rebuild":
        # Restart with fresh image and reset configuration
        original_template = cluster.get("template_info", {})
        
        restart_config = {
            "force": True,
            "wait_for_ready": True,
            "update_docker_image": original_template.get("docker_image"),
            "environment_updates": {
                "RECOVERY_MODE": "true",
                "RECOVERY_TIMESTAMP": datetime.now().isoformat()
            },
            "restart_reason": "Error recovery - rebuild from template"
        }
        
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=restart_config
        ).json()
        
        recovery_actions.append("Rebuilt from original template")
    
    else:
        return {"success": False, "error": f"Unknown recovery strategy: {recovery_strategy}"}

    if result["success"]:
        return {
            "success": True,
            "cluster_id": cluster_id,
            "recovery_strategy": recovery_strategy,
            "recovery_actions": recovery_actions,
            "restart_duration": result.get("data", {}).get("restart_duration_seconds"),
            "new_status": result.get("data", {}).get("status"),
            "recovery_completed_at": result.get("data", {}).get("restart_completed_at")
        }
    
    return result
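
A hypothetical usage of the recovery helper, trying a plain restart first and falling back to the latest snapshot if that fails:
recovery = recover_cluster_from_error("cluster_abc123", recovery_strategy="restart")

if not recovery["success"]:
    recovery = recover_cluster_from_error("cluster_abc123", recovery_strategy="restore_snapshot")

print(recovery)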

Error Handling

{
  "success": false,
  "error": {
    "code": "INVALID_STATE",
    "message": "Cluster cannot be restarted in current state",
    "details": {
      "current_status": "starting",
      "allowed_states": ["running", "stopped", "error"],
      "suggestion": "Wait for cluster to finish current operation"
    }
  }
}
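
One way to handle this error is to wait for the in-flight operation to finish and retry; the sketch below assumes the error shape shown above and uses an illustrative retry policy:
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

def restart_with_retry(cluster_id, payload, max_attempts=5, wait_seconds=60):
    """Retry the restart while the cluster is mid-operation (INVALID_STATE)."""
    for attempt in range(max_attempts):
        result = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/restart",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
        ).json()
        if result.get("success"):
            return result
        if result.get("error", {}).get("code") != "INVALID_STATE":
            raise RuntimeError(result["error"]["message"])
        # The cluster is still starting/stopping; give the current operation time to finish.
        time.sleep(wait_seconds)
    raise TimeoutError(f"Cluster {cluster_id} never reached a restartable state")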

Security Considerations

  • State Preservation: Be cautious when preserving state during security updates
  • Configuration Validation: Validate all configuration changes before restart
  • Access Control: Ensure proper permissions for configuration modifications
  • Audit Logging: Log restart reasons and configuration changes for compliance

Best Practices

  1. Graceful Restarts: Use graceful restarts unless emergency recovery is needed
  2. Configuration Testing: Test configuration changes in development first
  3. State Management: Use snapshots for critical workloads before major changes
  4. Monitoring: Monitor restart progress and validate successful completion
  5. Rollback Planning: Have rollback procedures ready for failed deployments
  6. Resource Planning: Consider resource availability during restart operations

Authorizations

Authorization
string
header
required

API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id
string
required

Response

200 - application/json

Cluster restart initiated

The response is of type object.