Start Cluster
curl --request POST \
  --url https://api.tensorone.ai/v2/clusters/{cluster_id}/start \
  --header 'Authorization: <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Start Cluster endpoint starts a previously stopped GPU cluster and can optionally apply configuration updates during startup. This enables cost-effective cluster management: stop clusters when they are not in use and quickly resume work when they are needed.

Endpoint

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/start

Path Parameters

Parameter    Type    Required  Description
cluster_id   string  Yes       Unique cluster identifier

Request Body

Parameter               Type     Required  Description
wait_for_ready          boolean  No        Wait for the cluster to be fully ready before returning (default: false)
timeout_minutes         integer  No        Maximum wait time in minutes (default: 10, max: 30)
update_configuration    object   No        Configuration updates to apply during startup
restore_from_snapshot   string   No        Snapshot ID to restore from
environment_updates     object   No        Environment variable updates
port_mapping_updates    array    No        Port mapping changes
auto_terminate_updates  object   No        Auto-termination setting updates

Configuration Updates

{
  "update_configuration": {
    "cpu_cores": 64,              // Update CPU allocation
    "memory_gb": 512,             // Update memory allocation
    "storage_gb": 2000,           // Expand storage (cannot shrink)
    "docker_image": "new_image",  // Update Docker image
    "gpu_count": 8                // Update GPU count (if available)
  }
}
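
Because storage can only grow, it can help to validate updates client-side before sending the request. A minimal sketch (the helper name and validation rule shown here are illustrative, not part of the API):

def build_update_configuration(current_storage_gb, **changes):
    """Assemble an update_configuration payload, rejecting a storage shrink."""
    new_storage = changes.get("storage_gb")
    if new_storage is not None and new_storage < current_storage_gb:
        raise ValueError("storage_gb can only be expanded, not shrunk")
    return {"update_configuration": changes}

# Example: grow storage and memory on a cluster that currently has 1000 GB
payload = build_update_configuration(1000, memory_gb=512, storage_gb=2000)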

Request Examples

# Basic cluster start
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/start" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "timeout_minutes": 15
  }'

# Start with configuration updates
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/start" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "update_configuration": {
      "memory_gb": 512,
      "docker_image": "pytorch/pytorch:2.2-cuda12.1-devel"
    },
    "environment_updates": {
      "BATCH_SIZE": "64",
      "LEARNING_RATE": "0.001"
    },
    "auto_terminate_updates": {
      "enabled": true,
      "idle_minutes": 30
    }
  }'

# Start and restore from snapshot
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/start" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": false,
    "restore_from_snapshot": "snap_xyz789",
    "port_mapping_updates": [
      {
        "internal_port": 8888,
        "external_port": 0,
        "protocol": "tcp",
        "description": "Jupyter Lab"
      }
    ]
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "status": "starting",
    "start_initiated_at": "2024-01-15T16:00:00Z",
    "estimated_ready_at": "2024-01-15T16:05:00Z",
    "configuration_updates_applied": [
      "memory_gb: 256 -> 512",
      "docker_image: pytorch:2.1 -> pytorch:2.2"
    ],
    "startup_progress": {
      "phase": "initializing",
      "percentage": 15,
      "current_step": "Allocating resources",
      "steps_completed": 2,
      "total_steps": 8
    },
    "cost": {
      "new_hourly_rate": 12.50,
      "previous_hourly_rate": 8.50,
      "rate_change_reason": "Memory upgrade"
    }
  },
  "meta": {
    "request_id": "req_start_123",
    "wait_for_ready": false
  }
}
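
The fields under data can drive simple client-side checks. For example, a sketch that surfaces applied updates and any pricing change (it assumes body is a parsed response shaped like the example above; optional fields are guarded):

def summarize_start_response(body):
    """Print applied configuration updates and any hourly-rate change."""
    if not body.get("success"):
        raise RuntimeError(body.get("error", {}).get("message", "start failed"))

    data = body["data"]
    for change in data.get("configuration_updates_applied", []):
        print(f"Applied: {change}")

    cost = data.get("cost", {})
    old_rate = cost.get("previous_hourly_rate")
    new_rate = cost.get("new_hourly_rate")
    if old_rate is not None and new_rate is not None and new_rate != old_rate:
        reason = cost.get("rate_change_reason", "unspecified")
        print(f"Hourly rate changed: ${old_rate}/hr -> ${new_rate}/hr ({reason})")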

Startup Progress Tracking

When wait_for_ready is false, you can poll the cluster status to track startup progress:
import time

import requests

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def track_startup_progress(cluster_id):
    """Track cluster startup progress with real-time updates"""

    while True:
        response = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )

        cluster = response.json()["data"]
        status = cluster["status"]

        if status == "running":
            print(f"✅ Cluster {cluster_id} is ready!")
            print(f"Access URL: {cluster['network']['proxy_url']}")
            break
        elif status == "starting":
            # Show detailed progress when the API returns it
            if "startup_progress" in cluster:
                progress = cluster["startup_progress"]
                print(f"🔄 {progress['current_step']} ({progress['percentage']}%)")
            else:
                print(f"🔄 Starting cluster... (Status: {status})")
        elif status == "error":
            print(f"❌ Cluster startup failed: {cluster.get('error_message', 'Unknown error')}")
            break

        time.sleep(10)  # Check every 10 seconds
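
A typical pattern is to issue the start request without waiting and then poll with the function above (this sketch reuses the requests import and API_KEY variable defined there; the cluster ID is illustrative):

start_response = requests.post(
    "https://api.tensorone.ai/v1/clusters/cluster_abc123/start",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"wait_for_ready": False},
)

if start_response.json().get("success"):
    track_startup_progress("cluster_abc123")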

Use Cases

Development Workflow

Start development clusters when team members begin work.
import requests  # API_KEY is assumed to be defined as in the earlier examples

def start_dev_environment(user_id, project_id):
    """Start a development environment for a user"""
    
    # Find user's development cluster
    clusters_response = requests.get(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "project_id": project_id,
            "search": f"dev-{user_id}",
            "status": "stopped"
        }
    )
    
    clusters = clusters_response.json()["data"]["clusters"]
    
    if not clusters:
        return {"error": "No development cluster found for user"}
    
    cluster_id = clusters[0]["id"]
    
    # Start with updated environment
    start_config = {
        "wait_for_ready": True,
        "timeout_minutes": 10,
        "environment_updates": {
            "USER_ID": user_id,
            "PROJECT_ID": project_id,
            "WORKSPACE": f"/workspace/{user_id}"
        },
        "auto_terminate_updates": {
            "enabled": True,
            "idle_minutes": 120  # 2 hours for dev work
        }
    }
    
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/start",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=start_config
    )
    
    return response.json()
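
For example, a sign-on hook could call this helper directly (the user and project IDs are illustrative):

result = start_dev_environment("alice", "proj_456")

if "error" in result:
    print(result["error"])
else:
    print(f"Dev cluster {result['data']['id']} is {result['data']['status']}")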

Training Job Resumption

Resume training jobs from checkpoints with updated configurations.
// Assumes API_KEY is defined in scope (e.g. loaded from an environment variable)
async function resumeTrainingJob(jobId, checkpointPath, newConfig = {}) {
  // Find training cluster for this job
  const clustersResponse = await fetch(`https://api.tensorone.ai/v1/clusters?search=training-${jobId}&status=stopped`, {
    headers: { 'Authorization': 'Bearer ' + API_KEY }
  });
  
  const clusters = await clustersResponse.json();
  
  if (!clusters.success || clusters.data.clusters.length === 0) {
    throw new Error(`No training cluster found for job ${jobId}`);
  }
  
  const clusterId = clusters.data.clusters[0].id;
  
  const startConfig = {
    wait_for_ready: true,
    timeout_minutes: 15,
    environment_updates: {
      RESUME_FROM_CHECKPOINT: checkpointPath,
      JOB_ID: jobId,
      ...newConfig.environment
    },
    update_configuration: {
      ...newConfig.hardware
    },
    auto_terminate_updates: {
      enabled: true,
      cost_limit_usd: newConfig.costLimit || 500.0,
      idle_minutes: 30
    }
  };
  
  const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}/start`, {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(startConfig)
  });
  
  const result = await response.json();
  
  if (result.success) {
    return {
      clusterId: result.data.id,
      jobId: jobId,
      accessUrl: result.data.network.proxy_url,
      sshCommand: result.data.network.ssh_connection.connection_string,
      estimatedCost: result.data.cost.hourly_rate
    };
  }
  
  throw new Error(`Failed to resume training job: ${result.error.message}`);
}

Scheduled Cluster Activation

Start clusters on a schedule for batch processing jobs.
import threading
import time
from datetime import datetime

import requests
import schedule  # third-party scheduler: pip install schedule

# API_KEY is assumed to be defined as in the earlier examples

def scheduled_cluster_start(cluster_configs, schedule_time=None):
    """Start multiple clusters for scheduled batch jobs"""

    def start_batch_clusters():
        results = []
        
        for config in cluster_configs:
            cluster_id = config["cluster_id"]
            updates = config.get("updates", {})
            
            start_payload = {
                "wait_for_ready": False,
                "update_configuration": updates.get("hardware", {}),
                "environment_updates": {
                    "BATCH_JOB_ID": config.get("job_id"),
                    "START_TIME": datetime.now().isoformat(),
                    **updates.get("environment", {})
                },
                "auto_terminate_updates": {
                    "enabled": True,
                    "max_runtime_hours": config.get("max_runtime_hours", 8),
                    "cost_limit_usd": config.get("cost_limit", 100.0)
                }
            }
            
            response = requests.post(
                f"https://api.tensorone.ai/v1/clusters/{cluster_id}/start",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=start_payload
            )
            
            results.append({
                "cluster_id": cluster_id,
                "job_id": config.get("job_id"),
                "success": response.json().get("success", False),
                "response": response.json()
            })
        
        return results
    
    if schedule_time:
        schedule.every().day.at(schedule_time).do(start_batch_clusters)
        
        def run_scheduler():
            while True:
                schedule.run_pending()
                time.sleep(60)
        
        scheduler_thread = threading.Thread(target=run_scheduler)
        scheduler_thread.daemon = True
        scheduler_thread.start()
    else:
        return start_batch_clusters()
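
A call might look like the following (cluster IDs, job IDs, and the 02:00 schedule are illustrative):

batch_configs = [
    {"cluster_id": "cluster_abc123", "job_id": "nightly-etl", "max_runtime_hours": 4},
    {"cluster_id": "cluster_def456", "job_id": "report-gen", "cost_limit": 50.0},
]

# Start immediately and collect per-cluster results...
results = scheduled_cluster_start(batch_configs)

# ...or register a daily 02:00 start instead
scheduled_cluster_start(batch_configs, schedule_time="02:00")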

Error Handling

{
  "success": false,
  "error": {
    "code": "INVALID_STATE",
    "message": "Cluster is not in a stopped state",
    "details": {
      "current_status": "running",
      "required_status": "stopped",
      "suggestion": "Stop the cluster first or wait for it to finish current operations"
    }
  }
}
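
When a start request is rejected, for example with INVALID_STATE while the cluster is still stopping, a retry with backoff is usually enough. A sketch (the retry count, backoff, and the choice of which codes to retry are assumptions, not API guarantees; API_KEY is assumed as in the earlier examples):

import time

import requests

def start_with_retry(cluster_id, payload=None, max_attempts=5, backoff_seconds=30):
    """Retry the start request while the cluster settles into a stopped state."""
    for attempt in range(1, max_attempts + 1):
        response = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/start",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload or {},
        )
        body = response.json()
        if body.get("success"):
            return body

        error = body.get("error", {})
        # Treat INVALID_STATE as transient: the cluster may still be finishing a stop
        if error.get("code") == "INVALID_STATE" and attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
            continue
        raise RuntimeError(f"Cluster start failed: {error.get('message', 'unknown error')}")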

Security Considerations

  • State Validation: Ensure clusters are in the correct state before starting
  • Configuration Updates: Validate configuration changes don’t compromise security
  • Resource Limits: Monitor resource allocation to prevent quota violations
  • Access Control: Verify permissions for configuration updates

Best Practices

  1. Startup Monitoring: Always monitor startup progress for production clusters
  2. Configuration Validation: Test configuration updates in development first
  3. Cost Management: Set appropriate auto-termination limits
  4. Resource Planning: Consider resource availability during peak hours
  5. Backup Strategy: Use snapshots before major configuration changes
  6. Error Handling: Implement proper error handling and retry logic

Authorizations

Authorization (string, header, required)
API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id (string, required)

Response

200 - application/json
Cluster start initiated. The response is of type object.