Start Cluster
curl --request POST \
  --url https://api.tensorone.ai/v2/clusters/{cluster_id}/start \
  --header 'Authorization: <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Start Cluster endpoint starts a previously stopped GPU cluster and can optionally apply configuration updates during startup. This enables cost-effective cluster management: stop clusters when they are not in use and quickly resume work when they are needed.

Endpoint

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/start

Path Parameters

Parameter    Type    Required  Description
cluster_id   string  Yes       Unique cluster identifier

Request Body

Parameter               Type     Required  Description
wait_for_ready          boolean  No        Wait for the cluster to be fully ready before returning (default: false)
timeout_minutes         integer  No        Maximum wait time in minutes (default: 10, max: 30)
update_configuration    object   No        Configuration updates to apply during startup
restore_from_snapshot   string   No        Snapshot ID to restore from
environment_updates     object   No        Environment variable updates
port_mapping_updates    array    No        Port mapping changes
auto_terminate_updates  object   No        Auto-termination setting updates

Configuration Updates

{
  "update_configuration": {
    "cpu_cores": 64,              // Update CPU allocation
    "memory_gb": 512,             // Update memory allocation
    "storage_gb": 2000,           // Expand storage (cannot shrink)
    "docker_image": "new_image",  // Update Docker image
    "gpu_count": 8                // Update GPU count (if available)
  }
}
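
Because storage can only grow, it can help to validate updates client-side before sending the request. A minimal sketch (the helper name and validation rule shown here are illustrative, not part of the API):

def build_update_configuration(current_storage_gb, **changes):
    """Assemble an update_configuration payload, rejecting a storage shrink."""
    new_storage = changes.get("storage_gb")
    if new_storage is not None and new_storage < current_storage_gb:
        raise ValueError("storage_gb can only be expanded, not shrunk")
    return {"update_configuration": changes}

# Example: grow storage and memory on a cluster that currently has 1000 GB
payload = build_update_configuration(1000, memory_gb=512, storage_gb=2000)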

Request Examples

# Basic cluster start
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/start" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "timeout_minutes": 15
  }'

# Start with configuration updates
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/start" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": true,
    "update_configuration": {
      "memory_gb": 512,
      "docker_image": "pytorch/pytorch:2.2-cuda12.1-devel"
    },
    "environment_updates": {
      "BATCH_SIZE": "64",
      "LEARNING_RATE": "0.001"
    },
    "auto_terminate_updates": {
      "enabled": true,
      "idle_minutes": 30
    }
  }'

# Start and restore from snapshot
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/start" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "wait_for_ready": false,
    "restore_from_snapshot": "snap_xyz789",
    "port_mapping_updates": [
      {
        "internal_port": 8888,
        "external_port": 0,
        "protocol": "tcp",
        "description": "Jupyter Lab"
      }
    ]
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "status": "starting",
    "start_initiated_at": "2024-01-15T16:00:00Z",
    "estimated_ready_at": "2024-01-15T16:05:00Z",
    "configuration_updates_applied": [
      "memory_gb: 256 -> 512",
      "docker_image: pytorch:2.1 -> pytorch:2.2"
    ],
    "startup_progress": {
      "phase": "initializing",
      "percentage": 15,
      "current_step": "Allocating resources",
      "steps_completed": 2,
      "total_steps": 8
    },
    "cost": {
      "new_hourly_rate": 12.50,
      "previous_hourly_rate": 8.50,
      "rate_change_reason": "Memory upgrade"
    }
  },
  "meta": {
    "request_id": "req_start_123",
    "wait_for_ready": false
  }
}
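
The fields under data can drive simple client-side checks. For example, a sketch that surfaces applied updates and any pricing change (it assumes body is a parsed response shaped like the example above; optional fields are guarded):

def summarize_start_response(body):
    """Print applied configuration updates and any hourly-rate change."""
    if not body.get("success"):
        raise RuntimeError(body.get("error", {}).get("message", "start failed"))

    data = body["data"]
    for change in data.get("configuration_updates_applied", []):
        print(f"Applied: {change}")

    cost = data.get("cost", {})
    old_rate = cost.get("previous_hourly_rate")
    new_rate = cost.get("new_hourly_rate")
    if old_rate is not None and new_rate is not None and new_rate != old_rate:
        reason = cost.get("rate_change_reason", "unspecified")
        print(f"Hourly rate changed: ${old_rate}/hr -> ${new_rate}/hr ({reason})")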

Startup Progress Tracking

When wait_for_ready is false, you can poll the cluster status to track startup progress:
import time

import requests

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def track_startup_progress(cluster_id):
    """Track cluster startup progress with real-time updates"""

    while True:
        response = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )

        cluster = response.json()["data"]
        status = cluster["status"]

        if status == "running":
            print(f"✅ Cluster {cluster_id} is ready!")
            print(f"Access URL: {cluster['network']['proxy_url']}")
            break
        elif status == "starting":
            # Show detailed progress when the API returns it
            if "startup_progress" in cluster:
                progress = cluster["startup_progress"]
                print(f"🔄 {progress['current_step']} ({progress['percentage']}%)")
            else:
                print(f"🔄 Starting cluster... (Status: {status})")
        elif status == "error":
            print(f"❌ Cluster startup failed: {cluster.get('error_message', 'Unknown error')}")
            break

        time.sleep(10)  # Check every 10 seconds
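
A typical pattern is to issue the start request without waiting and then poll with the function above (this sketch reuses the requests import and API_KEY variable defined there; the cluster ID is illustrative):

start_response = requests.post(
    "https://api.tensorone.ai/v1/clusters/cluster_abc123/start",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"wait_for_ready": False},
)

if start_response.json().get("success"):
    track_startup_progress("cluster_abc123")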

Use Cases

Development Workflow

Start development clusters when team members begin work.
import requests  # API_KEY is assumed to be defined as in the earlier examples

def start_dev_environment(user_id, project_id):
    """Start a development environment for a user"""
    
    # Find user's development cluster
    clusters_response = requests.get(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "project_id": project_id,
            "search": f"dev-{user_id}",
            "status": "stopped"
        }
    )
    
    clusters = clusters_response.json()["data"]["clusters"]
    
    if not clusters:
        return {"error": "No development cluster found for user"}
    
    cluster_id = clusters[0]["id"]
    
    # Start with updated environment
    start_config = {
        "wait_for_ready": True,
        "timeout_minutes": 10,
        "environment_updates": {
            "USER_ID": user_id,
            "PROJECT_ID": project_id,
            "WORKSPACE": f"/workspace/{user_id}"
        },
        "auto_terminate_updates": {
            "enabled": True,
            "idle_minutes": 120  # 2 hours for dev work
        }
    }
    
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/start",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=start_config
    )
    
    return response.json()
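
For example, a sign-on hook could call this helper directly (the user and project IDs are illustrative):

result = start_dev_environment("alice", "proj_456")

if "error" in result:
    print(result["error"])
else:
    print(f"Dev cluster {result['data']['id']} is {result['data']['status']}")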

Training Job Resumption

Resume training jobs from checkpoints with updated configurations.
// Assumes API_KEY is defined in scope (e.g. loaded from an environment variable)
async function resumeTrainingJob(jobId, checkpointPath, newConfig = {}) {
  // Find training cluster for this job
  const clustersResponse = await fetch(`https://api.tensorone.ai/v1/clusters?search=training-${jobId}&status=stopped`, {
    headers: { 'Authorization': 'Bearer ' + API_KEY }
  });
  
  const clusters = await clustersResponse.json();
  
  if (!clusters.success || clusters.data.clusters.length === 0) {
    throw new Error(`No training cluster found for job ${jobId}`);
  }
  
  const clusterId = clusters.data.clusters[0].id;
  
  const startConfig = {
    wait_for_ready: true,
    timeout_minutes: 15,
    environment_updates: {
      RESUME_FROM_CHECKPOINT: checkpointPath,
      JOB_ID: jobId,
      ...newConfig.environment
    },
    update_configuration: {
      ...newConfig.hardware
    },
    auto_terminate_updates: {
      enabled: true,
      cost_limit_usd: newConfig.costLimit || 500.0,
      idle_minutes: 30
    }
  };
  
  const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}/start`, {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(startConfig)
  });
  
  const result = await response.json();
  
  if (result.success) {
    return {
      clusterId: result.data.id,
      jobId: jobId,
      accessUrl: result.data.network.proxy_url,
      sshCommand: result.data.network.ssh_connection.connection_string,
      estimatedCost: result.data.cost.hourly_rate
    };
  }
  
  throw new Error(`Failed to resume training job: ${result.error.message}`);
}

Scheduled Cluster Activation

Start clusters on a schedule for batch processing jobs.
import threading
import time
from datetime import datetime

import requests
import schedule  # third-party scheduler: pip install schedule

# API_KEY is assumed to be defined as in the earlier examples

def scheduled_cluster_start(cluster_configs, schedule_time=None):
    """Start multiple clusters for scheduled batch jobs"""

    def start_batch_clusters():
        results = []
        
        for config in cluster_configs:
            cluster_id = config["cluster_id"]
            updates = config.get("updates", {})
            
            start_payload = {
                "wait_for_ready": False,
                "update_configuration": updates.get("hardware", {}),
                "environment_updates": {
                    "BATCH_JOB_ID": config.get("job_id"),
                    "START_TIME": datetime.now().isoformat(),
                    **updates.get("environment", {})
                },
                "auto_terminate_updates": {
                    "enabled": True,
                    "max_runtime_hours": config.get("max_runtime_hours", 8),
                    "cost_limit_usd": config.get("cost_limit", 100.0)
                }
            }
            
            response = requests.post(
                f"https://api.tensorone.ai/v1/clusters/{cluster_id}/start",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=start_payload
            )
            
            results.append({
                "cluster_id": cluster_id,
                "job_id": config.get("job_id"),
                "success": response.json().get("success", False),
                "response": response.json()
            })
        
        return results
    
    if schedule_time:
        schedule.every().day.at(schedule_time).do(start_batch_clusters)
        
        def run_scheduler():
            while True:
                schedule.run_pending()
                time.sleep(60)
        
        scheduler_thread = threading.Thread(target=run_scheduler)
        scheduler_thread.daemon = True
        scheduler_thread.start()
    else:
        return start_batch_clusters()
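
A call might look like the following (cluster IDs, job IDs, and the 02:00 schedule are illustrative):

batch_configs = [
    {"cluster_id": "cluster_abc123", "job_id": "nightly-etl", "max_runtime_hours": 4},
    {"cluster_id": "cluster_def456", "job_id": "report-gen", "cost_limit": 50.0},
]

# Start immediately and collect per-cluster results...
results = scheduled_cluster_start(batch_configs)

# ...or register a daily 02:00 start instead
scheduled_cluster_start(batch_configs, schedule_time="02:00")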

Error Handling

{
  "success": false,
  "error": {
    "code": "INVALID_STATE",
    "message": "Cluster is not in a stopped state",
    "details": {
      "current_status": "running",
      "required_status": "stopped",
      "suggestion": "Stop the cluster first or wait for it to finish current operations"
    }
  }
}
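
When a start request is rejected, for example with INVALID_STATE while the cluster is still stopping, a retry with backoff is usually enough. A sketch (the retry count, backoff, and the choice of which codes to retry are assumptions, not API guarantees; API_KEY is assumed as in the earlier examples):

import time

import requests

def start_with_retry(cluster_id, payload=None, max_attempts=5, backoff_seconds=30):
    """Retry the start request while the cluster settles into a stopped state."""
    for attempt in range(1, max_attempts + 1):
        response = requests.post(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}/start",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload or {},
        )
        body = response.json()
        if body.get("success"):
            return body

        error = body.get("error", {})
        # Treat INVALID_STATE as transient: the cluster may still be finishing a stop
        if error.get("code") == "INVALID_STATE" and attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
            continue
        raise RuntimeError(f"Cluster start failed: {error.get('message', 'unknown error')}")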

Security Considerations

  • State Validation: Ensure clusters are in the correct state before starting
  • Configuration Updates: Validate configuration changes don’t compromise security
  • Resource Limits: Monitor resource allocation to prevent quota violations
  • Access Control: Verify permissions for configuration updates

Best Practices

  1. Startup Monitoring: Always monitor startup progress for production clusters
  2. Configuration Validation: Test configuration updates in development first
  3. Cost Management: Set appropriate auto-termination limits
  4. Resource Planning: Consider resource availability during peak hours
  5. Backup Strategy: Use snapshots before major configuration changes
  6. Error Handling: Implement proper error handling and retry logic

Authorizations

Authorization (string, header, required)
API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id (string, required)

Response

200 - application/json
Cluster start initiated. The response is of type object.