Stop Cluster
curl --request POST \
  --url https://api.tensorone.ai/v1/clusters/{cluster_id}/stop \
  --header 'Authorization: Bearer <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Stop Cluster endpoint stops running GPU clusters either gracefully (allowing running processes to complete) or forcefully (terminating them immediately). This is essential for cost management, maintenance, and resource optimization.

Endpoint

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/stop

Path Parameters

Parameter | Type | Required | Description
cluster_id | string | Yes | Unique cluster identifier

Request Body

Parameter | Type | Required | Description
force | boolean | No | Force immediate stop without graceful shutdown (default: false)
grace_period_minutes | integer | No | Grace period for graceful shutdown (default: 5, max: 30)
save_state | boolean | No | Create snapshot before stopping (default: false)
snapshot_name | string | No | Custom name for the snapshot
preserve_data | boolean | No | Preserve data volumes (default: true)
wait_for_completion | boolean | No | Wait for stop operation to complete (default: false)
timeout_minutes | integer | No | Maximum wait time for completion (default: 10, max: 60)
stop_reason | string | No | Reason for stopping (for audit logs)
notify_users | array | No | User IDs to notify about the stop operation
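
For reference, a request body that sets every documented parameter might look like the following (all values are illustrative):

{
  "force": false,
  "grace_period_minutes": 10,
  "save_state": true,
  "snapshot_name": "pre_stop_backup",
  "preserve_data": true,
  "wait_for_completion": true,
  "timeout_minutes": 20,
  "stop_reason": "Nightly shutdown",
  "notify_users": ["user_123"]
}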

Request Examples

# Basic graceful stop
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "grace_period_minutes": 10,
    "stop_reason": "Scheduled maintenance"
  }'

# Force stop with snapshot
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "force": true,
    "save_state": true,
    "snapshot_name": "emergency_stop_2024_01_15",
    "wait_for_completion": true,
    "stop_reason": "Emergency maintenance"
  }'

# Graceful stop with user notification
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "grace_period_minutes": 15,
    "save_state": true,
    "preserve_data": true,
    "notify_users": ["user_123", "user_456"],
    "stop_reason": "Cost optimization - end of business day"
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "status": "stopping",
    "stop_initiated_at": "2024-01-15T16:30:00Z",
    "estimated_stop_time": "2024-01-15T16:40:00Z",
    "stop_type": "graceful",
    "grace_period_minutes": 10,
    "stop_progress": {
      "phase": "notifying_processes",
      "percentage": 20,
      "current_step": "Sending SIGTERM to running processes",
      "steps_completed": 2,
      "total_steps": 6
    },
    "snapshot_creation": {
      "enabled": false,
      "snapshot_id": null
    },
    "cost_impact": {
      "hourly_rate_saved": 12.50,
      "session_cost_final": 45.75,
      "billing_stopped_at": "2024-01-15T16:30:00Z"
    },
    "data_preservation": {
      "volumes_preserved": ["root", "data"],
      "snapshots_created": 0
    }
  },
  "meta": {
    "request_id": "req_stop_456",
    "wait_for_completion": false
  }
}

Stop Progress Phases

When stopping clusters, the system goes through several phases:
Phase | Description | Typical Duration
notifying_processes | Sending termination signals to running processes | 10-30 seconds
waiting_for_graceful_exit | Allowing processes to shut down cleanly | 1-15 minutes
creating_snapshot | Creating state snapshot (if requested) | 30 seconds - 5 minutes
terminating_resources | Releasing GPU and compute resources | 30-60 seconds
cleaning_up | Final cleanup and state updates | 10-30 seconds
completed | Stop operation finished | -
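
Stop progress can be tracked by polling the cluster details endpoint (used in the examples below) and reading the stop_progress object. The following is a minimal polling sketch; it assumes that GET /v1/clusters/{cluster_id} returns the same success/data envelope and stop_progress object shown in the response schema above.

import time
import requests

API_KEY = "YOUR_API_KEY"

def wait_for_stop(cluster_id, poll_seconds=15, timeout_minutes=30):
    """Poll cluster details until the stop operation completes or the timeout is reached."""
    deadline = time.time() + timeout_minutes * 60
    while time.time() < deadline:
        response = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        data = response.json()["data"]
        progress = data.get("stop_progress", {})
        print(f"{progress.get('phase', 'unknown')}: {progress.get('percentage', 0)}%")
        if data.get("status") == "stopped" or progress.get("phase") == "completed":
            return data
        time.sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} did not stop within {timeout_minutes} minutes")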

Use Cases

Cost Optimization

Automatically stop idle clusters to save costs.
import requests
from datetime import datetime, timezone

API_KEY = "YOUR_API_KEY"

def stop_idle_clusters(idle_threshold_minutes=30):
    """Stop clusters that have been idle for too long"""

    # Get all running clusters
    response = requests.get(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"status": "running"}
    )

    clusters = response.json()["data"]["clusters"]
    stopped_clusters = []

    for cluster in clusters:
        # Skip clusters that have never reported activity
        last_activity = cluster.get("last_accessed")
        if not last_activity:
            continue

        # last_accessed is an ISO 8601 UTC timestamp, e.g. "2024-01-15T16:30:00Z"
        last_activity_time = datetime.fromisoformat(last_activity.replace("Z", "+00:00"))
        idle_minutes = (datetime.now(timezone.utc) - last_activity_time).total_seconds() / 60

        if idle_minutes > idle_threshold_minutes:
            # Stop the idle cluster via the stop_cluster_graceful helper (a sketch follows this example)
            result = stop_cluster_graceful(
                cluster["id"],
                grace_period_minutes=5,
                reason=f"Auto-stop: idle for {idle_minutes:.1f} minutes"
            )

            if result["success"]:
                stopped_clusters.append({
                    "cluster_id": cluster["id"],
                    "name": cluster["name"],
                    "idle_minutes": idle_minutes,
                    "cost_saved_per_hour": cluster["cost"]["hourly_rate"]
                })

    return stopped_clusters
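
The example above, and the batch example later in this section, call a stop_cluster_graceful helper that is not defined in this document. A minimal sketch of such a helper, using only the request parameters documented above, might look like this:

import requests

API_KEY = "YOUR_API_KEY"

def stop_cluster_graceful(cluster_id, grace_period_minutes=5, save_state=False, reason=None, **extra):
    """Gracefully stop a cluster; extra keyword arguments map to other documented body parameters."""
    payload = {
        "force": False,
        "grace_period_minutes": grace_period_minutes,
        "save_state": save_state,
        "stop_reason": reason,
        **extra  # e.g. preserve_data, snapshot_name, notify_users
    }
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/stop",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={k: v for k, v in payload.items() if v is not None}
    )
    return response.json()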

Scheduled Maintenance

Stop clusters for scheduled maintenance windows.
// Assumes API_KEY is defined and that stopClusterGracefully wraps the stop endpoint
// (see the Python stop_cluster_graceful sketch earlier in this section for an equivalent helper)
async function scheduleMaintenanceStop(clusterIds, maintenanceWindow) {
  const results = [];
  
  for (const clusterId of clusterIds) {
    try {
      // Get cluster info to determine appropriate grace period
      const clusterResponse = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}`, {
        headers: { 'Authorization': 'Bearer ' + API_KEY }
      });
      
      const cluster = await clusterResponse.json();
      
      if (!cluster.success) {
        results.push({ clusterId, success: false, error: cluster.error.message });
        continue;
      }
      
      // Determine grace period based on cluster type
      let gracePeriod = 5; // default
      if (cluster.data.name.includes('training')) {
        gracePeriod = 15; // More time for training clusters
      } else if (cluster.data.name.includes('prod')) {
        gracePeriod = 10; // Production clusters need time to drain
      }
      
      const stopResult = await stopClusterGracefully(clusterId, {
        gracePeriod: gracePeriod,
        createSnapshot: true,
        snapshotName: `maintenance_${maintenanceWindow.id}_${clusterId}`,
        reason: `Scheduled maintenance: ${maintenanceWindow.description}`,
        notifyUsers: cluster.data.project_info?.team_members || []
      });
      
      results.push({
        clusterId,
        success: true,
        gracePeriod,
        snapshotId: stopResult.snapshot_id,
        estimatedDowntime: maintenanceWindow.duration_minutes
      });
      
    } catch (error) {
      results.push({
        clusterId,
        success: false,
        error: error.message
      });
    }
  }
  
  return results;
}

Training Completion Handler

Stop training clusters when jobs complete with proper state preservation.
import time
import requests

API_KEY = "YOUR_API_KEY"

def handle_training_completion(cluster_id, model_name, final_checkpoint_path):
    """Handle training completion with comprehensive state saving"""
    
    # Create final snapshot with model artifacts
    snapshot_name = f"final_model_{model_name}_{int(time.time())}"
    
    stop_config = {
        "force": False,
        "grace_period_minutes": 20,  # Allow time for final checkpoint
        "save_state": True,
        "snapshot_name": snapshot_name,
        "preserve_data": True,
        "wait_for_completion": True,
        "stop_reason": f"Training completed for model: {model_name}",
        "timeout_minutes": 30
    }
    
    # Add environment variable to signal final checkpoint save
    # This would be picked up by the training script
    update_env_response = requests.patch(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/environment",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "TRAINING_COMPLETED": "true",
            "FINAL_CHECKPOINT_PATH": final_checkpoint_path,
            "MODEL_NAME": model_name
        }
    )
    
    # Wait a bit for the training script to notice and save final checkpoint
    time.sleep(60)
    
    # Now stop the cluster
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/stop",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=stop_config
    )
    
    result = response.json()
    
    if result["success"]:
        completion_info = {
            "cluster_id": cluster_id,
            "model_name": model_name,
            "final_snapshot_id": result["data"]["snapshot_creation"]["snapshot_id"],
            "training_duration_hours": result["data"]["cost_summary"]["session_duration_hours"],
            "total_cost": result["data"]["cost_summary"]["total_session_cost"],
            "final_checkpoint_path": final_checkpoint_path,
            "stopped_at": result["data"]["stopped_at"]
        }
        
        # Log completion
        print(f"Training completed for {model_name}")
        print(f"Final snapshot: {completion_info['final_snapshot_id']}")
        print(f"Total cost: ${completion_info['total_cost']:.2f}")
        
        return completion_info
    
    return result
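
A call site for this handler might look like the following; the cluster ID, model name, and checkpoint path are illustrative:

completion = handle_training_completion(
    cluster_id="cluster_abc123",
    model_name="resnet50-finetune",
    final_checkpoint_path="/workspace/checkpoints/final.pt"
)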

Batch Cluster Management

Stop multiple clusters with different strategies based on their usage patterns.
import requests

API_KEY = "YOUR_API_KEY"

def intelligent_batch_stop(project_id, stop_criteria):
    """Stop clusters based on intelligent criteria; relies on the stop_cluster_graceful helper sketched earlier"""
    
    # Get all clusters for the project
    response = requests.get(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "project_id": project_id,
            "status": "running",
            "include_metrics": True
        }
    )
    
    clusters = response.json()["data"]["clusters"]
    stop_decisions = []
    
    for cluster in clusters:
        decision = {
            "cluster_id": cluster["id"],
            "name": cluster["name"],
            "should_stop": False,
            "reason": "",
            "stop_config": {}
        }
        
        metrics = cluster["metrics"]["current"]
        cost = cluster["cost"]["hourly_rate"]
        
        # Criteria 1: Low utilization + high cost
        if metrics["gpu_utilization"] < stop_criteria.get("min_gpu_util", 20) and cost > stop_criteria.get("max_cost_for_low_util", 5.0):
            decision["should_stop"] = True
            decision["reason"] = f"Low utilization ({metrics['gpu_utilization']}%) with high cost (${cost}/hr)"
            decision["stop_config"] = {
                "grace_period_minutes": 5,
                "save_state": False,
                "reason": "Cost optimization - low utilization"
            }
        
        # Criteria 2: Development clusters after hours
        elif "dev" in cluster["name"].lower() and stop_criteria.get("stop_dev_after_hours", False):
            decision["should_stop"] = True
            decision["reason"] = "Development cluster - after hours shutdown"
            decision["stop_config"] = {
                "grace_period_minutes": 10,
                "save_state": True,
                "reason": "Scheduled dev environment shutdown"
            }
        
        # Criteria 3: Idle clusters
        elif cluster.get("uptime_seconds", 0) > stop_criteria.get("max_idle_seconds", 3600):
            last_access = cluster.get("last_accessed")
            if last_access:
                # Idle-time handling omitted here; see stop_idle_clusters above for the same calculation
                pass
        
        stop_decisions.append(decision)
    
    # Execute stop operations
    results = []
    for decision in stop_decisions:
        if decision["should_stop"]:
            result = stop_cluster_graceful(
                decision["cluster_id"],
                **decision["stop_config"]
            )
            results.append({
                "cluster_id": decision["cluster_id"],
                "name": decision["name"],
                "reason": decision["reason"],
                "success": result["success"],
                "response": result
            })
    
    return results
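
The stop_criteria argument is a plain dictionary, and every key read by the function above is optional. A typical invocation, with an illustrative project ID, might look like this:

results = intelligent_batch_stop(
    project_id="proj_789",
    stop_criteria={
        "min_gpu_util": 20,            # stop when GPU utilization falls below this percentage
        "max_cost_for_low_util": 5.0,  # ...and the hourly rate exceeds this amount
        "stop_dev_after_hours": True,  # shut down dev clusters outside business hours
        "max_idle_seconds": 3600       # uptime threshold used by the idle check
    }
)
for r in results:
    status = "stopped" if r["success"] else "failed"
    print(f"{r['name']}: {r['reason']} ({status})")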

Error Handling

Attempting to stop a cluster that is not in a running state returns an INVALID_STATE error:

{
  "success": false,
  "error": {
    "code": "INVALID_STATE",
    "message": "Cluster is not in a running state",
    "details": {
      "current_status": "stopped",
      "required_status": "running",
      "suggestion": "Cluster is already stopped"
    }
  }
}
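
In client code, the INVALID_STATE error shown above is often safe to treat as a no-op, since it indicates the cluster is already stopped. A minimal handling sketch might look like this:

import requests

API_KEY = "YOUR_API_KEY"

def stop_if_running(cluster_id):
    """Stop a cluster, treating an already-stopped cluster as success."""
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/stop",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"stop_reason": "Idempotent stop"}
    )
    body = response.json()
    if body.get("success"):
        return body["data"]
    error = body.get("error", {})
    if error.get("code") == "INVALID_STATE" and error.get("details", {}).get("current_status") == "stopped":
        # The cluster is already stopped; nothing to do
        return {"id": cluster_id, "status": "stopped"}
    raise RuntimeError(f"Stop failed: {error.get('code')}: {error.get('message')}")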

Security Considerations

  • Data Protection: Always preserve important data before stopping clusters
  • Process Safety: Use appropriate grace periods for critical workloads
  • Access Control: Verify permissions before stopping shared clusters
  • Audit Logging: Include stop reasons for compliance and troubleshooting

Best Practices

  1. Graceful Shutdown: Always prefer graceful stops over forced termination
  2. State Preservation: Create snapshots for important work states
  3. Cost Monitoring: Use stop operations for effective cost management
  4. Communication: Notify team members before stopping shared clusters
  5. Automation: Implement intelligent stopping based on usage patterns
  6. Data Backup: Ensure critical data is backed up before stopping clusters

Authorizations

Authorization (string, header, required): API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id (string, required): Unique cluster identifier.

Response

200 - application/json: Cluster stop initiated. The response is of type object.