Stop Cluster
curl --request POST \
  --url https://api.tensorone.ai/v1/clusters/{cluster_id}/stop \
  --header 'Authorization: Bearer <api-key>'
{
  "id": "<string>",
  "name": "<string>",
  "status": "running",
  "gpuType": "<string>",
  "containerDiskSize": 123,
  "volumeSize": 123,
  "createdAt": "2023-11-07T05:31:56Z"
}

Overview

The Stop Cluster endpoint stops running GPU clusters either gracefully (allowing running processes to complete) or forcefully (terminating them immediately). This is essential for cost management, maintenance, and resource optimization.

Endpoint

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/stop

Path Parameters

Parameter | Type | Required | Description
cluster_id | string | Yes | Unique cluster identifier

Request Body

Parameter | Type | Required | Description
force | boolean | No | Force immediate stop without graceful shutdown (default: false)
grace_period_minutes | integer | No | Grace period for graceful shutdown (default: 5, max: 30)
save_state | boolean | No | Create snapshot before stopping (default: false)
snapshot_name | string | No | Custom name for the snapshot
preserve_data | boolean | No | Preserve data volumes (default: true)
wait_for_completion | boolean | No | Wait for stop operation to complete (default: false)
timeout_minutes | integer | No | Maximum wait time for completion (default: 10, max: 60)
stop_reason | string | No | Reason for stopping (for audit logs)
notify_users | array | No | User IDs to notify about the stop operation
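
For reference, a request body that sets every documented parameter might look like the following (all values are illustrative):

{
  "force": false,
  "grace_period_minutes": 10,
  "save_state": true,
  "snapshot_name": "pre_stop_backup",
  "preserve_data": true,
  "wait_for_completion": true,
  "timeout_minutes": 20,
  "stop_reason": "Nightly shutdown",
  "notify_users": ["user_123"]
}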

Request Examples

# Basic graceful stop
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "grace_period_minutes": 10,
    "stop_reason": "Scheduled maintenance"
  }'

# Force stop with snapshot
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "force": true,
    "save_state": true,
    "snapshot_name": "emergency_stop_2024_01_15",
    "wait_for_completion": true,
    "stop_reason": "Emergency maintenance"
  }'

# Graceful stop with user notification
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "grace_period_minutes": 15,
    "save_state": true,
    "preserve_data": true,
    "notify_users": ["user_123", "user_456"],
    "stop_reason": "Cost optimization - end of business day"
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "status": "stopping",
    "stop_initiated_at": "2024-01-15T16:30:00Z",
    "estimated_stop_time": "2024-01-15T16:40:00Z",
    "stop_type": "graceful",
    "grace_period_minutes": 10,
    "stop_progress": {
      "phase": "notifying_processes",
      "percentage": 20,
      "current_step": "Sending SIGTERM to running processes",
      "steps_completed": 2,
      "total_steps": 6
    },
    "snapshot_creation": {
      "enabled": false,
      "snapshot_id": null
    },
    "cost_impact": {
      "hourly_rate_saved": 12.50,
      "session_cost_final": 45.75,
      "billing_stopped_at": "2024-01-15T16:30:00Z"
    },
    "data_preservation": {
      "volumes_preserved": ["root", "data"],
      "snapshots_created": 0
    }
  },
  "meta": {
    "request_id": "req_stop_456",
    "wait_for_completion": false
  }
}

Stop Progress Phases

When stopping clusters, the system goes through several phases:
Phase | Description | Typical Duration
notifying_processes | Sending termination signals to running processes | 10-30 seconds
waiting_for_graceful_exit | Allowing processes to shut down cleanly | 1-15 minutes
creating_snapshot | Creating state snapshot (if requested) | 30 seconds - 5 minutes
terminating_resources | Releasing GPU and compute resources | 30-60 seconds
cleaning_up | Final cleanup and state updates | 10-30 seconds
completed | Stop operation finished | -
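
Stop progress can be tracked by polling the cluster details endpoint (used in the examples below) and reading the stop_progress object. The following is a minimal polling sketch; it assumes that GET /v1/clusters/{cluster_id} returns the same success/data envelope and stop_progress object shown in the response schema above.

import time
import requests

API_KEY = "YOUR_API_KEY"

def wait_for_stop(cluster_id, poll_seconds=15, timeout_minutes=30):
    """Poll cluster details until the stop operation completes or the timeout is reached."""
    deadline = time.time() + timeout_minutes * 60
    while time.time() < deadline:
        response = requests.get(
            f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        data = response.json()["data"]
        progress = data.get("stop_progress", {})
        print(f"{progress.get('phase', 'unknown')}: {progress.get('percentage', 0)}%")
        if data.get("status") == "stopped" or progress.get("phase") == "completed":
            return data
        time.sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} did not stop within {timeout_minutes} minutes")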

Use Cases

Cost Optimization

Automatically stop idle clusters to save costs.
import requests
from datetime import datetime, timezone

API_KEY = "YOUR_API_KEY"

def stop_idle_clusters(idle_threshold_minutes=30):
    """Stop clusters that have been idle for too long"""

    # Get all running clusters
    response = requests.get(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"status": "running"}
    )

    clusters = response.json()["data"]["clusters"]
    stopped_clusters = []

    for cluster in clusters:
        # Skip clusters that have never reported activity
        last_activity = cluster.get("last_accessed")
        if not last_activity:
            continue

        # last_accessed is an ISO 8601 UTC timestamp, e.g. "2024-01-15T16:30:00Z"
        last_activity_time = datetime.fromisoformat(last_activity.replace("Z", "+00:00"))
        idle_minutes = (datetime.now(timezone.utc) - last_activity_time).total_seconds() / 60

        if idle_minutes > idle_threshold_minutes:
            # Stop the idle cluster via the stop_cluster_graceful helper (a sketch follows this example)
            result = stop_cluster_graceful(
                cluster["id"],
                grace_period_minutes=5,
                reason=f"Auto-stop: idle for {idle_minutes:.1f} minutes"
            )

            if result["success"]:
                stopped_clusters.append({
                    "cluster_id": cluster["id"],
                    "name": cluster["name"],
                    "idle_minutes": idle_minutes,
                    "cost_saved_per_hour": cluster["cost"]["hourly_rate"]
                })

    return stopped_clusters
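
The example above, and the batch example later in this section, call a stop_cluster_graceful helper that is not defined in this document. A minimal sketch of such a helper, using only the request parameters documented above, might look like this:

import requests

API_KEY = "YOUR_API_KEY"

def stop_cluster_graceful(cluster_id, grace_period_minutes=5, save_state=False, reason=None, **extra):
    """Gracefully stop a cluster; extra keyword arguments map to other documented body parameters."""
    payload = {
        "force": False,
        "grace_period_minutes": grace_period_minutes,
        "save_state": save_state,
        "stop_reason": reason,
        **extra  # e.g. preserve_data, snapshot_name, notify_users
    }
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/stop",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={k: v for k, v in payload.items() if v is not None}
    )
    return response.json()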

Scheduled Maintenance

Stop clusters for scheduled maintenance windows.
// Assumes API_KEY is defined and that stopClusterGracefully wraps the stop endpoint
// (see the Python stop_cluster_graceful sketch earlier in this section for an equivalent helper)
async function scheduleMaintenanceStop(clusterIds, maintenanceWindow) {
  const results = [];
  
  for (const clusterId of clusterIds) {
    try {
      // Get cluster info to determine appropriate grace period
      const clusterResponse = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}`, {
        headers: { 'Authorization': 'Bearer ' + API_KEY }
      });
      
      const cluster = await clusterResponse.json();
      
      if (!cluster.success) {
        results.push({ clusterId, success: false, error: cluster.error.message });
        continue;
      }
      
      // Determine grace period based on cluster type
      let gracePeriod = 5; // default
      if (cluster.data.name.includes('training')) {
        gracePeriod = 15; // More time for training clusters
      } else if (cluster.data.name.includes('prod')) {
        gracePeriod = 10; // Production clusters need time to drain
      }
      
      const stopResult = await stopClusterGracefully(clusterId, {
        gracePeriod: gracePeriod,
        createSnapshot: true,
        snapshotName: `maintenance_${maintenanceWindow.id}_${clusterId}`,
        reason: `Scheduled maintenance: ${maintenanceWindow.description}`,
        notifyUsers: cluster.data.project_info?.team_members || []
      });
      
      results.push({
        clusterId,
        success: true,
        gracePeriod,
        snapshotId: stopResult.snapshot_id,
        estimatedDowntime: maintenanceWindow.duration_minutes
      });
      
    } catch (error) {
      results.push({
        clusterId,
        success: false,
        error: error.message
      });
    }
  }
  
  return results;
}

Training Completion Handler

Stop training clusters when jobs complete with proper state preservation.
import time
import requests

API_KEY = "YOUR_API_KEY"

def handle_training_completion(cluster_id, model_name, final_checkpoint_path):
    """Handle training completion with comprehensive state saving"""
    
    # Create final snapshot with model artifacts
    snapshot_name = f"final_model_{model_name}_{int(time.time())}"
    
    stop_config = {
        "force": False,
        "grace_period_minutes": 20,  # Allow time for final checkpoint
        "save_state": True,
        "snapshot_name": snapshot_name,
        "preserve_data": True,
        "wait_for_completion": True,
        "stop_reason": f"Training completed for model: {model_name}",
        "timeout_minutes": 30
    }
    
    # Add environment variable to signal final checkpoint save
    # This would be picked up by the training script
    update_env_response = requests.patch(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/environment",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "TRAINING_COMPLETED": "true",
            "FINAL_CHECKPOINT_PATH": final_checkpoint_path,
            "MODEL_NAME": model_name
        }
    )
    
    # Wait a bit for the training script to notice and save final checkpoint
    time.sleep(60)
    
    # Now stop the cluster
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/stop",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=stop_config
    )
    
    result = response.json()
    
    if result["success"]:
        completion_info = {
            "cluster_id": cluster_id,
            "model_name": model_name,
            "final_snapshot_id": result["data"]["snapshot_creation"]["snapshot_id"],
            "training_duration_hours": result["data"]["cost_summary"]["session_duration_hours"],
            "total_cost": result["data"]["cost_summary"]["total_session_cost"],
            "final_checkpoint_path": final_checkpoint_path,
            "stopped_at": result["data"]["stopped_at"]
        }
        
        # Log completion
        print(f"Training completed for {model_name}")
        print(f"Final snapshot: {completion_info['final_snapshot_id']}")
        print(f"Total cost: ${completion_info['total_cost']:.2f}")
        
        return completion_info
    
    return result
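
A call site for this handler might look like the following; the cluster ID, model name, and checkpoint path are illustrative:

completion = handle_training_completion(
    cluster_id="cluster_abc123",
    model_name="resnet50-finetune",
    final_checkpoint_path="/workspace/checkpoints/final.pt"
)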

Batch Cluster Management

Stop multiple clusters with different strategies based on their usage patterns.
import requests

API_KEY = "YOUR_API_KEY"

def intelligent_batch_stop(project_id, stop_criteria):
    """Stop clusters based on intelligent criteria; relies on the stop_cluster_graceful helper sketched earlier"""
    
    # Get all clusters for the project
    response = requests.get(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "project_id": project_id,
            "status": "running",
            "include_metrics": True
        }
    )
    
    clusters = response.json()["data"]["clusters"]
    stop_decisions = []
    
    for cluster in clusters:
        decision = {
            "cluster_id": cluster["id"],
            "name": cluster["name"],
            "should_stop": False,
            "reason": "",
            "stop_config": {}
        }
        
        metrics = cluster["metrics"]["current"]
        cost = cluster["cost"]["hourly_rate"]
        
        # Criteria 1: Low utilization + high cost
        if metrics["gpu_utilization"] < stop_criteria.get("min_gpu_util", 20) and cost > stop_criteria.get("max_cost_for_low_util", 5.0):
            decision["should_stop"] = True
            decision["reason"] = f"Low utilization ({metrics['gpu_utilization']}%) with high cost (${cost}/hr)"
            decision["stop_config"] = {
                "grace_period_minutes": 5,
                "save_state": False,
                "reason": "Cost optimization - low utilization"
            }
        
        # Criteria 2: Development clusters after hours
        elif "dev" in cluster["name"].lower() and stop_criteria.get("stop_dev_after_hours", False):
            decision["should_stop"] = True
            decision["reason"] = "Development cluster - after hours shutdown"
            decision["stop_config"] = {
                "grace_period_minutes": 10,
                "save_state": True,
                "reason": "Scheduled dev environment shutdown"
            }
        
        # Criteria 3: Idle clusters
        elif cluster.get("uptime_seconds", 0) > stop_criteria.get("max_idle_seconds", 3600):
            last_access = cluster.get("last_accessed")
            if last_access:
                # Idle-time handling omitted here; see stop_idle_clusters above for the same calculation
                pass
        
        stop_decisions.append(decision)
    
    # Execute stop operations
    results = []
    for decision in stop_decisions:
        if decision["should_stop"]:
            result = stop_cluster_graceful(
                decision["cluster_id"],
                **decision["stop_config"]
            )
            results.append({
                "cluster_id": decision["cluster_id"],
                "name": decision["name"],
                "reason": decision["reason"],
                "success": result["success"],
                "response": result
            })
    
    return results
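
The stop_criteria argument is a plain dictionary, and every key read by the function above is optional. A typical invocation, with an illustrative project ID, might look like this:

results = intelligent_batch_stop(
    project_id="proj_789",
    stop_criteria={
        "min_gpu_util": 20,            # stop when GPU utilization falls below this percentage
        "max_cost_for_low_util": 5.0,  # ...and the hourly rate exceeds this amount
        "stop_dev_after_hours": True,  # shut down dev clusters outside business hours
        "max_idle_seconds": 3600       # uptime threshold used by the idle check
    }
)
for r in results:
    status = "stopped" if r["success"] else "failed"
    print(f"{r['name']}: {r['reason']} ({status})")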

Error Handling

Attempting to stop a cluster that is not in a running state returns an INVALID_STATE error:

{
  "success": false,
  "error": {
    "code": "INVALID_STATE",
    "message": "Cluster is not in a running state",
    "details": {
      "current_status": "stopped",
      "required_status": "running",
      "suggestion": "Cluster is already stopped"
    }
  }
}
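
In client code, the INVALID_STATE error shown above is often safe to treat as a no-op, since it indicates the cluster is already stopped. A minimal handling sketch might look like this:

import requests

API_KEY = "YOUR_API_KEY"

def stop_if_running(cluster_id):
    """Stop a cluster, treating an already-stopped cluster as success."""
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/stop",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"stop_reason": "Idempotent stop"}
    )
    body = response.json()
    if body.get("success"):
        return body["data"]
    error = body.get("error", {})
    if error.get("code") == "INVALID_STATE" and error.get("details", {}).get("current_status") == "stopped":
        # The cluster is already stopped; nothing to do
        return {"id": cluster_id, "status": "stopped"}
    raise RuntimeError(f"Stop failed: {error.get('code')}: {error.get('message')}")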

Security Considerations

  • Data Protection: Always preserve important data before stopping clusters
  • Process Safety: Use appropriate grace periods for critical workloads
  • Access Control: Verify permissions before stopping shared clusters
  • Audit Logging: Include stop reasons for compliance and troubleshooting

Best Practices

  1. Graceful Shutdown: Always prefer graceful stops over forced termination
  2. State Preservation: Create snapshots for important work states
  3. Cost Monitoring: Use stop operations for effective cost management
  4. Communication: Notify team members before stopping shared clusters
  5. Automation: Implement intelligent stopping based on usage patterns
  6. Data Backup: Ensure critical data is backed up before stopping clusters

Authorizations

Authorization (string, header, required): API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id (string, required): Unique cluster identifier.

Response

200 - application/json: Cluster stop initiated. The response is of type object.