Create Cluster

Overview

The Create Cluster endpoint provisions new GPU clusters with configurable GPU types, storage, networking, and security settings. It is suited to ML training, development environments, and production AI workloads.

Endpoint

POST https://api.tensorone.ai/v1/clusters

Request Body

| Parameter | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Cluster name (3-64 characters, alphanumeric and hyphens) |
| description | string | No | Optional cluster description |
| gpu_type | string | Yes | GPU type: A100, H100, RTX4090, V100, T4, RTX3090 |
| gpu_count | integer | Yes | Number of GPUs (1-8 depending on GPU type) |
| cpu_cores | integer | No | CPU cores (auto-calculated if not specified) |
| memory_gb | integer | No | RAM in GB (auto-calculated if not specified) |
| storage_gb | integer | Yes | Persistent storage in GB (minimum 50GB) |
| region | string | Yes | Deployment region |
| project_id | string | Yes | Project ID for organization |
| template_id | string | No | Template ID for pre-configured environments |
| docker_image | string | No | Custom Docker image (if not using template) |
| environment_variables | object | No | Environment variables for the cluster |
| ssh_enabled | boolean | No | Enable SSH access (default: true) |
| ssh_public_keys | array | No | SSH public keys for access |
| port_mappings | array | No | Port forwarding configuration (external_port 0 requests an auto-assigned port) |
| auto_start | boolean | No | Start cluster immediately (default: true) |
| auto_terminate | object | No | Auto-termination settings |
| network_config | object | No | Advanced networking configuration |
| security_groups | array | No | Security group IDs |
| tags | object | No | Resource tags for organization |
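Six of the parameters above are required. A minimal request body, plus a small client-side validator, can be sketched in Python; validate_cluster_config and minimal_body are illustrative helpers, not part of the API:

```python
# Required fields per the Request Body table above
REQUIRED_FIELDS = {"name", "gpu_type", "gpu_count", "storage_gb", "region", "project_id"}

def validate_cluster_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the body looks sendable."""
    errors = [f"missing required field: {f}" for f in sorted(REQUIRED_FIELDS - config.keys())]
    name = config.get("name", "")
    if name and not (3 <= len(name) <= 64):
        errors.append("name must be 3-64 characters")
    if config.get("storage_gb", 50) < 50:
        errors.append("storage_gb minimum is 50")
    return errors

# Smallest body that satisfies the required columns above
minimal_body = {
    "name": "my-cluster",
    "gpu_type": "A100",
    "gpu_count": 1,
    "storage_gb": 100,
    "region": "us-west-2",
    "project_id": "proj_123",
}
```

Running the validator before POSTing catches the same errors the API would return as a VALIDATION_ERROR.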

Request Examples

# Create basic ML training cluster
curl -X POST "https://api.tensorone.ai/v1/clusters" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llm-training-cluster",
    "description": "Large language model training environment",
    "gpu_type": "A100",
    "gpu_count": 4,
    "storage_gb": 1000,
    "region": "us-west-2",
    "project_id": "proj_123",
    "template_id": "tmpl_pytorch_latest",
    "ssh_enabled": true,
    "auto_terminate": {
      "enabled": true,
      "idle_minutes": 60,
      "max_runtime_hours": 24
    }
  }'

# Create development cluster with custom Docker image
curl -X POST "https://api.tensorone.ai/v1/clusters" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "dev-environment",
    "gpu_type": "RTX4090",
    "gpu_count": 1,
    "cpu_cores": 16,
    "memory_gb": 64,
    "storage_gb": 500,
    "region": "us-east-1",
    "project_id": "proj_456",
    "docker_image": "pytorch/pytorch:2.1-cuda11.8-devel",
    "environment_variables": {
      "CUDA_VISIBLE_DEVICES": "0",
      "PYTHONPATH": "/workspace",
      "WANDB_API_KEY": "$WANDB_KEY"
    },
    "port_mappings": [
      {
        "internal_port": 8888,
        "external_port": 0,
        "protocol": "tcp",
        "description": "Jupyter Lab"
      }
    ]
  }'

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "llm-training-cluster",
    "description": "Large language model training environment",
    "status": "starting",
    "gpu_type": "A100",
    "gpu_count": 4,
    "cpu_cores": 32,
    "memory_gb": 256,
    "storage_gb": 1000,
    "region": "us-west-2",
    "project_id": "proj_123",
    "template_id": "tmpl_pytorch_latest",
    "docker_image": "tensorone/pytorch:2.1-cuda11.8",
    "ssh_enabled": true,
    "ssh_connection": {
      "host": "ssh-abc123.tensorone.ai",
      "port": 22,
      "username": "root",
      "status": "pending"
    },
    "port_mappings": [
      {
        "internal_port": 8888,
        "external_port": 32001,
        "protocol": "tcp",
        "description": "Jupyter Lab",
        "url": "https://cluster-abc123.tensorone.ai:32001"
      }
    ],
    "proxy_url": "https://cluster-abc123.tensorone.ai",
    "environment_variables": {
      "CUDA_VISIBLE_DEVICES": "0,1,2,3",
      "NCCL_SOCKET_IFNAME": "eth0"
    },
    "cost": {
      "hourly_rate": 8.50,
      "estimated_monthly": 6120.00,
      "currency": "USD"
    },
    "auto_terminate": {
      "enabled": true,
      "idle_minutes": 60,
      "max_runtime_hours": 24,
      "estimated_termination": "2024-01-16T14:30:00Z"
    },
    "network_config": {
      "private_ip": "10.0.1.15",
      "public_ip": "203.0.113.42",
      "bandwidth_limit_mbps": 1000
    },
    "security_groups": ["sg_default_ml"],
    "tags": {
      "team": "ml-research",
      "environment": "training"
    },
    "created_at": "2024-01-15T14:30:00Z",
    "updated_at": "2024-01-15T14:30:00Z",
    "estimated_ready_at": "2024-01-15T14:35:00Z"
  },
  "meta": {
    "request_id": "req_create_789",
    "estimated_setup_time_minutes": 5
  }
}
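A newly created cluster starts in "starting" status and becomes usable around estimated_ready_at. A polling sketch follows; the fetch callable stands in for a GET request on the cluster, since this page does not document a retrieval endpoint, so treat that shape as an assumption:

```python
import time

def wait_until_running(cluster_id: str, fetch, timeout_s: int = 600,
                       interval_s: int = 10) -> dict:
    """Poll until the cluster leaves the 'starting' status.

    fetch(cluster_id) must return the parsed response body, i.e. a dict with
    a "data" key shaped like the Response Schema above (hypothetical wrapper
    around a GET-cluster call).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        data = fetch(cluster_id)["data"]
        if data["status"] != "starting":
            return data
        time.sleep(interval_s)
    raise TimeoutError(f"cluster {cluster_id} still starting after {timeout_s}s")
```

Keeping the transport behind a callable also makes the wait loop easy to test without network access.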

Configuration Options

GPU Types and Availability

| GPU Type | Memory | Cores | Max Count | Hourly Rate | Best For |
|---|---|---|---|---|---|
| A100 | 80GB | 6912 | 8 | $2.50+ | Large model training, inference |
| H100 | 80GB | 16896 | 8 | $4.00+ | Latest generation, fastest training |
| RTX4090 | 24GB | 16384 | 4 | $0.80+ | Development, medium models |
| V100 | 32GB | 5120 | 8 | $1.20+ | Legacy support, cost-effective |
| T4 | 16GB | 2560 | 4 | $0.50+ | Inference, light training |
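The memory and rate columns above can drive a simple selection helper. A sketch, where GPU_SPECS mirrors the table (rates are the "+" starting prices) and cheapest_gpu_for is an illustrative helper:

```python
# Memory, max count, and starting hourly rate per the GPU table above
GPU_SPECS = {
    "A100":    {"memory_gb": 80, "max_count": 8, "hourly_rate": 2.50},
    "H100":    {"memory_gb": 80, "max_count": 8, "hourly_rate": 4.00},
    "RTX4090": {"memory_gb": 24, "max_count": 4, "hourly_rate": 0.80},
    "V100":    {"memory_gb": 32, "max_count": 8, "hourly_rate": 1.20},
    "T4":      {"memory_gb": 16, "max_count": 4, "hourly_rate": 0.50},
}

def cheapest_gpu_for(memory_gb_needed: int) -> str:
    """Pick the lowest starting-rate GPU whose memory fits the workload."""
    candidates = [(spec["hourly_rate"], name)
                  for name, spec in GPU_SPECS.items()
                  if spec["memory_gb"] >= memory_gb_needed]
    if not candidates:
        raise ValueError(f"no single GPU type has {memory_gb_needed}GB of memory")
    return min(candidates)[1]
```

For example, a workload needing 24GB of GPU memory resolves to RTX4090 rather than the pricier 80GB cards.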

Storage Options

| Type | Min Size | Max Size | Performance | Use Case |
|---|---|---|---|---|
| ssd | 50GB | 10TB | High IOPS | OS, applications, fast data access |
| nvme | 100GB | 5TB | Ultra-high IOPS | Training data, checkpoints |
| hdd | 100GB | 50TB | Standard | Archives, large datasets |

Auto-termination Settings

{
  "auto_terminate": {
    "enabled": true,
    "idle_minutes": 30,           // Terminate after idle time
    "max_runtime_hours": 24,      // Maximum runtime limit
    "cost_limit_usd": 100.0,      // Cost-based termination
    "schedule": {                 // Scheduled termination
      "type": "cron",
      "expression": "0 18 * * 5"  // Every Friday at 6 PM
    }
  }
}
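When both max_runtime_hours and cost_limit_usd are set, whichever threshold is hit first terminates the cluster. A sketch for sizing the cost limit against the worst-case runtime cost, taking the hourly rate from the create response's cost.hourly_rate; the headroom margin is an illustrative choice, not an API parameter:

```python
def worst_case_cost(hourly_rate: float, max_runtime_hours: int) -> float:
    """Upper bound on spend if the cluster runs for the full runtime limit."""
    return round(hourly_rate * max_runtime_hours, 2)

def auto_terminate_config(hourly_rate: float, max_runtime_hours: int = 24,
                          idle_minutes: int = 30, headroom: float = 1.1) -> dict:
    """Build an auto_terminate block whose cost limit slightly exceeds the
    worst-case runtime cost, so the runtime limit normally fires first."""
    return {
        "enabled": True,
        "idle_minutes": idle_minutes,
        "max_runtime_hours": max_runtime_hours,
        "cost_limit_usd": round(
            worst_case_cost(hourly_rate, max_runtime_hours) * headroom, 2),
    }
```

At the $8.50/hour rate shown in the response example, a 24-hour limit bounds spend at $204 before headroom.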

Use Cases

ML Model Training

Create powerful multi-GPU clusters for training large language models and computer vision models.
import time

import requests

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def create_training_cluster(model_size="large"):
    config = {
        "name": f"training-{model_size}-{int(time.time())}",
        "gpu_type": "A100" if model_size == "large" else "RTX4090",
        "gpu_count": 8 if model_size == "large" else 2,
        "storage_gb": 2000,
        "region": "us-west-2",
        "template_id": "tmpl_pytorch_distributed",
        "environment_variables": {
            "MODEL_SIZE": model_size,
            "BATCH_SIZE": "32" if model_size == "large" else "64"
        },
        "auto_terminate": {
            "enabled": True,
            "cost_limit_usd": 500.0
        }
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v1/clusters",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=config
    )
    
    return response.json()["data"]

Development Environment

Set up interactive development environments with Jupyter, VSCode, and debugging tools.
const API_KEY = process.env.TENSORONE_API_KEY;  // read the key from the environment

async function createDevEnvironment(teamMember) {
  const config = {
    name: `dev-${teamMember.username}`,
    description: `Development environment for ${teamMember.name}`,
    gpu_type: 'RTX4090',
    gpu_count: 1,
    storage_gb: 500,
    region: 'us-east-1',
    project_id: teamMember.project_id,
    template_id: 'tmpl_jupyter_vscode',
    ssh_public_keys: [teamMember.ssh_key],
    port_mappings: [
      { internal_port: 8888, external_port: 0, description: 'Jupyter' },
      { internal_port: 8080, external_port: 0, description: 'VSCode' }
    ],
    auto_terminate: {
      enabled: true,
      idle_minutes: 120  // 2 hours idle timeout
    }
  };
  
  const response = await fetch('https://api.tensorone.ai/v1/clusters', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(config)
  });
  
  return await response.json();
}

Production Inference

Deploy production-ready inference clusters with load balancing and auto-scaling.
import requests

API_KEY = "YOUR_API_KEY"  # replace with your TensorOne API key

def create_inference_cluster(model_name, replicas=3):
    config = {
        "name": f"inference-{model_name}",
        "description": f"Production inference cluster for {model_name}",
        "gpu_type": "T4",
        "gpu_count": 1,
        "cpu_cores": 8,
        "memory_gb": 32,
        "storage_gb": 200,
        "region": "us-east-1",
        "docker_image": f"myregistry/models:{model_name}",
        "environment_variables": {
            "MODEL_NAME": model_name,
            "BATCH_SIZE": "8",
            "MAX_CONCURRENT": "10"
        },
        "port_mappings": [
            {
                "internal_port": 8000,
                "external_port": 80,
                "protocol": "tcp",
                "description": "API Endpoint"
            }
        ],
        "network_config": {
            "enable_load_balancer": True,
            "health_check_path": "/health"
        },
        "auto_terminate": {
            "enabled": False  # Keep running for production
        }
    }
    
    # Create multiple replicas
    clusters = []
    for i in range(replicas):
        config["name"] = f"inference-{model_name}-{i+1}"
        response = requests.post(
            "https://api.tensorone.ai/v1/clusters",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=config
        )
        clusters.append(response.json()["data"])
    
    return clusters

Error Handling

{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid cluster configuration",
    "details": {
      "gpu_count": "Maximum 4 GPUs allowed for RTX4090",
      "storage_gb": "Minimum storage is 50GB",
      "region": "Region 'invalid-region' is not available"
    }
  }
}
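Clients should branch on the success flag and surface the per-field details map rather than just the top-level message. A sketch assuming the error envelope above; ClusterCreateError and unwrap_create_response are illustrative helpers:

```python
class ClusterCreateError(Exception):
    """Raised when the API reports success=false; carries field-level details."""
    def __init__(self, code: str, message: str, details: dict):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details

def unwrap_create_response(payload: dict) -> dict:
    """Return the cluster data on success, or raise with validation details."""
    if payload.get("success"):
        return payload["data"]
    err = payload.get("error", {})
    raise ClusterCreateError(err.get("code", "UNKNOWN"),
                             err.get("message", "request failed"),
                             err.get("details", {}))
```

For a VALIDATION_ERROR, the details dict maps each rejected parameter (gpu_count, storage_gb, region in the example above) to a human-readable reason, which can be echoed back to the user field by field.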

Security Considerations

  • SSH Keys: Always use strong SSH key pairs and rotate them regularly
  • Network Security: Configure security groups and firewall rules appropriately
  • Environment Variables: Never store secrets in plain text; use encrypted secrets
  • Access Control: Ensure proper project-based access controls
  • Cost Monitoring: Implement cost alerts to prevent unexpected charges

Best Practices

  1. Resource Planning: Choose GPU types based on your specific workload requirements
  2. Cost Optimization: Use auto-termination to prevent runaway costs
  3. Data Management: Plan storage requirements and backup strategies
  4. Security: Implement proper access controls and network security
  5. Monitoring: Set up alerts for cluster status and performance metrics
  6. Template Usage: Use templates for consistent, repeatable deployments

Authorizations

Authorization (string, header, required): API key authentication, sent as 'Bearer YOUR_API_KEY'.

Body

application/json: the cluster configuration object described under Request Body.

Response

Cluster created successfully; the response body is the object described under Response Schema.