Overview
The Restart Cluster endpoint allows you to restart running or stopped GPU clusters, optionally applying configuration updates during the restart process. This is useful for applying system updates, changing configurations, or recovering from errors while preserving data and work state.
Endpoint
Path Parameters
Parameter | Type | Required | Description |
---|---|---|---|
cluster_id | string | Yes | Unique cluster identifier |
Request Body
Parameter | Type | Required | Description |
---|---|---|---|
force | boolean | No | Force restart without graceful shutdown (default: false) |
grace_period_minutes | integer | No | Grace period for graceful shutdown (default: 5, max: 30) |
wait_for_ready | boolean | No | Wait for cluster to be fully ready after restart (default: false) |
timeout_minutes | integer | No | Maximum wait time for completion (default: 15, max: 60) |
preserve_state | boolean | No | Preserve running processes and state (default: false) |
update_configuration | object | No | Configuration updates to apply during restart |
environment_updates | object | No | Environment variable updates |
port_mapping_updates | array | No | Port mapping changes |
restart_reason | string | No | Reason for restart (for audit logs) |
restore_from_snapshot | string | No | Snapshot ID to restore from during restart |
update_docker_image | string | No | New Docker image to use after restart |
apply_system_updates | boolean | No | Apply pending system updates (default: false) |
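The defaults and maximums in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_restart_body` is a hypothetical helper, and the lower bounds it enforces are an assumption, since the table documents only defaults and maximums.

```python
from typing import Any, Dict, Optional

# Maximums taken from the request body table.
MAX_GRACE_PERIOD_MINUTES = 30
MAX_TIMEOUT_MINUTES = 60


def build_restart_body(
    force: bool = False,
    grace_period_minutes: int = 5,
    wait_for_ready: bool = False,
    timeout_minutes: int = 15,
    preserve_state: bool = False,
    restart_reason: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate documented limits and return a JSON-ready dict."""
    # Assumption: negative grace periods are rejected; only the maximum is documented.
    if not 0 <= grace_period_minutes <= MAX_GRACE_PERIOD_MINUTES:
        raise ValueError(
            f"grace_period_minutes must be between 0 and {MAX_GRACE_PERIOD_MINUTES}"
        )
    # Assumption: the timeout must be positive; only the maximum is documented.
    if not 0 < timeout_minutes <= MAX_TIMEOUT_MINUTES:
        raise ValueError(
            f"timeout_minutes must be between 1 and {MAX_TIMEOUT_MINUTES}"
        )

    body: Dict[str, Any] = {
        "force": force,
        "grace_period_minutes": grace_period_minutes,
        "wait_for_ready": wait_for_ready,
        "timeout_minutes": timeout_minutes,
        "preserve_state": preserve_state,
    }
    # Optional fields are only included when the caller supplies them.
    if restart_reason is not None:
        body["restart_reason"] = restart_reason
    return body
```

Calling `build_restart_body(grace_period_minutes=45)` raises `ValueError` locally instead of relying on the server to reject the value.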
Configuration Updates
Request Examples
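The original request samples are not available on this page, so the sketch below is illustrative only. The base URL, the path, and the POST method are assumptions, and the image name and snapshot ID are placeholders. It builds the request without sending it, and assumes the body fields are sent at the top level, as listed in the Request Body table.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict

# Assumptions: placeholder base URL and path; the real endpoint is not shown here.
BASE_URL = "https://api.example.com"
RESTART_PATH = "/clusters/{cluster_id}/restart"

# Example bodies for the three use cases described on this page.
SYSTEM_UPDATE_BODY = {
    "apply_system_updates": True,
    "grace_period_minutes": 10,
    "restart_reason": "Apply pending security patches",
}
IMAGE_DEPLOY_BODY = {
    "update_docker_image": "registry.example.com/model-server:2.0",  # placeholder
    "wait_for_ready": True,
    "timeout_minutes": 30,
    "restart_reason": "Deploy new model version",
}
RECOVERY_BODY = {
    "force": True,
    "restore_from_snapshot": "snapshot-id-placeholder",  # placeholder
    "restart_reason": "Recover from unresponsive state",
}


def make_restart_request(
    cluster_id: str, api_key: str, body: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a restart request with Bearer authentication."""
    path = RESTART_PATH.format(cluster_id=urllib.parse.quote(cluster_id, safe=""))
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",  # assumption: the HTTP method is not stated on this page
    )
```

To send a request, pass the result to `urllib.request.urlopen(req)` and decode the JSON response.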
Response Schema
Restart Progress Phases
Phase | Description | Typical Duration |
---|---|---|
stopping_processes | Gracefully stopping running processes | 1-10 minutes |
creating_checkpoint | Creating state checkpoint (if preserve_state=true) | 30s-5 minutes |
updating_configuration | Applying hardware/software updates | 1-3 minutes |
starting_services | Starting system services and Docker containers | 30s-2 minutes |
health_checks | Running health checks and validation | 30s-1 minute |
ready | Cluster is fully operational | - |
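A client can follow these phases by polling until the cluster reports ready. The sketch below assumes the current phase name can be read from some status call. That call is passed in as `fetch_phase` because this page does not show how phases are exposed, and `wait_until_ready` is a hypothetical helper. The default timeout mirrors the 15-minute default of timeout_minutes.

```python
import time
from typing import Callable, List

# Phase names from the table above.
RESTART_PHASES = (
    "stopping_processes",
    "creating_checkpoint",
    "updating_configuration",
    "starting_services",
    "health_checks",
    "ready",
)


def wait_until_ready(
    fetch_phase: Callable[[], str],
    timeout_seconds: float = 900.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll fetch_phase until it returns 'ready'; return the distinct phases seen."""
    deadline = clock() + timeout_seconds
    seen: List[str] = []
    while True:
        phase = fetch_phase()
        if phase not in RESTART_PHASES:
            raise ValueError(f"unknown restart phase: {phase!r}")
        # Record a phase only when it changes, so repeated polls do not duplicate it.
        if not seen or seen[-1] != phase:
            seen.append(phase)
        if phase == "ready":
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"cluster not ready; last phase was {phase!r}")
        sleep(poll_interval)
```

The function does not require every phase to appear, so a restart that skips creating_checkpoint (when preserve_state is false) is handled the same way.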
Use Cases
System Updates and Maintenance
Apply system updates and security patches with minimal downtime.
Model Version Deployment
Deploy new model versions with configuration updates.
Recovery from Errors
Restart clusters to recover from errors with optional state restoration.
Error Handling
Security Considerations
- State Preservation: Be cautious when preserving state during security updates
- Configuration Validation: Validate all configuration changes before restart
- Access Control: Ensure proper permissions for configuration modifications
- Audit Logging: Log restart reasons and configuration changes for compliance
Best Practices
- Graceful Restarts: Use graceful restarts unless emergency recovery is needed
- Configuration Testing: Test configuration changes in development first
- State Management: Use snapshots for critical workloads before major changes
- Monitoring: Monitor restart progress and validate successful completion
- Rollback Planning: Have rollback procedures ready for failed deployments
- Resource Planning: Consider resource availability during restart operations
Authorizations
API key authentication. Send the key in the Authorization header using the format 'Bearer YOUR_API_KEY'.
Response
200 - application/json
Cluster restart initiated
The response is of type object.