Update the configuration of an existing serverless endpoint to modify scaling parameters, worker count, environment variables, and other settings.
Path Parameters
endpointId
: The unique identifier of the endpoint to update
Request Body
{
"workerCount": 3,
"maxConcurrency": 15,
"timeoutSeconds": 600,
"environmentVariables": {
"MODEL_PRECISION": "fp16",
"BATCH_SIZE": "4",
"CACHE_SIZE": "2048"
},
"autoScaling": {
"enabled": true,
"minWorkers": 1,
"maxWorkers": 10,
"targetUtilization": 70
}
}
Example Usage
Update Worker Count and Timeout
curl -X PUT "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"workerCount": 5,
"timeoutSeconds": 900,
"maxConcurrency": 20
}'
Enable Auto-Scaling
curl -X PUT "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"autoScaling": {
"enabled": true,
"minWorkers": 2,
"maxWorkers": 8,
"targetUtilization": 75,
"scaleUpCooldown": 60,
"scaleDownCooldown": 300
}
}'
Update Environment Variables
curl -X PUT "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"environmentVariables": {
"MODEL_VERSION": "v2.1",
"ENABLE_LOGGING": "true",
"MAX_BATCH_SIZE": "8"
}
}'
Response
Returns the updated endpoint configuration:
{
"id": "ep_1234567890abcdef",
"name": "my-text-generation-model",
"status": "updating",
"configuration": {
"workerCount": 5,
"maxConcurrency": 20,
"timeoutSeconds": 900,
"environmentVariables": {
"MODEL_VERSION": "v2.1",
"ENABLE_LOGGING": "true",
"MAX_BATCH_SIZE": "8"
},
"autoScaling": {
"enabled": true,
"minWorkers": 2,
"maxWorkers": 8,
"targetUtilization": 75,
"scaleUpCooldown": 60,
"scaleDownCooldown": 300
}
},
"updatedAt": "2024-01-15T14:30:00Z"
}
Configuration Parameters
Scaling Configuration
workerCount
: Number of workers (1-50)
maxConcurrency
: Maximum concurrent requests per worker
timeoutSeconds
: Request timeout in seconds (10-3600)
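As a worked example, workerCount: 5 with maxConcurrency: 20 caps the endpoint at roughly 100 in-flight requests (5 workers × 20 requests each), assuming traffic spreads evenly across workers.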
Auto-Scaling Configuration
enabled
: Enable/disable auto-scaling
minWorkers
: Minimum number of workers
maxWorkers
: Maximum number of workers
targetUtilization
: Target CPU utilization percentage (10-90)
scaleUpCooldown
: Minimum wait before another scale-up action (seconds)
scaleDownCooldown
: Minimum wait before another scale-down action (seconds)
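For intuition, target-utilization autoscalers commonly size the worker pool proportionally to observed load. The Python sketch below illustrates that general pattern; it is not TensorOne's documented algorithm, and the function name and inputs are hypothetical.
import math

def desired_workers(current_workers: int, utilization_pct: float,
                    target_pct: float, min_workers: int, max_workers: int) -> int:
    # Proportional rule used by many target-utilization autoscalers
    # (e.g. the Kubernetes HPA); illustrative, not TensorOne's documented algorithm.
    raw = math.ceil(current_workers * utilization_pct / target_pct)
    return max(min_workers, min(max_workers, raw))

# With targetUtilization 75: 4 workers at 90% load -> ceil(4 * 90 / 75) = 5 workers.
print(desired_workers(4, 90, 75, min_workers=2, max_workers=8))  # prints 5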
Environment Variables
- Custom key-value pairs passed to your model runtime
- Useful for model configuration, feature flags, and runtime parameters (see the sketch after this list)
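For instance, a handler running in a Python worker could read these values from the process environment at startup. This is a minimal sketch; the variable names mirror the example request body above, and the defaults are illustrative.
import os

# Values injected via environmentVariables appear in the process environment.
# Names mirror the example request body; the fallback defaults are illustrative.
MODEL_PRECISION = os.environ.get("MODEL_PRECISION", "fp32")
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "1"))
CACHE_SIZE = int(os.environ.get("CACHE_SIZE", "1024"))

print(f"precision={MODEL_PRECISION} batch={BATCH_SIZE} cache={CACHE_SIZE}")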
Error Handling
400 Bad Request
{
"error": "INVALID_CONFIGURATION",
"message": "Worker count exceeds maximum limit",
"details": {
"field": "workerCount",
"value": 100,
"maximum": 50
}
}
409 Conflict
{
"error": "ENDPOINT_BUSY",
"message": "Cannot update endpoint while executing requests",
"details": {
"activeRequests": 15,
"suggestion": "Wait for active requests to complete or force update"
}
}
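A client can branch on these status codes when applying updates. The sketch below calls the raw HTTP API with Python's requests library and retries while the endpoint reports 409 ENDPOINT_BUSY; the retry count and wait interval are illustrative choices, not documented behavior.
import time
import requests

URL = "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

def update_endpoint(body: dict, attempts: int = 5, wait_seconds: int = 30) -> dict:
    for _ in range(attempts):
        resp = requests.put(URL, headers=HEADERS, json=body)
        if resp.status_code == 409:      # ENDPOINT_BUSY: wait for active requests to drain
            time.sleep(wait_seconds)
            continue
        resp.raise_for_status()          # raises on 400 INVALID_CONFIGURATION and other errors
        return resp.json()
    raise TimeoutError("endpoint stayed busy; retry during a low-traffic window")

endpoint = update_endpoint({"workerCount": 5})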
SDK Examples
Python SDK
from tensorone import TensorOneClient
client = TensorOneClient(api_key="your_api_key")
# Update worker configuration
endpoint = client.endpoints.update(
endpoint_id="ep_1234567890abcdef",
worker_count=5,
max_concurrency=20,
timeout_seconds=900
)
# Enable auto-scaling
endpoint = client.endpoints.update(
endpoint_id="ep_1234567890abcdef",
auto_scaling={
"enabled": True,
"min_workers": 2,
"max_workers": 8,
"target_utilization": 75
}
)
# Update environment variables
endpoint = client.endpoints.update(
endpoint_id="ep_1234567890abcdef",
environment_variables={
"MODEL_VERSION": "v2.1",
"ENABLE_CACHING": "true"
}
)
JavaScript SDK
import { TensorOneClient } from "@tensorone/sdk";
const client = new TensorOneClient({ apiKey: "your_api_key" });
// Update scaling configuration
const endpoint = await client.endpoints.update("ep_1234567890abcdef", {
workerCount: 5,
maxConcurrency: 20,
timeoutSeconds: 900,
});
// Enable auto-scaling with custom parameters
const scaledEndpoint = await client.endpoints.update("ep_1234567890abcdef", {
autoScaling: {
enabled: true,
minWorkers: 2,
maxWorkers: 8,
targetUtilization: 75,
scaleUpCooldown: 60,
scaleDownCooldown: 300,
},
});
Best Practices
- Worker Count: Start with 2-3 workers and scale based on demand
- Concurrency: Set maxConcurrency based on your model’s memory requirements
- Timeouts: Use shorter timeouts for interactive applications, longer for batch processing
- Auto-Scaling: Enable for variable workloads to optimize costs
Cost Optimization
- Right-Sizing: Monitor utilization and adjust worker count accordingly (see the sketch after this list)
- Auto-Scaling: Use to automatically scale down during low traffic periods
- Environment Variables: Use to enable/disable expensive features dynamically
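As one way to act on the right-sizing advice above, the sketch below recomputes a worker count from an observed utilization figure and applies it through the Python SDK. How you collect the utilization metric is left open, and the 70% target is an illustrative default.
import math
from tensorone import TensorOneClient

client = TensorOneClient(api_key="your_api_key")

def right_size(endpoint_id: str, current_workers: int,
               observed_utilization_pct: float, target_pct: float = 70) -> None:
    # Proportional right-sizing, clamped to the documented 1-50 worker range.
    desired = max(1, min(50, math.ceil(current_workers * observed_utilization_pct / target_pct)))
    if desired != current_workers:
        client.endpoints.update(endpoint_id=endpoint_id, worker_count=desired)

# Example: 3 workers running at 95% utilization -> scale to ceil(3 * 95 / 70) = 5.
right_size("ep_1234567890abcdef", current_workers=3, observed_utilization_pct=95)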
Deployment Strategy
- Gradual Updates: Update configuration during low-traffic periods
- Testing: Test configuration changes on development endpoints first
- Monitoring: Monitor metrics after updates to ensure desired performance
Configuration updates are applied gradually to minimize service disruption. The endpoint status will show updating
during the transition period.
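If a deployment script needs to block until the transition finishes, it can poll the endpoint until the status leaves updating. The sketch below assumes a GET on the endpoint URL returns the same status field shown in the response above; that assumption and the poll interval are illustrative, not documented behavior.
import time
import requests

URL = "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Poll until the endpoint leaves "updating". Assumes GET on the endpoint URL
# returns the same "status" field as the update response (an assumption).
for _ in range(60):                      # cap at ~10 minutes
    status = requests.get(URL, headers=HEADERS).json()["status"]
    if status != "updating":
        print(f"endpoint settled in status: {status}")
        break
    time.sleep(10)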
Reducing worker count may temporarily increase latency as traffic redistributes. Plan updates during low-traffic
periods.
Use environment variables to enable A/B testing by toggling features without redeploying your model.
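A minimal version of that A/B toggle with the Python SDK might look like this; ENABLE_NEW_SAMPLER is a hypothetical flag that your model runtime would check.
from tensorone import TensorOneClient

client = TensorOneClient(api_key="your_api_key")

# Flip a feature flag for an A/B test without redeploying the model.
# "ENABLE_NEW_SAMPLER" is a hypothetical flag your runtime would read.
client.endpoints.update(
    endpoint_id="ep_1234567890abcdef",
    environment_variables={"ENABLE_NEW_SAMPLER": "true"},
)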