Get Endpoint Health

Example request:

curl --request GET \
  --url https://api.tensorone.ai/v2/endpoints/{endpointId}/health \
  --header 'Authorization: Bearer <api-key>'

Example response:

{
  "status": "healthy",
  "lastChecked": "2023-11-07T05:31:56Z"
}

Monitor the health and readiness of your serverless endpoints. This endpoint returns real-time information about endpoint availability, resource health, and readiness to process requests.

Path Parameters

  • endpointId: The unique identifier of the endpoint whose health you want to check

Query Parameters

  • check: Type of health check (basic, detailed, or deep). Defaults to basic
  • timeout: Maximum time to wait for the health check, in seconds (1-30). Defaults to 10
  • include: Additional health metrics to include (dependencies, resources, connectivity)

Example Usage

Basic Health Check

curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/health" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

Detailed Health Check with Resources

curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/health?check=detailed&include=resources,dependencies" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

Deep Health Check with Full Diagnostics

curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/health?check=deep&include=resources,dependencies,connectivity&timeout=30" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

Batch Health Check for Multiple Endpoints

curl -X POST "https://api.tensorone.ai/v2/endpoints/health/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "endpointIds": [
      "ep_1234567890abcdef",
      "ep_2345678901bcdefg",
      "ep_3456789012cdefgh"
    ],
    "check": "detailed",
    "include": ["resources"]
  }'

Response

Healthy Endpoint Response

{
    "endpointId": "ep_1234567890abcdef",
    "status": "healthy",
    "readiness": "ready",
    "lastChecked": "2024-01-15T14:35:22Z",
    "responseTime": 87,
    "uptime": "72h 15m 30s",
    "version": "1.2.3",
    "checks": {
        "api": {
            "status": "healthy",
            "responseTime": 45,
            "lastError": null
        },
        "model": {
            "status": "healthy",
            "loadTime": 12.5,
            "memoryUsage": "8.2GB",
            "lastInference": "2024-01-15T14:34:15Z"
        },
        "dependencies": {
            "status": "healthy",
            "services": [
                {
                    "name": "model_storage",
                    "status": "healthy",
                    "responseTime": 23
                },
                {
                    "name": "result_cache",
                    "status": "healthy",
                    "responseTime": 12
                }
            ]
        }
    },
    "resources": {
        "gpu": {
            "status": "healthy",
            "utilization": 15,
            "memory": {
                "used": "2.1GB",
                "total": "40GB",
                "usage": 5.25
            },
            "temperature": 45,
            "errors": []
        },
        "cpu": {
            "status": "healthy",
            "utilization": 8,
            "cores": 16,
            "loadAverage": [0.5, 0.3, 0.2]
        },
        "memory": {
            "status": "healthy",
            "used": "12.8GB",
            "total": "64GB",
            "usage": 20
        },
        "storage": {
            "status": "healthy",
            "used": "45GB",
            "total": "500GB",
            "usage": 9,
            "iops": 150
        }
    },
    "metrics": {
        "requestsLastHour": 247,
        "averageLatency": 1.8,
        "errorRate": 0.2,
        "successRate": 99.8
    }
}
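
Below is a minimal sketch, outside the official SDKs, of how a client might consume this response over plain HTTP and gate traffic on it. The base URL matches the examples above; the environment variable name and endpoint ID are placeholders.

# Sketch: gate traffic on the health payload shown above (assumes the
# TENSORONE_API_KEY environment variable holds a valid API key).
import os
import requests

API_BASE = "https://api.tensorone.ai/v2"
API_KEY = os.environ.get("TENSORONE_API_KEY", "YOUR_API_KEY")

def is_servable(endpoint_id: str) -> bool:
    """Return True only if the endpoint reports healthy and ready."""
    resp = requests.get(
        f"{API_BASE}/endpoints/{endpoint_id}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"check": "basic"},
        timeout=15,
    )
    resp.raise_for_status()
    health = resp.json()
    return health["status"] == "healthy" and health["readiness"] == "ready"

if is_servable("ep_1234567890abcdef"):
    print("Routing traffic to endpoint")
else:
    print("Holding traffic until the endpoint recovers")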

Unhealthy Endpoint Response

{
    "endpointId": "ep_unhealthy_example",
    "status": "unhealthy",
    "readiness": "not_ready",
    "lastChecked": "2024-01-15T14:35:22Z",
    "responseTime": 5000,
    "uptime": "2h 45m 12s",
    "version": "1.2.3",
    "issues": [
        {
            "type": "resource_constraint",
            "severity": "high",
            "component": "gpu_memory",
            "message": "GPU memory usage at 98%, may cause OOM errors",
            "timestamp": "2024-01-15T14:33:45Z",
            "recommendation": "Reduce batch size or scale up to larger GPU"
        },
        {
            "type": "dependency_failure",
            "severity": "medium",
            "component": "model_storage",
            "message": "Model storage service responding slowly",
            "timestamp": "2024-01-15T14:32:10Z",
            "recommendation": "Check storage service health"
        }
    ],
    "checks": {
        "api": {
            "status": "healthy",
            "responseTime": 125,
            "lastError": null
        },
        "model": {
            "status": "degraded",
            "loadTime": 12.5,
            "memoryUsage": "39.2GB",
            "lastInference": "2024-01-15T14:34:15Z",
            "warnings": ["High memory usage approaching limit"]
        },
        "dependencies": {
            "status": "degraded",
            "services": [
                {
                    "name": "model_storage",
                    "status": "degraded",
                    "responseTime": 1250,
                    "error": "High latency detected"
                },
                {
                    "name": "result_cache",
                    "status": "healthy",
                    "responseTime": 45
                }
            ]
        }
    },
    "resources": {
        "gpu": {
            "status": "warning",
            "utilization": 95,
            "memory": {
                "used": "39.2GB",
                "total": "40GB",
                "usage": 98
            },
            "temperature": 82,
            "errors": [],
            "warnings": ["High memory usage", "Elevated temperature"]
        }
    },
    "recoveryActions": [
        {
            "action": "restart_endpoint",
            "description": "Restart endpoint to clear memory leaks",
            "estimated_downtime": "2-3 minutes"
        },
        {
            "action": "scale_resources",
            "description": "Scale to higher memory GPU",
            "estimated_cost_increase": "20%"
        }
    ]
}
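
When an endpoint reports issues, the payload above carries enough context to drive alerting. A small sketch, assuming health is the parsed JSON response from this endpoint:

# Sketch: triage the "issues" and "recoveryActions" fields of an unhealthy
# response. Executing recovery actions is left to your own tooling.
def triage(health: dict) -> None:
    for issue in health.get("issues", []):
        prefix = "ALERT" if issue["severity"] == "high" else "warn"
        print(f"[{prefix}] {issue['component']}: {issue['message']}")
        print(f"        recommendation: {issue['recommendation']}")

    for action in health.get("recoveryActions", []):
        print(f"suggested action: {action['action']} - {action['description']}")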

Endpoint Starting Response

{
    "endpointId": "ep_starting_example",
    "status": "starting",
    "readiness": "not_ready",
    "lastChecked": "2024-01-15T14:35:22Z",
    "responseTime": 0,
    "uptime": "0s",
    "version": "1.2.3",
    "startupProgress": {
        "phase": "loading_model",
        "progress": 65,
        "estimatedCompletion": "2024-01-15T14:37:00Z",
        "phases": [
            {
                "name": "container_startup",
                "status": "completed",
                "duration": 15.2
            },
            {
                "name": "dependency_loading",
                "status": "completed",
                "duration": 8.7
            },
            {
                "name": "loading_model",
                "status": "in_progress",
                "progress": 65,
                "estimatedRemaining": 45.3
            },
            {
                "name": "warmup_inference",
                "status": "pending",
                "estimatedDuration": 12.0
            }
        ]
    },
    "resources": {
        "gpu": {
            "status": "initializing",
            "utilization": 45,
            "memory": {
                "used": "8.2GB",
                "total": "40GB",
                "usage": 20.5
            }
        }
    }
}
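
While an endpoint is starting, the startupProgress block can be used to estimate how long to keep waiting. A sketch, using the field names from the example above:

# Sketch: sum the remaining time of in-progress and pending startup phases.
def estimated_seconds_remaining(health: dict) -> float:
    progress = health.get("startupProgress") or {}
    remaining = 0.0
    for phase in progress.get("phases", []):
        if phase["status"] == "in_progress":
            remaining += phase.get("estimatedRemaining", 0.0)
        elif phase["status"] == "pending":
            remaining += phase.get("estimatedDuration", 0.0)
    return remaining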

Health Status Values

Overall Status

  • healthy: Endpoint is fully operational and ready to serve requests
  • degraded: Endpoint is operational but experiencing issues that may affect performance
  • unhealthy: Endpoint has critical issues and may not process requests reliably
  • starting: Endpoint is starting up and not yet ready
  • stopped: Endpoint is intentionally stopped
  • error: Endpoint is in an error state and requires intervention

Readiness Status

  • ready: Endpoint can accept and process requests immediately
  • not_ready: Endpoint cannot process requests (starting, errors, resource issues)
  • warming_up: Endpoint is ready but may have increased latency due to cold start

Component Status

  • healthy: Component is operating normally
  • degraded: Component is functional but with reduced performance
  • unhealthy: Component has critical issues
  • failed: Component is not functioning
  • unknown: Component status cannot be determined
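
One way to act on these values is to translate them into traffic decisions, for example in a load balancer or router. The weights below are illustrative, not prescribed by the platform:

# Sketch: map overall status to a traffic weight (1.0 = full share).
ROUTING_WEIGHT = {
    "healthy": 1.0,
    "degraded": 0.5,   # keep serving, but shed load while performance is impaired
    "unhealthy": 0.0,  # drain traffic
    "starting": 0.0,   # wait for readiness
    "stopped": 0.0,
    "error": 0.0,
}

def traffic_weight(status: str) -> float:
    return ROUTING_WEIGHT.get(status, 0.0)  # unknown statuses get no traffic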

Health Check Types

Basic Health Check

  • API endpoint responsiveness
  • Basic resource availability
  • Service uptime

Detailed Health Check

  • All basic checks plus:
  • Resource utilization metrics
  • Dependency service status
  • Performance metrics
  • Recent error rates

Deep Health Check

  • All detailed checks plus:
  • Full system diagnostics
  • Connectivity tests to all dependencies
  • Model inference test
  • Storage and network I/O tests
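
Because deeper checks cost more, a common pattern is to run basic checks on every tick and escalate to a detailed check only when something looks off. A sketch, where get_health stands in for whichever client call you use (SDK or raw HTTP):

# Sketch: graduated health checking.
import time

def monitor(endpoint_id: str, get_health, interval: int = 30) -> None:
    while True:
        basic = get_health(endpoint_id, check="basic")
        if basic["status"] != "healthy":
            detailed = get_health(
                endpoint_id,
                check="detailed",
                include=["resources", "dependencies"],
            )
            print(f"{endpoint_id} is {detailed['status']}:")
            for issue in detailed.get("issues", []):
                print(f"  {issue['severity']}: {issue['message']}")
        time.sleep(interval)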

Readiness Probes

Kubernetes-Style Readiness

# Readiness probe endpoint for orchestration platforms
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/ready" \
  -H "Authorization: Bearer YOUR_API_KEY"

Readiness Response

{
    "ready": true,
    "checks": [
        {
            "name": "model_loaded",
            "status": "pass"
        },
        {
            "name": "resources_available",
            "status": "pass"
        },
        {
            "name": "dependencies_healthy",
            "status": "pass"
        }
    ],
    "readinessGates": {
        "model": true,
        "resources": true,
        "dependencies": true,
        "networking": true
    }
}
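
For orchestrators that run a probe command and act on its exit code, a thin wrapper over this endpoint is enough. A sketch; the environment variable names are placeholders:

# Sketch: readiness probe command. Exits 0 when ready, 1 otherwise.
import os
import sys
import requests

endpoint_id = os.environ["TENSORONE_ENDPOINT_ID"]
api_key = os.environ["TENSORONE_API_KEY"]

try:
    resp = requests.get(
        f"https://api.tensorone.ai/v2/endpoints/{endpoint_id}/ready",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=5,
    )
    resp.raise_for_status()
    sys.exit(0 if resp.json().get("ready") else 1)
except requests.RequestException:
    sys.exit(1)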

Liveness Probes

Basic Liveness Check

# Simple liveness probe
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/live" \
  -H "Authorization: Bearer YOUR_API_KEY"

Liveness Response

{
    "alive": true,
    "timestamp": "2024-01-15T14:35:22Z",
    "uptime": "72h 15m 30s",
    "version": "1.2.3"
}

Resource Health Monitoring

GPU Health Details

{
    "gpu": {
        "devices": [
            {
                "id": "gpu_0",
                "model": "NVIDIA A100-SXM4-40GB",
                "status": "healthy",
                "utilization": 25,
                "memory": {
                    "used": "8.2GB",
                    "total": "40GB",
                    "usage": 20.5
                },
                "temperature": 52,
                "power": {
                    "current": "180W",
                    "max": "400W",
                    "usage": 45
                },
                "errors": [],
                "warnings": [],
                "lastMaintenance": "2024-01-10T09:00:00Z"
            }
        ],
        "driver": {
            "version": "525.147.05",
            "status": "healthy"
        },
        "cuda": {
            "version": "12.2",
            "status": "healthy"
        }
    }
}
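
A sketch of walking the per-device GPU data above and flagging anything worth attention; the 80°C and 90% thresholds are examples rather than platform limits:

# Sketch: collect human-readable flags from the gpu block of a health response.
def flag_gpu_issues(gpu: dict) -> list:
    flags = []
    for device in gpu.get("devices", []):
        if device.get("temperature", 0) > 80:
            flags.append(f"{device['id']}: temperature {device['temperature']}°C")
        if device.get("memory", {}).get("usage", 0) > 90:
            flags.append(f"{device['id']}: memory {device['memory']['usage']}% used")
        flags.extend(f"{device['id']}: {warning}" for warning in device.get("warnings", []))
    return flags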

Storage Health Details

{
    "storage": {
        "volumes": [
            {
                "mount": "/models",
                "type": "ssd",
                "size": "500GB",
                "used": "45GB",
                "available": "455GB",
                "usage": 9,
                "iops": {
                    "read": 850,
                    "write": 450
                },
                "latency": {
                    "read": 0.8,
                    "write": 1.2
                },
                "status": "healthy"
            }
        ],
        "cache": {
            "size": "50GB",
            "used": "12GB",
            "hitRate": 94.5,
            "status": "healthy"
        }
    }
}
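
The storage block can be scanned the same way, for example to warn on low free space or a falling cache hit rate. The thresholds below are illustrative:

# Sketch: flag storage volumes above 85% usage and cache hit rates below 80%.
def flag_storage_issues(storage: dict) -> list:
    flags = []
    for volume in storage.get("volumes", []):
        if volume.get("usage", 0) > 85:
            flags.append(f"{volume['mount']}: {volume['usage']}% used")
    cache = storage.get("cache", {})
    if cache and cache.get("hitRate", 100) < 80:
        flags.append(f"cache hit rate low: {cache['hitRate']}%")
    return flags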

Error Handling

404 Endpoint Not Found

{
    "error": "ENDPOINT_NOT_FOUND",
    "message": "Endpoint ep_invalid does not exist",
    "details": {
        "endpointId": "ep_invalid",
        "suggestion": "Check endpoint ID or verify endpoint exists"
    }
}

503 Service Unavailable

{
    "error": "HEALTH_CHECK_FAILED",
    "message": "Health check service temporarily unavailable",
    "details": {
        "reason": "Health monitoring system overloaded",
        "retryAfter": 30,
        "fallbackStatus": "unknown"
    }
}

408 Timeout

{
    "error": "HEALTH_CHECK_TIMEOUT",
    "message": "Health check timed out after 30 seconds",
    "details": {
        "timeout": 30,
        "partialResults": {
            "api": "healthy",
            "model": "timeout",
            "dependencies": "unknown"
        },
        "recommendation": "Increase timeout or check endpoint performance"
    }
}
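
A client should treat these errors differently: 503 carries a retryAfter hint, and 408 suggests retrying with a larger timeout parameter. A sketch over plain HTTP:

# Sketch: retry health checks, honouring retryAfter on 503 and widening the
# check window on 408. Field names follow the error payloads above.
import time
import requests

def get_health_with_retries(endpoint_id: str, api_key: str, retries: int = 3) -> dict:
    url = f"https://api.tensorone.ai/v2/endpoints/{endpoint_id}/health"
    headers = {"Authorization": f"Bearer {api_key}"}
    timeout_param = 10
    for _ in range(retries):
        resp = requests.get(url, headers=headers,
                            params={"timeout": timeout_param}, timeout=timeout_param + 10)
        if resp.ok:
            return resp.json()
        body = resp.json()
        if resp.status_code == 503:
            time.sleep(body.get("details", {}).get("retryAfter", 30))
        elif resp.status_code == 408:
            timeout_param = min(timeout_param * 2, 30)
        elif resp.status_code == 404:
            raise ValueError(body.get("message", "Endpoint not found"))
        else:
            resp.raise_for_status()
    raise RuntimeError(f"Health check still failing after {retries} attempts")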

SDK Examples

Python SDK

from tensorone import TensorOneClient
import time
import asyncio
from datetime import datetime

client = TensorOneClient(api_key="your_api_key")

# Basic health check
def check_endpoint_health(endpoint_id):
    health = client.endpoints.get_health(endpoint_id)
    print(f"Endpoint {endpoint_id}: {health.status}")
    
    if health.status != "healthy":
        print("Issues found:")
        for issue in health.issues or []:
            print(f"  - {issue.severity}: {issue.message}")
    
    return health

# Detailed health monitoring
def detailed_health_check(endpoint_id):
    health = client.endpoints.get_health(
        endpoint_id,
        check="detailed",
        include=["resources", "dependencies", "connectivity"]
    )
    
    print(f"Endpoint Health Report for {endpoint_id}")
    print(f"Status: {health.status}")
    print(f"Ready: {health.readiness}")
    print(f"Uptime: {health.uptime}")
    print(f"Response Time: {health.response_time}ms")
    
    # Resource health
    if health.resources:
        gpu = health.resources.gpu
        print(f"GPU: {gpu.utilization}% utilized, {gpu.memory.usage}% memory")
        
        if gpu.warnings:
            print("GPU Warnings:", ", ".join(gpu.warnings))
    
    # Dependency health
    if health.checks.dependencies:
        print("Dependencies:")
        for service in health.checks.dependencies.services:
            print(f"  {service.name}: {service.status} ({service.response_time}ms)")
    
    return health

# Continuous health monitoring
async def monitor_endpoint_health(endpoint_id, interval=60):
    """Monitor endpoint health continuously"""
    while True:
        try:
            health = client.endpoints.get_health(
                endpoint_id,
                check="detailed",
                include=["resources"]
            )
            
            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            print(f"[{timestamp}] {endpoint_id}: {health.status}")
            
            # Alert on issues
            if health.status in ["unhealthy", "degraded"]:
                print("⚠️  Issues detected:")
                for issue in health.issues:
                    print(f"   {issue.severity}: {issue.message}")
                    
                # Check whether a restart is recommended
                if health.recovery_actions and any(
                    action.action == "restart_endpoint"
                    for action in health.recovery_actions
                ):
                    print("💡 Consider restarting endpoint")
            
            # Monitor resource usage
            if health.resources and health.resources.gpu:
                gpu = health.resources.gpu
                if gpu.memory.usage > 90:
                    print(f"🚨 High GPU memory usage: {gpu.memory.usage}%")
                if gpu.temperature > 80:
                    print(f"🌡️  High GPU temperature: {gpu.temperature}°C")
            
        except Exception as e:
            print(f"Health check failed: {e}")
        
        await asyncio.sleep(interval)

# Batch health checking
def check_multiple_endpoints(endpoint_ids):
    health_results = client.endpoints.get_batch_health(
        endpoint_ids=endpoint_ids,
        check="detailed",
        include=["resources"]
    )
    
    healthy_count = 0
    degraded_count = 0
    unhealthy_count = 0
    
    for health in health_results:
        if health.status == "healthy":
            healthy_count += 1
        elif health.status == "degraded":
            degraded_count += 1
        else:
            unhealthy_count += 1
    
    print(f"Health Summary:")
    print(f"  Healthy: {healthy_count}")
    print(f"  Degraded: {degraded_count}")
    print(f"  Unhealthy: {unhealthy_count}")
    
    return health_results

# Readiness waiting
def wait_for_endpoint_ready(endpoint_id, timeout=300):
    """Wait for endpoint to become ready with timeout"""
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        health = client.endpoints.get_health(endpoint_id)
        
        if health.readiness == "ready":
            print(f"Endpoint {endpoint_id} is ready!")
            return True
        elif health.status == "starting":
            progress = health.startup_progress
            if progress:
                print(f"Starting up: {progress.phase} ({progress.progress}%)")
        else:
            print(f"Current status: {health.status}")
        
        time.sleep(5)
    
    print(f"Timeout waiting for endpoint {endpoint_id} to become ready")
    return False

# Usage examples
if __name__ == "__main__":
    endpoint_id = "ep_1234567890abcdef"
    
    # Basic health check
    health = check_endpoint_health(endpoint_id)
    
    # Detailed health check
    detailed_health_check(endpoint_id)
    
    # Wait for readiness
    if wait_for_endpoint_ready(endpoint_id):
        print("Endpoint is ready for requests!")
    
    # Monitor multiple endpoints
    endpoints = ["ep_1234567890abcdef", "ep_2345678901bcdefg"]
    check_multiple_endpoints(endpoints)
    
    # Start continuous monitoring (uncomment to run)
    # asyncio.run(monitor_endpoint_health(endpoint_id))

JavaScript SDK

import { TensorOneClient } from "@tensorone/sdk";

const client = new TensorOneClient({ apiKey: "your_api_key" });

// Basic health check
async function checkEndpointHealth(endpointId) {
    const health = await client.endpoints.getHealth(endpointId);
    console.log(`Endpoint ${endpointId}: ${health.status}`);
    
    if (health.status !== "healthy") {
        console.log("Issues found:");
        health.issues?.forEach(issue => {
            console.log(`  - ${issue.severity}: ${issue.message}`);
        });
    }
    
    return health;
}

// Detailed health monitoring with real-time updates
async function monitorEndpointHealth(endpointId, options = {}) {
    const { interval = 60000, alertThresholds = {} } = options;
    
    const monitor = setInterval(async () => {
        try {
            const health = await client.endpoints.getHealth(endpointId, {
                check: "detailed",
                include: ["resources", "dependencies"]
            });
            
            const timestamp = new Date().toISOString();
            console.log(`[${timestamp}] ${endpointId}: ${health.status}`);
            
            // Resource monitoring with alerts
            if (health.resources?.gpu) {
                const gpu = health.resources.gpu;
                const memoryThreshold = alertThresholds.gpuMemory || 90;
                const tempThreshold = alertThresholds.gpuTemp || 80;
                
                if (gpu.memory.usage > memoryThreshold) {
                    console.warn(`🚨 High GPU memory: ${gpu.memory.usage}%`);
                }
                
                if (gpu.temperature > tempThreshold) {
                    console.warn(`🌡️ High GPU temp: ${gpu.temperature}°C`);
                }
            }
            
            // Performance monitoring
            if (health.metrics) {
                const errorRate = health.metrics.errorRate;
                const latency = health.metrics.averageLatency;
                
                if (errorRate > 5) {
                    console.warn(`📊 High error rate: ${errorRate}%`);
                }
                
                if (latency > 10) {
                    console.warn(`⏱️ High latency: ${latency}s`);
                }
            }
            
            // Dependency monitoring
            if (health.checks?.dependencies?.services) {
                const unhealthyDeps = health.checks.dependencies.services
                    .filter(service => service.status !== "healthy");
                
                if (unhealthyDeps.length > 0) {
                    console.warn("🔗 Unhealthy dependencies:");
                    unhealthyDeps.forEach(dep => {
                        console.warn(`   ${dep.name}: ${dep.status}`);
                    });
                }
            }
            
        } catch (error) {
            console.error(`Health check failed for ${endpointId}:`, error);
        }
    }, interval);
    
    return () => clearInterval(monitor);
}

// Readiness polling with async/await
async function waitForEndpointReady(endpointId, options = {}) {
    const { timeout = 300000, pollInterval = 5000 } = options;
    const startTime = Date.now();
    
    while (Date.now() - startTime < timeout) {
        try {
            const health = await client.endpoints.getHealth(endpointId);
            
            if (health.readiness === "ready") {
                console.log(`✅ Endpoint ${endpointId} is ready!`);
                return true;
            }
            
            if (health.status === "starting" && health.startupProgress) {
                const progress = health.startupProgress;
                console.log(`⏳ Starting: ${progress.phase} (${progress.progress}%)`);
            } else {
                console.log(`📊 Status: ${health.status}, Readiness: ${health.readiness}`);
            }
            
            await new Promise(resolve => setTimeout(resolve, pollInterval));
            
        } catch (error) {
            console.error(`Error checking readiness:`, error);
            await new Promise(resolve => setTimeout(resolve, pollInterval));
        }
    }
    
    console.error(`❌ Timeout waiting for ${endpointId} to become ready`);
    return false;
}

// Batch health checking with Promise.all
async function checkMultipleEndpoints(endpointIds, options = {}) {
    const { check = "basic", include = [] } = options;
    
    try {
        const healthChecks = await Promise.allSettled(
            endpointIds.map(id => 
                client.endpoints.getHealth(id, { check, include })
            )
        );
        
        const results = {
            healthy: 0,
            degraded: 0,
            unhealthy: 0,
            errors: 0
        };
        
        healthChecks.forEach((result, index) => {
            const endpointId = endpointIds[index];
            
            if (result.status === "fulfilled") {
                const health = result.value;
                results[health.status] = (results[health.status] || 0) + 1;
                
                console.log(`${endpointId}: ${health.status}`);
                
                if (health.status !== "healthy") {
                    health.issues?.forEach(issue => {
                        console.log(`  ⚠️ ${issue.severity}: ${issue.message}`);
                    });
                }
            } else {
                results.errors++;
                console.error(`${endpointId}: Error - ${result.reason}`);
            }
        });
        
        console.log("\n📊 Health Summary:");
        Object.entries(results).forEach(([status, count]) => {
            if (count > 0) {
                console.log(`  ${status}: ${count}`);
            }
        });
        
        return healthChecks;
        
    } catch (error) {
        console.error("Batch health check failed:", error);
        throw error;
    }
}

// Health-based auto-scaling trigger
async function autoScaleBasedOnHealth(endpointIds, options = {}) {
    const { 
        scaleUpThreshold = 80,  // GPU utilization %
        scaleDownThreshold = 20,
        minInstances = 1,
        maxInstances = 10
    } = options;
    
    const healthResults = await Promise.all(
        endpointIds.map(id => client.endpoints.getHealth(id, {
            check: "detailed",
            include: ["resources"]
        }))
    );
    
    let scaleRecommendations = [];
    
    healthResults.forEach(health => {
        if (!health.resources?.gpu) return;
        
        const gpuUtil = health.resources.gpu.utilization;
        const memUtil = health.resources.gpu.memory.usage;
        
        if (gpuUtil > scaleUpThreshold || memUtil > 90) {
            scaleRecommendations.push({
                endpointId: health.endpointId,
                action: "scale_up",
                reason: `High utilization: GPU ${gpuUtil}%, Memory ${memUtil}%`,
                priority: memUtil > 95 ? "high" : "medium"
            });
        } else if (gpuUtil < scaleDownThreshold && memUtil < 30) {
            scaleRecommendations.push({
                endpointId: health.endpointId,
                action: "scale_down",
                reason: `Low utilization: GPU ${gpuUtil}%, Memory ${memUtil}%`,
                priority: "low"
            });
        }
    });
    
    return scaleRecommendations;
}

// Usage examples
async function main() {
    const endpointId = "ep_1234567890abcdef";
    const endpointIds = ["ep_1234567890abcdef", "ep_2345678901bcdefg"];
    
    try {
        // Basic health check
        await checkEndpointHealth(endpointId);
        
        // Wait for readiness
        const isReady = await waitForEndpointReady(endpointId);
        if (isReady) {
            console.log("Endpoint is ready for requests!");
        }
        
        // Check multiple endpoints
        await checkMultipleEndpoints(endpointIds, {
            check: "detailed",
            include: ["resources"]
        });
        
        // Auto-scaling recommendations
        const scaleRecs = await autoScaleBasedOnHealth(endpointIds);
        if (scaleRecs.length > 0) {
            console.log("Scaling recommendations:");
            scaleRecs.forEach(rec => {
                console.log(`  ${rec.endpointId}: ${rec.action} - ${rec.reason}`);
            });
        }
        
        // Start continuous monitoring (uncomment to run)
        // const stopMonitoring = await monitorEndpointHealth(endpointId, {
        //     interval: 30000,
        //     alertThresholds: { gpuMemory: 85, gpuTemp: 75 }
        // });
        
        // Stop monitoring after 5 minutes
        // setTimeout(stopMonitoring, 5 * 60 * 1000);
        
    } catch (error) {
        console.error("Health monitoring error:", error);
    }
}

main();

Use Cases

Production Monitoring

  • Service Reliability: Monitor endpoint health in production environments
  • Automated Alerts: Set up alerting based on health status changes
  • Load Balancing: Route traffic away from unhealthy endpoints
  • Capacity Planning: Monitor resource utilization trends

CI/CD Integration

  • Deployment Validation: Verify endpoint health after deployments (a sketch follows this list)
  • Rollback Triggers: Automatically rollback on health failures
  • Readiness Gates: Wait for endpoints to be ready before promoting traffic
  • Health-based Testing: Run tests only when endpoints are healthy
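
A deployment-gate sketch built on the Python SDK calls shown earlier: block until the endpoint is healthy and ready, then run a caller-supplied smoke test before promoting traffic. run_smoke_test is a placeholder for your own test.

# Sketch: CI/CD readiness gate. client is a TensorOneClient as in the SDK
# examples above; run_smoke_test(endpoint_id) should return True on success.
import time

def validate_deployment(client, endpoint_id, run_smoke_test, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        health = client.endpoints.get_health(endpoint_id)
        if health.status == "healthy" and health.readiness == "ready":
            return run_smoke_test(endpoint_id)  # True -> promote, False -> roll back
        time.sleep(10)
    return False  # never became ready: treat as a failed deployment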

Auto-scaling and Orchestration

  • Kubernetes Integration: Use as readiness and liveness probes
  • Auto-scaling Triggers: Scale based on resource health metrics
  • Failover Systems: Detect failures and switch to backup endpoints
  • Maintenance Windows: Schedule maintenance based on health patterns

Development and Debugging

  • Performance Optimization: Identify performance bottlenecks
  • Resource Monitoring: Track resource usage during development
  • Dependency Validation: Ensure all dependencies are healthy
  • Cold Start Analysis: Monitor startup performance and optimization

Best Practices

Health Check Strategy

  • Regular Monitoring: Implement regular health checks with appropriate intervals
  • Graduated Checks: Use basic checks for frequent monitoring, detailed for diagnostics
  • Timeout Management: Set appropriate timeouts based on expected response times
  • Error Handling: Implement graceful handling of health check failures

Performance Considerations

  • Check Frequency: Balance monitoring frequency with system load
  • Batch Operations: Use batch health checks for multiple endpoints
  • Caching: Cache health results for non-critical monitoring (see the sketch after this list)
  • Selective Inclusion: Only request detailed metrics when needed
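
A sketch of the caching point above: reuse a recent health result for a short TTL so dashboards and other non-critical consumers do not hammer the API. fetch_health stands in for whichever call you use (SDK or raw HTTP).

# Sketch: TTL cache for health results.
import time

_health_cache = {}

def cached_health(endpoint_id, fetch_health, ttl=30.0):
    now = time.time()
    hit = _health_cache.get(endpoint_id)
    if hit and now - hit[0] < ttl:
        return hit[1]
    health = fetch_health(endpoint_id)
    _health_cache[endpoint_id] = (now, health)
    return health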

Alert Configuration

  • Threshold Setting: Set appropriate thresholds for different severity levels
  • Alert Fatigue: Prevent alert fatigue with intelligent alerting
  • Escalation Paths: Define clear escalation procedures for different issue types
  • Recovery Actions: Implement automated recovery actions where appropriate

Integration Patterns

  • Circuit Breakers: Use health status to trigger circuit breaker patterns (a sketch follows this list)
  • Service Mesh: Integrate with service mesh health checking
  • Monitoring Tools: Export health metrics to monitoring and observability tools
  • Documentation: Document health check interpretations and response procedures
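
A minimal circuit-breaker sketch driven by health status, as mentioned in the first item above: open the breaker after a run of unhealthy results, then probe again after a cool-down. fetch_health stands in for an SDK or raw-HTTP call returning the health payload.

# Sketch: health-driven circuit breaker.
import time

class HealthCircuitBreaker:
    def __init__(self, fetch_health, failures_to_open=3, cooldown=60.0):
        self.fetch_health = fetch_health
        self.failures_to_open = failures_to_open
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow_request(self, endpoint_id):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                return False               # breaker open: short-circuit
            self.opened_at = None          # cool-down elapsed: half-open probe
        health = self.fetch_health(endpoint_id)
        status = health.get("status") if isinstance(health, dict) else health.status
        if status == "healthy":
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.failures_to_open:
            self.opened_at = time.time()
        return False
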
Health checks are cached for 30 seconds to reduce system load. For real-time status updates, use the streaming endpoints or webhook notifications.
Deep health checks consume more resources and should be used sparingly in production. Use basic or detailed checks for regular monitoring.
Set up automated recovery actions based on health status to reduce manual intervention and improve system reliability. Consider implementing circuit breaker patterns for improved resilience.

Authorizations

  • Authorization (string, header, required): API key authentication. Use 'Bearer YOUR_API_KEY' format.

Path Parameters

  • endpointId (string, required): The unique identifier of the endpoint.

Response

  • 200 - application/json: Endpoint health status. The response is of type object.