Path Parameters
- endpointId: The unique identifier of the endpoint to retrieve logs for.

Query Parameters
- startTime: Start time for log retrieval (ISO 8601 format, e.g., 2024-01-15T10:30:00Z).
- endTime: End time for log retrieval (ISO 8601 format). Defaults to the current time.
- level: Log level filter (debug, info, warn, error, fatal). Defaults to info.
- limit: Maximum number of log entries to return (1-10000). Defaults to 1000.
- offset: Number of log entries to skip for pagination. Defaults to 0.
- jobId: Filter logs for a specific job execution.
- search: Search term to filter log messages (supports regex).
- format: Response format (json, text, csv). Defaults to json.
- stream: Enable real-time log streaming (true, false). Defaults to false.
Example Usage
Basic Log Retrieval
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Filtered Logs with Time Range
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?startTime=2024-01-15T10:00:00Z&endTime=2024-01-15T11:00:00Z&level=error" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Job-Specific Logs
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?jobId=job_1234567890abcdef&level=debug" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Search Logs with Pattern
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=memory.*error&level=warn" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Real-time Log Streaming
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?stream=true" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Accept: text/event-stream"
Export Logs in CSV Format
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?format=csv&startTime=2024-01-15T00:00:00Z&endTime=2024-01-15T23:59:59Z" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Accept: text/csv"
Response
JSON Format Response
{
"endpointId": "ep_1234567890abcdef",
"totalLogs": 1247,
"filteredLogs": 856,
"startTime": "2024-01-15T10:00:00Z",
"endTime": "2024-01-15T11:00:00Z",
"pagination": {
"limit": 1000,
"offset": 0,
"hasMore": false,
"nextOffset": null
},
"logs": [
{
"timestamp": "2024-01-15T10:30:15.234Z",
"level": "info",
"message": "Processing image generation request",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "model_inference",
"metadata": {
"prompt": "A sunset over mountains",
"model": "stable-diffusion-xl",
"parameters": {
"steps": 30,
"guidance_scale": 7.5,
"width": 1024,
"height": 1024
}
},
"duration": null,
"memoryUsage": "8.2GB",
"gpuUtilization": 87
},
{
"timestamp": "2024-01-15T10:30:18.567Z",
"level": "debug",
"message": "Model loaded successfully",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "model_loader",
"metadata": {
"modelPath": "/models/stable-diffusion-xl/model.safetensors",
"loadTime": 3.2,
"modelSize": "6.9GB",
"precision": "fp16"
},
"duration": 3.2,
"memoryUsage": "6.9GB",
"gpuUtilization": 45
},
{
"timestamp": "2024-01-15T10:30:22.891Z",
"level": "info",
"message": "Image generation completed",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "model_inference",
"metadata": {
"outputPath": "/tmp/generated_image_abc123.png",
"generationTime": 4.3,
"seed": 42,
"finalSteps": 30
},
"duration": 4.3,
"memoryUsage": "8.2GB",
"gpuUtilization": 92
},
{
"timestamp": "2024-01-15T10:30:25.123Z",
"level": "warn",
"message": "High GPU memory usage detected",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "resource_monitor",
"metadata": {
"currentUsage": "38.5GB",
"totalMemory": "40GB",
"utilizationPercent": 96.25,
"recommendation": "Consider reducing batch size or image resolution"
},
"duration": null,
"memoryUsage": "38.5GB",
"gpuUtilization": 96
},
{
"timestamp": "2024-01-15T10:30:28.456Z",
"level": "error",
"message": "Failed to upload result to storage",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "storage_uploader",
"metadata": {
"error": "Connection timeout after 30 seconds",
"errorCode": "STORAGE_TIMEOUT",
"retryAttempt": 1,
"maxRetries": 3,
"filePath": "/tmp/generated_image_abc123.png",
"fileSize": "2.4MB"
},
"duration": 30.0,
"stackTrace": [
"at StorageUploader.upload (storage.js:45:12)",
"at async ImageProcessor.saveResult (processor.js:128:8)",
"at async handleRequest (handler.js:67:4)"
]
}
],
"summary": {
"logLevels": {
"debug": 234,
"info": 456,
"warn": 123,
"error": 43,
"fatal": 0
},
"sources": {
"model_inference": 245,
"model_loader": 67,
"resource_monitor": 156,
"storage_uploader": 89,
"api_handler": 299
},
"commonErrors": [
{
"error": "STORAGE_TIMEOUT",
"count": 12,
"firstOccurrence": "2024-01-15T10:15:30Z",
"lastOccurrence": "2024-01-15T10:55:12Z"
},
{
"error": "MEMORY_LIMIT_EXCEEDED",
"count": 8,
"firstOccurrence": "2024-01-15T10:22:45Z",
"lastOccurrence": "2024-01-15T10:48:30Z"
}
]
}
}
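The pagination object above drives paging: when hasMore is true, send nextOffset back as the offset query parameter on the next request. A minimal paging sketch using the Python SDK shown later on this page (the pagination attribute names are assumptions mapped from the JSON fields above):

from tensorone import TensorOneClient

client = TensorOneClient(api_key="your_api_key")

def fetch_all_logs(endpoint_id, **filters):
    """Follow pagination.nextOffset until hasMore is false and collect every entry."""
    collected, offset = [], 0
    while True:
        page = client.endpoints.get_logs(endpoint_id=endpoint_id, offset=offset, limit=1000, **filters)
        collected.extend(page.logs)
        if not page.pagination.has_more:
            return collected
        offset = page.pagination.next_offset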
Error Logs with Stack Traces
{
"timestamp": "2024-01-15T10:35:42.789Z",
"level": "error",
"message": "Model inference failed with CUDA out of memory error",
"jobId": "job_error_example",
"requestId": "req_error_456",
"source": "model_inference",
"metadata": {
"error": "CUDA out of memory. Tried to allocate 2.50 GiB",
"errorCode": "CUDA_OOM",
"gpuMemoryUsed": "39.2GB",
"gpuMemoryTotal": "40GB",
"requestedAllocation": "2.5GB",
"model": "llama-2-70b",
"batchSize": 4,
"sequenceLength": 2048
},
"stackTrace": [
"RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB",
" at torch.cuda.OutOfMemoryError",
" at model_inference.py:156 in forward()",
" at inference_handler.py:89 in process_batch()",
" at main.py:45 in handle_request()"
],
"context": {
"previousRequests": [
{
"requestId": "req_456789",
"memoryUsage": "35.8GB",
"status": "completed"
},
{
"requestId": "req_567890",
"memoryUsage": "37.1GB",
"status": "completed"
}
],
"systemState": {
"availableMemory": "0.8GB",
"activeProcesses": 3,
"cacheSize": "12.4GB"
}
},
"recoveryActions": [
{
"type": "memory_cleanup",
"description": "Clear model cache and retry",
"executed": true,
"result": "freed 8.2GB memory"
},
{
"type": "batch_size_reduction",
"description": "Reduce batch size from 4 to 2",
"executed": true,
"result": "retry successful"
}
]
}
Performance Logs
{
"timestamp": "2024-01-15T10:40:15.123Z",
"level": "info",
"message": "Request processing completed",
"jobId": "job_perf_example",
"requestId": "req_perf_789",
"source": "performance_tracker",
"metadata": {
"totalDuration": 12.5,
"phases": {
"queueTime": 0.2,
"coldStartTime": 0.0,
"modelLoadTime": 0.0,
"inferenceTime": 11.8,
"postProcessingTime": 0.3,
"uploadTime": 0.2
},
"resourceUsage": {
"peakGpuMemory": "32.1GB",
"peakGpuUtilization": 94,
"avgCpuUsage": 45,
"networkIO": {
"ingress": "125MB",
"egress": "8.2MB"
}
},
"optimizations": {
"cacheHit": true,
"modelReused": true,
"batchProcessed": false
}
},
"benchmarks": {
"targetLatency": 10.0,
"actualLatency": 12.5,
"performance": "within_sla",
"percentile": "p85"
}
}
Text Format Response
2024-01-15T10:30:15.234Z [INFO] model_inference: Processing image generation request (job_1234567890abcdef)
2024-01-15T10:30:18.567Z [DEBUG] model_loader: Model loaded successfully in 3.2s (job_1234567890abcdef)
2024-01-15T10:30:22.891Z [INFO] model_inference: Image generation completed in 4.3s (job_1234567890abcdef)
2024-01-15T10:30:25.123Z [WARN] resource_monitor: High GPU memory usage detected: 96.25% (job_1234567890abcdef)
2024-01-15T10:30:28.456Z [ERROR] storage_uploader: Failed to upload result to storage: Connection timeout (job_1234567890abcdef)
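If you consume the text format programmatically, each line carries the same fields in a fixed layout: timestamp, bracketed level, source, message, and the job ID in parentheses. A small parsing sketch, assuming every line follows the layout shown above:

import re

# timestamp [LEVEL] source: message (jobId)
LINE_RE = re.compile(
    r'^(?P<timestamp>\S+) \[(?P<level>[A-Z]+)\] (?P<source>\w+): (?P<message>.*) \((?P<job_id>\w+)\)$'
)

def parse_text_log_line(line):
    """Return the fields of one text-format log line as a dict, or None if the line doesn't match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None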
Real-time Streaming Response
data: {"timestamp":"2024-01-15T10:45:00.123Z","level":"info","message":"New request received","jobId":"job_live_stream","source":"api_handler"}
data: {"timestamp":"2024-01-15T10:45:01.456Z","level":"debug","message":"Loading model weights","jobId":"job_live_stream","source":"model_loader"}
data: {"timestamp":"2024-01-15T10:45:05.789Z","level":"info","message":"Model ready for inference","jobId":"job_live_stream","source":"model_loader"}
data: {"timestamp":"2024-01-15T10:45:08.012Z","level":"info","message":"Processing completed successfully","jobId":"job_live_stream","source":"model_inference"}
Log Levels
Level Hierarchy
- debug: Detailed diagnostic information for development and troubleshooting.
- info: General informational messages about normal operation.
- warn: Warning messages for potentially problematic situations.
- error: Error messages for failures that don't stop execution.
- fatal: Critical errors that cause execution to stop.
Log Sources
- api_handler: API request handling and routing.
- model_loader: Model loading and initialization.
- model_inference: Model execution and inference.
- resource_monitor: System resource monitoring.
- storage_uploader: File upload and storage operations.
- cache_manager: Caching system operations.
- auto_scaler: Auto-scaling events and decisions.
Advanced Filtering
Search Patterns
# Search for memory-related errors
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=memory.*error|OOM|out.*memory" \
-H "Authorization: Bearer YOUR_API_KEY"
# Search for specific model operations
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=model_(load|unload|inference)" \
-H "Authorization: Bearer YOUR_API_KEY"
# Search for performance issues
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=timeout|slow|latency|performance" \
-H "Authorization: Bearer YOUR_API_KEY"
Complex Filtering
curl -X POST "https://api.tensorone.ai/v2/endpoints/logs/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"endpointIds": ["ep_1234567890abcdef"],
"filters": {
"timeRange": {
"start": "2024-01-15T10:00:00Z",
"end": "2024-01-15T11:00:00Z"
},
"levels": ["warn", "error", "fatal"],
"sources": ["model_inference", "resource_monitor"],
"search": {
"query": "memory.*usage|GPU.*utilization",
"caseSensitive": false
},
"metadata": {
"gpuUtilization": {"$gt": 90},
"memoryUsage": {"$regex": "3[0-9]\\.[0-9]GB"}
}
},
"sort": {
"field": "timestamp",
"order": "desc"
},
"limit": 500
}'
Log Aggregation
Log Aggregation Endpoint
curl -X POST "https://api.tensorone.ai/v2/endpoints/logs/aggregate" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"endpointIds": ["ep_1234567890abcdef"],
"timeRange": {
"start": "2024-01-15T00:00:00Z",
"end": "2024-01-15T23:59:59Z"
},
"aggregations": [
{
"name": "errors_by_hour",
"groupBy": ["level", "hour"],
"filters": {"level": ["error", "fatal"]},
"metrics": ["count", "unique_jobs"]
},
{
"name": "performance_metrics",
"groupBy": ["source"],
"metrics": ["avg_duration", "p95_duration", "count"]
}
]
}'
Aggregation Response
{
"aggregations": {
"errors_by_hour": [
{
"level": "error",
"hour": "2024-01-15T10:00:00Z",
"count": 23,
"unique_jobs": 18
},
{
"level": "error",
"hour": "2024-01-15T11:00:00Z",
"count": 15,
"unique_jobs": 12
}
],
"performance_metrics": [
{
"source": "model_inference",
"count": 1247,
"avg_duration": 8.5,
"p95_duration": 15.2
},
{
"source": "model_loader",
"count": 89,
"avg_duration": 12.3,
"p95_duration": 25.8
}
]
}
}
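The same aggregation can be requested programmatically. A sketch using requests against the aggregation endpoint; the request body mirrors the curl example above, and the dictionary comprehension at the end is just one way to consume errors_by_hour:

import requests

def hourly_error_counts(endpoint_id, api_key, start, end):
    """POST the errors_by_hour aggregation and return a {hour: count} mapping."""
    resp = requests.post(
        "https://api.tensorone.ai/v2/endpoints/logs/aggregate",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={
            "endpointIds": [endpoint_id],
            "timeRange": {"start": start, "end": end},
            "aggregations": [{
                "name": "errors_by_hour",
                "groupBy": ["level", "hour"],
                "filters": {"level": ["error", "fatal"]},
                "metrics": ["count", "unique_jobs"]
            }]
        }
    )
    resp.raise_for_status()
    return {bucket["hour"]: bucket["count"] for bucket in resp.json()["aggregations"]["errors_by_hour"]}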
Error Handling
400 Bad Request
{
"error": "INVALID_TIME_RANGE",
"message": "Start time must be before end time",
"details": {
"startTime": "2024-01-15T12:00:00Z",
"endTime": "2024-01-15T10:00:00Z",
"maxTimeRange": "24h"
}
}
403 Forbidden
{
"error": "INSUFFICIENT_PERMISSIONS",
"message": "Logs access requires logs:read permission",
"details": {
"requiredPermission": "logs:read",
"currentPermissions": ["endpoints:execute"]
}
}
413 Payload Too Large
{
"error": "TOO_MANY_LOGS",
"message": "Requested log range contains too many entries",
"details": {
"requestedCount": 50000,
"maxAllowed": 10000,
"suggestion": "Use smaller time ranges or increase pagination"
}
}
429 Rate Limited
{
"error": "RATE_LIMIT_EXCEEDED",
"message": "Too many log requests",
"details": {
"limit": 100,
"window": "1h",
"retryAfter": 60,
"suggestion": "Use streaming for real-time logs"
}
}
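Rate-limited responses include a retryAfter hint (in seconds) in the details object. A minimal backoff sketch around a raw log request; the wrapper function and its defaults are illustrative, not part of the SDK:

import time
import requests

def get_logs_with_retry(endpoint_id, api_key, params=None, max_attempts=3):
    """Fetch logs and back off on 429 using the retryAfter value from the error details."""
    url = f"https://api.tensorone.ai/v2/endpoints/{endpoint_id}/logs"
    headers = {"Authorization": f"Bearer {api_key}"}
    for _ in range(max_attempts):
        resp = requests.get(url, headers=headers, params=params or {})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        retry_after = resp.json().get("details", {}).get("retryAfter", 60)
        time.sleep(retry_after)
    raise RuntimeError("Rate limit still exceeded after retries")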
SDK Examples
Python SDK
from tensorone import TensorOneClient
import json
import time
from datetime import datetime, timedelta
import pandas as pd
client = TensorOneClient(api_key="your_api_key")
# Basic log retrieval
def get_endpoint_logs(endpoint_id, hours_back=1):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)
    logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        level='info',
        limit=1000
    )
    print(f"Retrieved {len(logs.logs)} logs for {endpoint_id}")

    # Display summary
    summary = logs.summary
    print(f"Log levels: {summary.log_levels}")
    print(f"Sources: {summary.sources}")
    return logs
# Error analysis
def analyze_errors(endpoint_id, hours_back=24):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)

    # Get error logs
    error_logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        level='error',
        limit=5000
    )
    print(f"Error Analysis for {endpoint_id}")
    print(f"Total errors in last {hours_back} hours: {len(error_logs.logs)}")

    # Analyze error patterns
    error_codes = {}
    error_sources = {}
    for log in error_logs.logs:
        # Count error codes
        if log.metadata and 'errorCode' in log.metadata:
            code = log.metadata['errorCode']
            error_codes[code] = error_codes.get(code, 0) + 1
        # Count error sources
        source = log.source
        error_sources[source] = error_sources.get(source, 0) + 1

    print("\nTop Error Codes:")
    for code, count in sorted(error_codes.items(), key=lambda x: x[1], reverse=True):
        print(f" {code}: {count}")

    print("\nErrors by Source:")
    for source, count in sorted(error_sources.items(), key=lambda x: x[1], reverse=True):
        print(f" {source}: {count}")
    return error_logs
# Performance analysis from logs
def analyze_performance(endpoint_id, hours_back=6):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)

    # Get performance-related logs
    perf_logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        search='duration|latency|performance|completed',
        level='info'
    )
    durations = []
    memory_usage = []
    gpu_utilization = []
    for log in perf_logs.logs:
        if log.duration:
            durations.append(log.duration)
        if log.metadata:
            # Extract memory usage
            if 'memoryUsage' in log.metadata:
                memory_str = log.metadata['memoryUsage']
                if 'GB' in memory_str:
                    memory_val = float(memory_str.replace('GB', ''))
                    memory_usage.append(memory_val)
            # Extract GPU utilization
            if 'gpuUtilization' in log.metadata:
                gpu_utilization.append(log.metadata['gpuUtilization'])

    if durations:
        print("Performance Analysis:")
        print(f" Average Duration: {sum(durations)/len(durations):.2f}s")
        print(f" Min Duration: {min(durations):.2f}s")
        print(f" Max Duration: {max(durations):.2f}s")
    if memory_usage:
        print(f" Average Memory Usage: {sum(memory_usage)/len(memory_usage):.1f}GB")
        print(f" Peak Memory Usage: {max(memory_usage):.1f}GB")
    if gpu_utilization:
        print(f" Average GPU Utilization: {sum(gpu_utilization)/len(gpu_utilization):.1f}%")
        print(f" Peak GPU Utilization: {max(gpu_utilization):.1f}%")
    return {
        'durations': durations,
        'memory_usage': memory_usage,
        'gpu_utilization': gpu_utilization
    }
# Real-time log monitoring
def monitor_logs_realtime(endpoint_id, callback=None):
    """Monitor logs in real-time using streaming"""
    def default_callback(log_entry):
        timestamp = log_entry['timestamp']
        level = log_entry['level']
        message = log_entry['message']
        source = log_entry.get('source', 'unknown')
        print(f"[{timestamp}] {level.upper()}: {message} ({source})")

        # Alert on errors
        if level in ['error', 'fatal']:
            print(f"🚨 ALERT: {level.upper()} in {source}")
            if log_entry.get('metadata', {}).get('errorCode'):
                print(f" Error Code: {log_entry['metadata']['errorCode']}")

    callback = callback or default_callback
    try:
        # Start streaming logs
        for log_entry in client.endpoints.stream_logs(endpoint_id):
            callback(log_entry)
    except KeyboardInterrupt:
        print("\nStopping log monitoring...")
    except Exception as e:
        print(f"Error in log monitoring: {e}")
# Log export and analysis
def export_logs_to_dataframe(endpoint_id, hours_back=24):
    """Export logs to pandas DataFrame for analysis"""
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)

    # Get all logs
    logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        level='debug',  # Get all levels
        limit=10000
    )

    # Convert to DataFrame
    log_data = []
    for log in logs.logs:
        row = {
            'timestamp': pd.to_datetime(log.timestamp),
            'level': log.level,
            'message': log.message,
            'source': log.source,
            'job_id': log.job_id,
            'request_id': log.request_id,
            'duration': log.duration,
            'memory_usage': log.memory_usage,
            'gpu_utilization': log.gpu_utilization
        }
        # Add metadata fields
        if log.metadata:
            for key, value in log.metadata.items():
                row[f'meta_{key}'] = value
        log_data.append(row)

    df = pd.DataFrame(log_data)

    # Basic analysis
    print("Log Analysis Summary:")
    print(f"Total logs: {len(df)}")
    print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
    print("Log levels:")
    print(df['level'].value_counts())
    print("\nTop sources:")
    print(df['source'].value_counts().head())
    return df
# Automated error alerting
def setup_error_alerting(endpoint_ids, check_interval=300):
    """Set up automated error alerting for multiple endpoints"""
    def check_recent_errors():
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(seconds=check_interval)
        alerts = []
        for endpoint_id in endpoint_ids:
            try:
                error_logs = client.endpoints.get_logs(
                    endpoint_id=endpoint_id,
                    start_time=start_time.isoformat() + 'Z',
                    end_time=end_time.isoformat() + 'Z',
                    level='error'
                )
                if error_logs.logs:
                    error_count = len(error_logs.logs)
                    alerts.append({
                        'endpoint_id': endpoint_id,
                        'error_count': error_count,
                        'recent_errors': error_logs.logs[:3]  # Last 3 errors
                    })
            except Exception as e:
                print(f"Error checking logs for {endpoint_id}: {e}")

        if alerts:
            print(f"\n🚨 ERROR ALERT - {datetime.utcnow().isoformat()}")
            for alert in alerts:
                print(f"Endpoint {alert['endpoint_id']}: {alert['error_count']} new errors")
                for error in alert['recent_errors']:
                    print(f" - {error.message}")
        return alerts

    print(f"Starting error monitoring for {len(endpoint_ids)} endpoints...")
    print(f"Check interval: {check_interval} seconds")
    try:
        while True:
            check_recent_errors()
            time.sleep(check_interval)
    except KeyboardInterrupt:
        print("\nStopping error monitoring...")
# Usage examples
if __name__ == "__main__":
    endpoint_id = "ep_1234567890abcdef"

    # Basic log retrieval
    logs = get_endpoint_logs(endpoint_id, hours_back=2)

    # Error analysis
    error_analysis = analyze_errors(endpoint_id, hours_back=24)

    # Performance analysis
    perf_data = analyze_performance(endpoint_id, hours_back=6)

    # Export to DataFrame for advanced analysis
    df = export_logs_to_dataframe(endpoint_id, hours_back=12)

    # Real-time monitoring (uncomment to run)
    # monitor_logs_realtime(endpoint_id)

    # Automated alerting (uncomment to run)
    # endpoints = ["ep_1234567890abcdef", "ep_2345678901bcdefg"]
    # setup_error_alerting(endpoints, check_interval=300)
JavaScript SDK
import { TensorOneClient } from "@tensorone/sdk";
import fs from 'fs';
const client = new TensorOneClient({ apiKey: "your_api_key" });
// Basic log retrieval
async function getEndpointLogs(endpointId, hoursBack = 1) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const logs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
level: 'info',
limit: 1000
});
console.log(`Retrieved ${logs.logs.length} logs for ${endpointId}`);
// Display summary
const summary = logs.summary;
console.log('Log levels:', summary.logLevels);
console.log('Sources:', summary.sources);
return logs;
}
// Real-time log monitoring with EventSource
async function monitorLogsRealtime(endpointId, options = {}) {
const {
onLog = console.log,
onError = console.error,
levelFilter = 'info',
reconnectDelay = 5000
} = options;
let reconnectTimeout;
function connect() {
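// Note: this assumes a Node.js EventSource implementation (e.g. the 'eventsource' package),
// which accepts custom headers; the browser-native EventSource API does not.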
const eventSource = new EventSource(
`https://api.tensorone.ai/v2/endpoints/${endpointId}/logs?stream=true&level=${levelFilter}`,
{
headers: {
'Authorization': `Bearer ${process.env.TENSORONE_API_KEY}`
}
}
);
eventSource.onmessage = (event) => {
try {
const logEntry = JSON.parse(event.data);
onLog(logEntry);
// Alert on errors
if (logEntry.level === 'error' || logEntry.level === 'fatal') {
console.warn(`🚨 ${logEntry.level.toUpperCase()}: ${logEntry.message}`);
}
} catch (error) {
console.error('Error parsing log entry:', error);
}
};
eventSource.onerror = (error) => {
console.error('Log stream error:', error);
eventSource.close();
// Reconnect after delay
console.log(`Reconnecting in ${reconnectDelay/1000} seconds...`);
reconnectTimeout = setTimeout(connect, reconnectDelay);
};
// Handle graceful shutdown
process.on('SIGINT', () => {
console.log('\nClosing log stream...');
eventSource.close();
if (reconnectTimeout) {
clearTimeout(reconnectTimeout);
}
process.exit(0);
});
return eventSource;
}
return connect();
}
// Error pattern analysis
async function analyzeErrorPatterns(endpointId, hoursBack = 24) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const errorLogs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
level: 'error',
limit: 5000
});
console.log(`Error Analysis for ${endpointId}`);
console.log(`Total errors in last ${hoursBack} hours: ${errorLogs.logs.length}`);
// Analyze error patterns
const errorCodes = {};
const errorSources = {};
const errorTimeline = {};
errorLogs.logs.forEach(log => {
// Count error codes
if (log.metadata?.errorCode) {
errorCodes[log.metadata.errorCode] = (errorCodes[log.metadata.errorCode] || 0) + 1;
}
// Count error sources
errorSources[log.source] = (errorSources[log.source] || 0) + 1;
// Create hourly timeline
const hour = new Date(log.timestamp).toISOString().substring(0, 13) + ':00:00Z';
errorTimeline[hour] = (errorTimeline[hour] || 0) + 1;
});
console.log('\nTop Error Codes:');
Object.entries(errorCodes)
.sort(([,a], [,b]) => b - a)
.forEach(([code, count]) => {
console.log(` ${code}: ${count}`);
});
console.log('\nErrors by Source:');
Object.entries(errorSources)
.sort(([,a], [,b]) => b - a)
.forEach(([source, count]) => {
console.log(` ${source}: ${count}`);
});
console.log('\nError Timeline (hourly):');
Object.entries(errorTimeline)
.sort(([a], [b]) => a.localeCompare(b))
.forEach(([hour, count]) => {
console.log(` ${hour}: ${count} errors`);
});
return {
errorCodes,
errorSources,
errorTimeline,
totalErrors: errorLogs.logs.length
};
}
// Performance metrics from logs
async function analyzePerformanceFromLogs(endpointId, hoursBack = 6) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const perfLogs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
search: 'duration|latency|performance|completed',
level: 'info',
limit: 10000
});
const durations = [];
const memoryUsage = [];
const gpuUtilization = [];
perfLogs.logs.forEach(log => {
if (log.duration) {
durations.push(log.duration);
}
if (log.metadata) {
// Extract memory usage
if (log.metadata.memoryUsage && typeof log.metadata.memoryUsage === 'string') {
const memoryMatch = log.metadata.memoryUsage.match(/(\d+\.?\d*)GB/);
if (memoryMatch) {
memoryUsage.push(parseFloat(memoryMatch[1]));
}
}
// Extract GPU utilization
if (typeof log.metadata.gpuUtilization === 'number') {
gpuUtilization.push(log.metadata.gpuUtilization);
}
}
});
const analysis = {
durations: {
count: durations.length,
average: durations.length ? durations.reduce((a, b) => a + b, 0) / durations.length : 0,
min: durations.length ? Math.min(...durations) : 0,
max: durations.length ? Math.max(...durations) : 0,
p95: durations.length ? percentile(durations, 95) : 0
},
memory: {
count: memoryUsage.length,
average: memoryUsage.length ? memoryUsage.reduce((a, b) => a + b, 0) / memoryUsage.length : 0,
peak: memoryUsage.length ? Math.max(...memoryUsage) : 0
},
gpu: {
count: gpuUtilization.length,
average: gpuUtilization.length ? gpuUtilization.reduce((a, b) => a + b, 0) / gpuUtilization.length : 0,
peak: gpuUtilization.length ? Math.max(...gpuUtilization) : 0
}
};
console.log('Performance Analysis:');
console.log(` Average Duration: ${analysis.durations.average.toFixed(2)}s`);
console.log(` P95 Duration: ${analysis.durations.p95.toFixed(2)}s`);
console.log(` Peak Memory Usage: ${analysis.memory.peak.toFixed(1)}GB`);
console.log(` Average GPU Utilization: ${analysis.gpu.average.toFixed(1)}%`);
return analysis;
}
// Helper function to calculate percentiles
function percentile(arr, p) {
const sorted = [...arr].sort((a, b) => a - b);
const index = (p / 100) * (sorted.length - 1);
const lower = Math.floor(index);
const upper = Math.ceil(index);
if (lower === upper) {
return sorted[lower];
}
const weight = index - lower;
return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}
// Log export functionality
async function exportLogs(endpointId, format = 'json', hoursBack = 24) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const logs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
format: format,
limit: 10000
});
const filename = `${endpointId}_logs_${startTime.toISOString().split('T')[0]}.${format}`;
if (format === 'json') {
fs.writeFileSync(filename, JSON.stringify(logs, null, 2));
} else {
fs.writeFileSync(filename, logs);
}
console.log(`Logs exported to ${filename}`);
return filename;
}
// Automated error alerting
class LogAlerting {
constructor(endpointIds, options = {}) {
this.endpointIds = endpointIds;
this.checkInterval = options.checkInterval || 300000; // 5 minutes
this.errorThreshold = options.errorThreshold || 5;
this.callbacks = {
onAlert: options.onAlert || this.defaultAlertHandler,
onError: options.onError || console.error
};
this.isRunning = false;
this.intervalId = null;
}
defaultAlertHandler(alerts) {
console.log(`\n🚨 ERROR ALERTS - ${new Date().toISOString()}`);
alerts.forEach(alert => {
console.log(`Endpoint ${alert.endpointId}: ${alert.errorCount} new errors`);
alert.recentErrors.slice(0, 3).forEach(error => {
console.log(` - ${error.message}`);
});
});
}
async checkForErrors() {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - this.checkInterval);
const alerts = [];
for (const endpointId of this.endpointIds) {
try {
const errorLogs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
level: 'error',
limit: 100
});
if (errorLogs.logs.length >= this.errorThreshold) {
alerts.push({
endpointId,
errorCount: errorLogs.logs.length,
recentErrors: errorLogs.logs
});
}
} catch (error) {
this.callbacks.onError(`Error checking logs for ${endpointId}:`, error);
}
}
if (alerts.length > 0) {
this.callbacks.onAlert(alerts);
}
return alerts;
}
start() {
if (this.isRunning) {
console.log('Alerting is already running');
return;
}
console.log(`Starting error monitoring for ${this.endpointIds.length} endpoints...`);
console.log(`Check interval: ${this.checkInterval / 1000} seconds`);
console.log(`Error threshold: ${this.errorThreshold} errors per interval`);
this.isRunning = true;
this.intervalId = setInterval(() => {
this.checkForErrors();
}, this.checkInterval);
// Initial check
this.checkForErrors();
}
stop() {
if (!this.isRunning) {
console.log('Alerting is not running');
return;
}
console.log('Stopping error monitoring...');
this.isRunning = false;
if (this.intervalId) {
clearInterval(this.intervalId);
this.intervalId = null;
}
}
}
// Usage examples
async function main() {
const endpointId = "ep_1234567890abcdef";
const endpointIds = ["ep_1234567890abcdef", "ep_2345678901bcdefg"];
try {
// Basic log retrieval
const logs = await getEndpointLogs(endpointId, 2);
// Error analysis
const errorAnalysis = await analyzeErrorPatterns(endpointId, 24);
// Performance analysis
const perfAnalysis = await analyzePerformanceFromLogs(endpointId, 6);
// Export logs
await exportLogs(endpointId, 'json', 12);
// Set up automated alerting
const alerting = new LogAlerting(endpointIds, {
checkInterval: 300000, // 5 minutes
errorThreshold: 3,
onAlert: (alerts) => {
// Custom alert handling
console.log('Custom alert handler triggered!');
alerts.forEach(alert => {
console.log(`🔥 ${alert.endpointId}: ${alert.errorCount} errors!`);
});
}
});
// Start alerting (uncomment to run)
// alerting.start();
// Real-time monitoring (uncomment to run)
// monitorLogsRealtime(endpointId, {
// levelFilter: 'info',
// onLog: (log) => {
// console.log(`[${log.timestamp}] ${log.level}: ${log.message}`);
// }
// });
} catch (error) {
console.error("Log analysis error:", error);
}
}
main();
Use Cases
Production Debugging
- Error Investigation: Quickly identify and analyze production errors
- Performance Troubleshooting: Diagnose latency and throughput issues
- Resource Problems: Monitor memory leaks and resource exhaustion
- Integration Issues: Debug API calls and external service failures
Development and Testing
- Development Debugging: Monitor application behavior during development
- Load Testing: Analyze system behavior under load
- Performance Optimization: Identify optimization opportunities
- Quality Assurance: Verify correct application behavior
Operations and Monitoring
- Real-time Monitoring: Monitor system health in real-time
- Alerting Systems: Set up automated alerts for critical issues
- Compliance Auditing: Maintain audit trails for compliance requirements
- Capacity Planning: Analyze usage patterns for capacity planning
Business Intelligence
- Usage Analytics: Understand user behavior and usage patterns
- Performance Metrics: Track application performance over time
- Cost Analysis: Analyze operational costs and optimization opportunities
- Trend Analysis: Identify patterns and trends in application usage
Best Practices
Log Management
- Structured Logging: Use structured log formats for easier analysis
- Log Levels: Use appropriate log levels to control verbosity
- Retention Policies: Define retention policies based on compliance requirements
- Storage Optimization: Use appropriate storage tiers for different log types
Performance Considerations
- Filtering: Use specific filters to reduce data transfer and processing
- Pagination: Use pagination for large log sets to avoid timeouts
- Streaming: Use streaming for real-time monitoring instead of polling
- Caching: Cache log analysis results when appropriate
Security and Compliance
- Access Control: Implement proper access controls for sensitive logs
- Data Privacy: Ensure logs don’t contain sensitive personal information
- Audit Trails: Maintain audit trails for log access and modifications
- Encryption: Use encryption for sensitive log data
Monitoring and Alerting
- Proactive Monitoring: Set up proactive monitoring for critical issues
- Alert Thresholds: Set appropriate thresholds to avoid alert fatigue
- Escalation Procedures: Define clear escalation procedures for different alert types
- Integration: Integrate with existing monitoring and alerting systems
Logs are retained for 30 days by default. For longer retention, consider exporting logs to your own storage
system or upgrading to a plan with extended retention.
Log streaming consumes resources and should be used judiciously. Close streaming connections when not needed
to avoid unnecessary resource usage.
Use structured search patterns and metadata filtering to quickly find relevant logs. Consider setting up
automated log analysis pipelines for common debugging scenarios.
Authorizations
API key authentication. Use 'Bearer YOUR_API_KEY' format.