Path Parameters
- endpointId: The unique identifier of the endpoint to retrieve logs for.

Query Parameters
- startTime: Start time for log retrieval (ISO 8601 format, e.g., 2024-01-15T10:30:00Z).
- endTime: End time for log retrieval (ISO 8601 format). Defaults to the current time.
- level: Log level filter (debug, info, warn, error, fatal). Defaults to info.
- limit: Maximum number of log entries to return (1-10000). Defaults to 1000.
- offset: Number of log entries to skip for pagination. Defaults to 0.
- jobId: Filter logs for a specific job execution.
- search: Search term to filter log messages (supports regex).
- format: Response format (json, text, csv). Defaults to json.
- stream: Enable real-time log streaming (true, false). Defaults to false.
Example Usage
Basic Log Retrieval
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Filtered Logs with Time Range
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?startTime=2024-01-15T10:00:00Z&endTime=2024-01-15T11:00:00Z&level=error" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Job-Specific Logs
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?jobId=job_1234567890abcdef&level=debug" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Search Logs with Pattern
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=memory.*error&level=warn" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
Real-time Log Streaming
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?stream=true" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Accept: text/event-stream"
Export Logs in CSV Format
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?format=csv&startTime=2024-01-15T00:00:00Z&endTime=2024-01-15T23:59:59Z" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Accept: text/csv"
Response
JSON Format Response
{
"endpointId": "ep_1234567890abcdef",
"totalLogs": 1247,
"filteredLogs": 856,
"startTime": "2024-01-15T10:00:00Z",
"endTime": "2024-01-15T11:00:00Z",
"pagination": {
"limit": 1000,
"offset": 0,
"hasMore": false,
"nextOffset": null
},
"logs": [
{
"timestamp": "2024-01-15T10:30:15.234Z",
"level": "info",
"message": "Processing image generation request",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "model_inference",
"metadata": {
"prompt": "A sunset over mountains",
"model": "stable-diffusion-xl",
"parameters": {
"steps": 30,
"guidance_scale": 7.5,
"width": 1024,
"height": 1024
}
},
"duration": null,
"memoryUsage": "8.2GB",
"gpuUtilization": 87
},
{
"timestamp": "2024-01-15T10:30:18.567Z",
"level": "debug",
"message": "Model loaded successfully",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "model_loader",
"metadata": {
"modelPath": "/models/stable-diffusion-xl/model.safetensors",
"loadTime": 3.2,
"modelSize": "6.9GB",
"precision": "fp16"
},
"duration": 3.2,
"memoryUsage": "6.9GB",
"gpuUtilization": 45
},
{
"timestamp": "2024-01-15T10:30:22.891Z",
"level": "info",
"message": "Image generation completed",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "model_inference",
"metadata": {
"outputPath": "/tmp/generated_image_abc123.png",
"generationTime": 4.3,
"seed": 42,
"finalSteps": 30
},
"duration": 4.3,
"memoryUsage": "8.2GB",
"gpuUtilization": 92
},
{
"timestamp": "2024-01-15T10:30:25.123Z",
"level": "warn",
"message": "High GPU memory usage detected",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "resource_monitor",
"metadata": {
"currentUsage": "38.5GB",
"totalMemory": "40GB",
"utilizationPercent": 96.25,
"recommendation": "Consider reducing batch size or image resolution"
},
"duration": null,
"memoryUsage": "38.5GB",
"gpuUtilization": 96
},
{
"timestamp": "2024-01-15T10:30:28.456Z",
"level": "error",
"message": "Failed to upload result to storage",
"jobId": "job_1234567890abcdef",
"requestId": "req_abc123def456",
"source": "storage_uploader",
"metadata": {
"error": "Connection timeout after 30 seconds",
"errorCode": "STORAGE_TIMEOUT",
"retryAttempt": 1,
"maxRetries": 3,
"filePath": "/tmp/generated_image_abc123.png",
"fileSize": "2.4MB"
},
"duration": 30.0,
"stackTrace": [
"at StorageUploader.upload (storage.js:45:12)",
"at async ImageProcessor.saveResult (processor.js:128:8)",
"at async handleRequest (handler.js:67:4)"
]
}
],
"summary": {
"logLevels": {
"debug": 234,
"info": 456,
"warn": 123,
"error": 43,
"fatal": 0
},
"sources": {
"model_inference": 245,
"model_loader": 67,
"resource_monitor": 156,
"storage_uploader": 89,
"api_handler": 299
},
"commonErrors": [
{
"error": "STORAGE_TIMEOUT",
"count": 12,
"firstOccurrence": "2024-01-15T10:15:30Z",
"lastOccurrence": "2024-01-15T10:55:12Z"
},
{
"error": "MEMORY_LIMIT_EXCEEDED",
"count": 8,
"firstOccurrence": "2024-01-15T10:22:45Z",
"lastOccurrence": "2024-01-15T10:48:30Z"
}
]
}
}
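The pagination object above drives paging: when hasMore is true, send nextOffset back as the offset query parameter on the next request. A minimal paging sketch using the Python SDK shown later on this page (the pagination attribute names are assumptions mapped from the JSON fields above):

from tensorone import TensorOneClient

client = TensorOneClient(api_key="your_api_key")

def fetch_all_logs(endpoint_id, **filters):
    """Follow pagination.nextOffset until hasMore is false and collect every entry."""
    collected, offset = [], 0
    while True:
        page = client.endpoints.get_logs(endpoint_id=endpoint_id, offset=offset, limit=1000, **filters)
        collected.extend(page.logs)
        if not page.pagination.has_more:
            return collected
        offset = page.pagination.next_offset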
Error Logs with Stack Traces
{
"timestamp": "2024-01-15T10:35:42.789Z",
"level": "error",
"message": "Model inference failed with CUDA out of memory error",
"jobId": "job_error_example",
"requestId": "req_error_456",
"source": "model_inference",
"metadata": {
"error": "CUDA out of memory. Tried to allocate 2.50 GiB",
"errorCode": "CUDA_OOM",
"gpuMemoryUsed": "39.2GB",
"gpuMemoryTotal": "40GB",
"requestedAllocation": "2.5GB",
"model": "llama-2-70b",
"batchSize": 4,
"sequenceLength": 2048
},
"stackTrace": [
"RuntimeError: CUDA out of memory. Tried to allocate 2.50 GiB",
" at torch.cuda.OutOfMemoryError",
" at model_inference.py:156 in forward()",
" at inference_handler.py:89 in process_batch()",
" at main.py:45 in handle_request()"
],
"context": {
"previousRequests": [
{
"requestId": "req_456789",
"memoryUsage": "35.8GB",
"status": "completed"
},
{
"requestId": "req_567890",
"memoryUsage": "37.1GB",
"status": "completed"
}
],
"systemState": {
"availableMemory": "0.8GB",
"activeProcesses": 3,
"cacheSize": "12.4GB"
}
},
"recoveryActions": [
{
"type": "memory_cleanup",
"description": "Clear model cache and retry",
"executed": true,
"result": "freed 8.2GB memory"
},
{
"type": "batch_size_reduction",
"description": "Reduce batch size from 4 to 2",
"executed": true,
"result": "retry successful"
}
]
}
Performance Logs
{
"timestamp": "2024-01-15T10:40:15.123Z",
"level": "info",
"message": "Request processing completed",
"jobId": "job_perf_example",
"requestId": "req_perf_789",
"source": "performance_tracker",
"metadata": {
"totalDuration": 12.5,
"phases": {
"queueTime": 0.2,
"coldStartTime": 0.0,
"modelLoadTime": 0.0,
"inferenceTime": 11.8,
"postProcessingTime": 0.3,
"uploadTime": 0.2
},
"resourceUsage": {
"peakGpuMemory": "32.1GB",
"peakGpuUtilization": 94,
"avgCpuUsage": 45,
"networkIO": {
"ingress": "125MB",
"egress": "8.2MB"
}
},
"optimizations": {
"cacheHit": true,
"modelReused": true,
"batchProcessed": false
}
},
"benchmarks": {
"targetLatency": 10.0,
"actualLatency": 12.5,
"performance": "within_sla",
"percentile": "p85"
}
}
Text Format Response
2024-01-15T10:30:15.234Z [INFO] model_inference: Processing image generation request (job_1234567890abcdef)
2024-01-15T10:30:18.567Z [DEBUG] model_loader: Model loaded successfully in 3.2s (job_1234567890abcdef)
2024-01-15T10:30:22.891Z [INFO] model_inference: Image generation completed in 4.3s (job_1234567890abcdef)
2024-01-15T10:30:25.123Z [WARN] resource_monitor: High GPU memory usage detected: 96.25% (job_1234567890abcdef)
2024-01-15T10:30:28.456Z [ERROR] storage_uploader: Failed to upload result to storage: Connection timeout (job_1234567890abcdef)
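If you consume the text format programmatically, each line carries the same fields in a fixed layout: timestamp, bracketed level, source, message, and the job ID in parentheses. A small parsing sketch, assuming every line follows the layout shown above:

import re

# timestamp [LEVEL] source: message (jobId)
LINE_RE = re.compile(
    r'^(?P<timestamp>\S+) \[(?P<level>[A-Z]+)\] (?P<source>\w+): (?P<message>.*) \((?P<job_id>\w+)\)$'
)

def parse_text_log_line(line):
    """Return the fields of one text-format log line as a dict, or None if the line doesn't match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None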
Real-time Streaming Response
data: {"timestamp":"2024-01-15T10:45:00.123Z","level":"info","message":"New request received","jobId":"job_live_stream","source":"api_handler"}
data: {"timestamp":"2024-01-15T10:45:01.456Z","level":"debug","message":"Loading model weights","jobId":"job_live_stream","source":"model_loader"}
data: {"timestamp":"2024-01-15T10:45:05.789Z","level":"info","message":"Model ready for inference","jobId":"job_live_stream","source":"model_loader"}
data: {"timestamp":"2024-01-15T10:45:08.012Z","level":"info","message":"Processing completed successfully","jobId":"job_live_stream","source":"model_inference"}
Log Levels
Level Hierarchy
- debug: Detailed diagnostic information for development and troubleshooting.
- info: General informational messages about normal operation.
- warn: Warning messages for potentially problematic situations.
- error: Error messages for failures that don't stop execution.
- fatal: Critical errors that cause execution to stop.
Log Sources
- api_handler: API request handling and routing.
- model_loader: Model loading and initialization.
- model_inference: Model execution and inference.
- resource_monitor: System resource monitoring.
- storage_uploader: File upload and storage operations.
- cache_manager: Caching system operations.
- auto_scaler: Auto-scaling events and decisions.
Advanced Filtering
Search Patterns
# Search for memory-related errors
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=memory.*error|OOM|out.*memory" \
-H "Authorization: Bearer YOUR_API_KEY"
# Search for specific model operations
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=model_(load|unload|inference)" \
-H "Authorization: Bearer YOUR_API_KEY"
# Search for performance issues
curl -X GET "https://api.tensorone.ai/v2/endpoints/ep_1234567890abcdef/logs?search=timeout|slow|latency|performance" \
-H "Authorization: Bearer YOUR_API_KEY"
Complex Filtering
curl -X POST "https://api.tensorone.ai/v2/endpoints/logs/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"endpointIds": ["ep_1234567890abcdef"],
"filters": {
"timeRange": {
"start": "2024-01-15T10:00:00Z",
"end": "2024-01-15T11:00:00Z"
},
"levels": ["warn", "error", "fatal"],
"sources": ["model_inference", "resource_monitor"],
"search": {
"query": "memory.*usage|GPU.*utilization",
"caseSensitive": false
},
"metadata": {
"gpuUtilization": {"$gt": 90},
"memoryUsage": {"$regex": "3[0-9]\\.[0-9]GB"}
}
},
"sort": {
"field": "timestamp",
"order": "desc"
},
"limit": 500
}'
Log Aggregation
Log Aggregation Endpoint
curl -X POST "https://api.tensorone.ai/v2/endpoints/logs/aggregate" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"endpointIds": ["ep_1234567890abcdef"],
"timeRange": {
"start": "2024-01-15T00:00:00Z",
"end": "2024-01-15T23:59:59Z"
},
"aggregations": [
{
"name": "errors_by_hour",
"groupBy": ["level", "hour"],
"filters": {"level": ["error", "fatal"]},
"metrics": ["count", "unique_jobs"]
},
{
"name": "performance_metrics",
"groupBy": ["source"],
"metrics": ["avg_duration", "p95_duration", "count"]
}
]
}'
Aggregation Response
{
"aggregations": {
"errors_by_hour": [
{
"level": "error",
"hour": "2024-01-15T10:00:00Z",
"count": 23,
"unique_jobs": 18
},
{
"level": "error",
"hour": "2024-01-15T11:00:00Z",
"count": 15,
"unique_jobs": 12
}
],
"performance_metrics": [
{
"source": "model_inference",
"count": 1247,
"avg_duration": 8.5,
"p95_duration": 15.2
},
{
"source": "model_loader",
"count": 89,
"avg_duration": 12.3,
"p95_duration": 25.8
}
]
}
}
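The same aggregation can be requested programmatically. A sketch using requests against the aggregation endpoint; the request body mirrors the curl example above, and the dictionary comprehension at the end is just one way to consume errors_by_hour:

import requests

def hourly_error_counts(endpoint_id, api_key, start, end):
    """POST the errors_by_hour aggregation and return a {hour: count} mapping."""
    resp = requests.post(
        "https://api.tensorone.ai/v2/endpoints/logs/aggregate",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={
            "endpointIds": [endpoint_id],
            "timeRange": {"start": start, "end": end},
            "aggregations": [{
                "name": "errors_by_hour",
                "groupBy": ["level", "hour"],
                "filters": {"level": ["error", "fatal"]},
                "metrics": ["count", "unique_jobs"]
            }]
        }
    )
    resp.raise_for_status()
    return {bucket["hour"]: bucket["count"] for bucket in resp.json()["aggregations"]["errors_by_hour"]}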
Error Handling
400 Bad Request
{
"error": "INVALID_TIME_RANGE",
"message": "Start time must be before end time",
"details": {
"startTime": "2024-01-15T12:00:00Z",
"endTime": "2024-01-15T10:00:00Z",
"maxTimeRange": "24h"
}
}
403 Forbidden
{
"error": "INSUFFICIENT_PERMISSIONS",
"message": "Logs access requires logs:read permission",
"details": {
"requiredPermission": "logs:read",
"currentPermissions": ["endpoints:execute"]
}
}
413 Payload Too Large
{
"error": "TOO_MANY_LOGS",
"message": "Requested log range contains too many entries",
"details": {
"requestedCount": 50000,
"maxAllowed": 10000,
"suggestion": "Use smaller time ranges or increase pagination"
}
}
429 Rate Limited
{
"error": "RATE_LIMIT_EXCEEDED",
"message": "Too many log requests",
"details": {
"limit": 100,
"window": "1h",
"retryAfter": 60,
"suggestion": "Use streaming for real-time logs"
}
}
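Rate-limited responses include a retryAfter hint (in seconds) in the details object. A minimal backoff sketch around a raw log request; the wrapper function and its defaults are illustrative, not part of the SDK:

import time
import requests

def get_logs_with_retry(endpoint_id, api_key, params=None, max_attempts=3):
    """Fetch logs and back off on 429 using the retryAfter value from the error details."""
    url = f"https://api.tensorone.ai/v2/endpoints/{endpoint_id}/logs"
    headers = {"Authorization": f"Bearer {api_key}"}
    for _ in range(max_attempts):
        resp = requests.get(url, headers=headers, params=params or {})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        retry_after = resp.json().get("details", {}).get("retryAfter", 60)
        time.sleep(retry_after)
    raise RuntimeError("Rate limit still exceeded after retries")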
SDK Examples
Python SDK
from tensorone import TensorOneClient
import json
import time
from datetime import datetime, timedelta
import pandas as pd
client = TensorOneClient(api_key="your_api_key")
# Basic log retrieval
def get_endpoint_logs(endpoint_id, hours_back=1):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)
    logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        level='info',
        limit=1000
    )
    print(f"Retrieved {len(logs.logs)} logs for {endpoint_id}")

    # Display summary
    summary = logs.summary
    print(f"Log levels: {summary.log_levels}")
    print(f"Sources: {summary.sources}")
    return logs
# Error analysis
def analyze_errors(endpoint_id, hours_back=24):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)

    # Get error logs
    error_logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        level='error',
        limit=5000
    )
    print(f"Error Analysis for {endpoint_id}")
    print(f"Total errors in last {hours_back} hours: {len(error_logs.logs)}")

    # Analyze error patterns
    error_codes = {}
    error_sources = {}
    for log in error_logs.logs:
        # Count error codes
        if log.metadata and 'errorCode' in log.metadata:
            code = log.metadata['errorCode']
            error_codes[code] = error_codes.get(code, 0) + 1
        # Count error sources
        source = log.source
        error_sources[source] = error_sources.get(source, 0) + 1

    print("\nTop Error Codes:")
    for code, count in sorted(error_codes.items(), key=lambda x: x[1], reverse=True):
        print(f" {code}: {count}")

    print("\nErrors by Source:")
    for source, count in sorted(error_sources.items(), key=lambda x: x[1], reverse=True):
        print(f" {source}: {count}")
    return error_logs
# Performance analysis from logs
def analyze_performance(endpoint_id, hours_back=6):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)

    # Get performance-related logs
    perf_logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        search='duration|latency|performance|completed',
        level='info'
    )
    durations = []
    memory_usage = []
    gpu_utilization = []
    for log in perf_logs.logs:
        if log.duration:
            durations.append(log.duration)
        if log.metadata:
            # Extract memory usage
            if 'memoryUsage' in log.metadata:
                memory_str = log.metadata['memoryUsage']
                if 'GB' in memory_str:
                    memory_val = float(memory_str.replace('GB', ''))
                    memory_usage.append(memory_val)
            # Extract GPU utilization
            if 'gpuUtilization' in log.metadata:
                gpu_utilization.append(log.metadata['gpuUtilization'])

    if durations:
        print("Performance Analysis:")
        print(f" Average Duration: {sum(durations)/len(durations):.2f}s")
        print(f" Min Duration: {min(durations):.2f}s")
        print(f" Max Duration: {max(durations):.2f}s")
    if memory_usage:
        print(f" Average Memory Usage: {sum(memory_usage)/len(memory_usage):.1f}GB")
        print(f" Peak Memory Usage: {max(memory_usage):.1f}GB")
    if gpu_utilization:
        print(f" Average GPU Utilization: {sum(gpu_utilization)/len(gpu_utilization):.1f}%")
        print(f" Peak GPU Utilization: {max(gpu_utilization):.1f}%")
    return {
        'durations': durations,
        'memory_usage': memory_usage,
        'gpu_utilization': gpu_utilization
    }
# Real-time log monitoring
def monitor_logs_realtime(endpoint_id, callback=None):
    """Monitor logs in real-time using streaming"""
    def default_callback(log_entry):
        timestamp = log_entry['timestamp']
        level = log_entry['level']
        message = log_entry['message']
        source = log_entry.get('source', 'unknown')
        print(f"[{timestamp}] {level.upper()}: {message} ({source})")

        # Alert on errors
        if level in ['error', 'fatal']:
            print(f"🚨 ALERT: {level.upper()} in {source}")
            if log_entry.get('metadata', {}).get('errorCode'):
                print(f" Error Code: {log_entry['metadata']['errorCode']}")

    callback = callback or default_callback
    try:
        # Start streaming logs
        for log_entry in client.endpoints.stream_logs(endpoint_id):
            callback(log_entry)
    except KeyboardInterrupt:
        print("\nStopping log monitoring...")
    except Exception as e:
        print(f"Error in log monitoring: {e}")
# Log export and analysis
def export_logs_to_dataframe(endpoint_id, hours_back=24):
    """Export logs to pandas DataFrame for analysis"""
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours_back)

    # Get all logs
    logs = client.endpoints.get_logs(
        endpoint_id=endpoint_id,
        start_time=start_time.isoformat() + 'Z',
        end_time=end_time.isoformat() + 'Z',
        level='debug',  # Get all levels
        limit=10000
    )

    # Convert to DataFrame
    log_data = []
    for log in logs.logs:
        row = {
            'timestamp': pd.to_datetime(log.timestamp),
            'level': log.level,
            'message': log.message,
            'source': log.source,
            'job_id': log.job_id,
            'request_id': log.request_id,
            'duration': log.duration,
            'memory_usage': log.memory_usage,
            'gpu_utilization': log.gpu_utilization
        }
        # Add metadata fields
        if log.metadata:
            for key, value in log.metadata.items():
                row[f'meta_{key}'] = value
        log_data.append(row)

    df = pd.DataFrame(log_data)

    # Basic analysis
    print("Log Analysis Summary:")
    print(f"Total logs: {len(df)}")
    print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
    print("Log levels:")
    print(df['level'].value_counts())
    print("\nTop sources:")
    print(df['source'].value_counts().head())
    return df
# Automated error alerting
def setup_error_alerting(endpoint_ids, check_interval=300):
    """Set up automated error alerting for multiple endpoints"""
    def check_recent_errors():
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(seconds=check_interval)
        alerts = []
        for endpoint_id in endpoint_ids:
            try:
                error_logs = client.endpoints.get_logs(
                    endpoint_id=endpoint_id,
                    start_time=start_time.isoformat() + 'Z',
                    end_time=end_time.isoformat() + 'Z',
                    level='error'
                )
                if error_logs.logs:
                    error_count = len(error_logs.logs)
                    alerts.append({
                        'endpoint_id': endpoint_id,
                        'error_count': error_count,
                        'recent_errors': error_logs.logs[:3]  # Last 3 errors
                    })
            except Exception as e:
                print(f"Error checking logs for {endpoint_id}: {e}")

        if alerts:
            print(f"\n🚨 ERROR ALERT - {datetime.utcnow().isoformat()}")
            for alert in alerts:
                print(f"Endpoint {alert['endpoint_id']}: {alert['error_count']} new errors")
                for error in alert['recent_errors']:
                    print(f" - {error.message}")
        return alerts

    print(f"Starting error monitoring for {len(endpoint_ids)} endpoints...")
    print(f"Check interval: {check_interval} seconds")
    try:
        while True:
            check_recent_errors()
            time.sleep(check_interval)
    except KeyboardInterrupt:
        print("\nStopping error monitoring...")
# Usage examples
if __name__ == "__main__":
    endpoint_id = "ep_1234567890abcdef"

    # Basic log retrieval
    logs = get_endpoint_logs(endpoint_id, hours_back=2)

    # Error analysis
    error_analysis = analyze_errors(endpoint_id, hours_back=24)

    # Performance analysis
    perf_data = analyze_performance(endpoint_id, hours_back=6)

    # Export to DataFrame for advanced analysis
    df = export_logs_to_dataframe(endpoint_id, hours_back=12)

    # Real-time monitoring (uncomment to run)
    # monitor_logs_realtime(endpoint_id)

    # Automated alerting (uncomment to run)
    # endpoints = ["ep_1234567890abcdef", "ep_2345678901bcdefg"]
    # setup_error_alerting(endpoints, check_interval=300)
JavaScript SDK
import { TensorOneClient } from "@tensorone/sdk";
import fs from 'fs';
const client = new TensorOneClient({ apiKey: "your_api_key" });
// Basic log retrieval
async function getEndpointLogs(endpointId, hoursBack = 1) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const logs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
level: 'info',
limit: 1000
});
console.log(`Retrieved ${logs.logs.length} logs for ${endpointId}`);
// Display summary
const summary = logs.summary;
console.log('Log levels:', summary.logLevels);
console.log('Sources:', summary.sources);
return logs;
}
// Real-time log monitoring with EventSource
async function monitorLogsRealtime(endpointId, options = {}) {
const {
onLog = console.log,
onError = console.error,
levelFilter = 'info',
reconnectDelay = 5000
} = options;
let reconnectTimeout;
function connect() {
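// Note: this assumes a Node.js EventSource implementation (e.g. the 'eventsource' package),
// which accepts custom headers; the browser-native EventSource API does not.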
const eventSource = new EventSource(
`https://api.tensorone.ai/v2/endpoints/${endpointId}/logs?stream=true&level=${levelFilter}`,
{
headers: {
'Authorization': `Bearer ${process.env.TENSORONE_API_KEY}`
}
}
);
eventSource.onmessage = (event) => {
try {
const logEntry = JSON.parse(event.data);
onLog(logEntry);
// Alert on errors
if (logEntry.level === 'error' || logEntry.level === 'fatal') {
console.warn(`🚨 ${logEntry.level.toUpperCase()}: ${logEntry.message}`);
}
} catch (error) {
console.error('Error parsing log entry:', error);
}
};
eventSource.onerror = (error) => {
console.error('Log stream error:', error);
eventSource.close();
// Reconnect after delay
console.log(`Reconnecting in ${reconnectDelay/1000} seconds...`);
reconnectTimeout = setTimeout(connect, reconnectDelay);
};
// Handle graceful shutdown
process.on('SIGINT', () => {
console.log('\nClosing log stream...');
eventSource.close();
if (reconnectTimeout) {
clearTimeout(reconnectTimeout);
}
process.exit(0);
});
return eventSource;
}
return connect();
}
// Error pattern analysis
async function analyzeErrorPatterns(endpointId, hoursBack = 24) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const errorLogs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
level: 'error',
limit: 5000
});
console.log(`Error Analysis for ${endpointId}`);
console.log(`Total errors in last ${hoursBack} hours: ${errorLogs.logs.length}`);
// Analyze error patterns
const errorCodes = {};
const errorSources = {};
const errorTimeline = {};
errorLogs.logs.forEach(log => {
// Count error codes
if (log.metadata?.errorCode) {
errorCodes[log.metadata.errorCode] = (errorCodes[log.metadata.errorCode] || 0) + 1;
}
// Count error sources
errorSources[log.source] = (errorSources[log.source] || 0) + 1;
// Create hourly timeline
const hour = new Date(log.timestamp).toISOString().substring(0, 13) + ':00:00Z';
errorTimeline[hour] = (errorTimeline[hour] || 0) + 1;
});
console.log('\nTop Error Codes:');
Object.entries(errorCodes)
.sort(([,a], [,b]) => b - a)
.forEach(([code, count]) => {
console.log(` ${code}: ${count}`);
});
console.log('\nErrors by Source:');
Object.entries(errorSources)
.sort(([,a], [,b]) => b - a)
.forEach(([source, count]) => {
console.log(` ${source}: ${count}`);
});
console.log('\nError Timeline (hourly):');
Object.entries(errorTimeline)
.sort(([a], [b]) => a.localeCompare(b))
.forEach(([hour, count]) => {
console.log(` ${hour}: ${count} errors`);
});
return {
errorCodes,
errorSources,
errorTimeline,
totalErrors: errorLogs.logs.length
};
}
// Performance metrics from logs
async function analyzePerformanceFromLogs(endpointId, hoursBack = 6) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const perfLogs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
search: 'duration|latency|performance|completed',
level: 'info',
limit: 10000
});
const durations = [];
const memoryUsage = [];
const gpuUtilization = [];
perfLogs.logs.forEach(log => {
if (log.duration) {
durations.push(log.duration);
}
if (log.metadata) {
// Extract memory usage
if (log.metadata.memoryUsage && typeof log.metadata.memoryUsage === 'string') {
const memoryMatch = log.metadata.memoryUsage.match(/(\d+\.?\d*)GB/);
if (memoryMatch) {
memoryUsage.push(parseFloat(memoryMatch[1]));
}
}
// Extract GPU utilization
if (typeof log.metadata.gpuUtilization === 'number') {
gpuUtilization.push(log.metadata.gpuUtilization);
}
}
});
const analysis = {
durations: {
count: durations.length,
average: durations.length ? durations.reduce((a, b) => a + b, 0) / durations.length : 0,
min: durations.length ? Math.min(...durations) : 0,
max: durations.length ? Math.max(...durations) : 0,
p95: durations.length ? percentile(durations, 95) : 0
},
memory: {
count: memoryUsage.length,
average: memoryUsage.length ? memoryUsage.reduce((a, b) => a + b, 0) / memoryUsage.length : 0,
peak: memoryUsage.length ? Math.max(...memoryUsage) : 0
},
gpu: {
count: gpuUtilization.length,
average: gpuUtilization.length ? gpuUtilization.reduce((a, b) => a + b, 0) / gpuUtilization.length : 0,
peak: gpuUtilization.length ? Math.max(...gpuUtilization) : 0
}
};
console.log('Performance Analysis:');
console.log(` Average Duration: ${analysis.durations.average.toFixed(2)}s`);
console.log(` P95 Duration: ${analysis.durations.p95.toFixed(2)}s`);
console.log(` Peak Memory Usage: ${analysis.memory.peak.toFixed(1)}GB`);
console.log(` Average GPU Utilization: ${analysis.gpu.average.toFixed(1)}%`);
return analysis;
}
// Helper function to calculate percentiles
function percentile(arr, p) {
const sorted = [...arr].sort((a, b) => a - b);
const index = (p / 100) * (sorted.length - 1);
const lower = Math.floor(index);
const upper = Math.ceil(index);
if (lower === upper) {
return sorted[lower];
}
const weight = index - lower;
return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}
// Log export functionality
async function exportLogs(endpointId, format = 'json', hoursBack = 24) {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - (hoursBack * 60 * 60 * 1000));
const logs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
format: format,
limit: 10000
});
const filename = `${endpointId}_logs_${startTime.toISOString().split('T')[0]}.${format}`;
if (format === 'json') {
fs.writeFileSync(filename, JSON.stringify(logs, null, 2));
} else {
fs.writeFileSync(filename, logs);
}
console.log(`Logs exported to ${filename}`);
return filename;
}
// Automated error alerting
class LogAlerting {
constructor(endpointIds, options = {}) {
this.endpointIds = endpointIds;
this.checkInterval = options.checkInterval || 300000; // 5 minutes
this.errorThreshold = options.errorThreshold || 5;
this.callbacks = {
onAlert: options.onAlert || this.defaultAlertHandler,
onError: options.onError || console.error
};
this.isRunning = false;
this.intervalId = null;
}
defaultAlertHandler(alerts) {
console.log(`\n🚨 ERROR ALERTS - ${new Date().toISOString()}`);
alerts.forEach(alert => {
console.log(`Endpoint ${alert.endpointId}: ${alert.errorCount} new errors`);
alert.recentErrors.slice(0, 3).forEach(error => {
console.log(` - ${error.message}`);
});
});
}
async checkForErrors() {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - this.checkInterval);
const alerts = [];
for (const endpointId of this.endpointIds) {
try {
const errorLogs = await client.endpoints.getLogs(endpointId, {
startTime: startTime.toISOString(),
endTime: endTime.toISOString(),
level: 'error',
limit: 100
});
if (errorLogs.logs.length >= this.errorThreshold) {
alerts.push({
endpointId,
errorCount: errorLogs.logs.length,
recentErrors: errorLogs.logs
});
}
} catch (error) {
this.callbacks.onError(`Error checking logs for ${endpointId}:`, error);
}
}
if (alerts.length > 0) {
this.callbacks.onAlert(alerts);
}
return alerts;
}
start() {
if (this.isRunning) {
console.log('Alerting is already running');
return;
}
console.log(`Starting error monitoring for ${this.endpointIds.length} endpoints...`);
console.log(`Check interval: ${this.checkInterval / 1000} seconds`);
console.log(`Error threshold: ${this.errorThreshold} errors per interval`);
this.isRunning = true;
this.intervalId = setInterval(() => {
this.checkForErrors();
}, this.checkInterval);
// Initial check
this.checkForErrors();
}
stop() {
if (!this.isRunning) {
console.log('Alerting is not running');
return;
}
console.log('Stopping error monitoring...');
this.isRunning = false;
if (this.intervalId) {
clearInterval(this.intervalId);
this.intervalId = null;
}
}
}
// Usage examples
async function main() {
const endpointId = "ep_1234567890abcdef";
const endpointIds = ["ep_1234567890abcdef", "ep_2345678901bcdefg"];
try {
// Basic log retrieval
const logs = await getEndpointLogs(endpointId, 2);
// Error analysis
const errorAnalysis = await analyzeErrorPatterns(endpointId, 24);
// Performance analysis
const perfAnalysis = await analyzePerformanceFromLogs(endpointId, 6);
// Export logs
await exportLogs(endpointId, 'json', 12);
// Set up automated alerting
const alerting = new LogAlerting(endpointIds, {
checkInterval: 300000, // 5 minutes
errorThreshold: 3,
onAlert: (alerts) => {
// Custom alert handling
console.log('Custom alert handler triggered!');
alerts.forEach(alert => {
console.log(`🔥 ${alert.endpointId}: ${alert.errorCount} errors!`);
});
}
});
// Start alerting (uncomment to run)
// alerting.start();
// Real-time monitoring (uncomment to run)
// monitorLogsRealtime(endpointId, {
// levelFilter: 'info',
// onLog: (log) => {
// console.log(`[${log.timestamp}] ${log.level}: ${log.message}`);
// }
// });
} catch (error) {
console.error("Log analysis error:", error);
}
}
main();
Use Cases
Production Debugging
- Error Investigation: Quickly identify and analyze production errors
- Performance Troubleshooting: Diagnose latency and throughput issues
- Resource Problems: Monitor memory leaks and resource exhaustion
- Integration Issues: Debug API calls and external service failures
Development and Testing
- Development Debugging: Monitor application behavior during development
- Load Testing: Analyze system behavior under load
- Performance Optimization: Identify optimization opportunities
- Quality Assurance: Verify correct application behavior
Operations and Monitoring
- Real-time Monitoring: Monitor system health in real-time
- Alerting Systems: Set up automated alerts for critical issues
- Compliance Auditing: Maintain audit trails for compliance requirements
- Capacity Planning: Analyze usage patterns for capacity planning
Business Intelligence
- Usage Analytics: Understand user behavior and usage patterns
- Performance Metrics: Track application performance over time
- Cost Analysis: Analyze operational costs and optimization opportunities
- Trend Analysis: Identify patterns and trends in application usage
Best Practices
Log Management
- Structured Logging: Use structured log formats for easier analysis
- Log Levels: Use appropriate log levels to control verbosity
- Retention Policies: Define retention policies based on compliance requirements
- Storage Optimization: Use appropriate storage tiers for different log types
Performance Considerations
- Filtering: Use specific filters to reduce data transfer and processing
- Pagination: Use pagination for large log sets to avoid timeouts
- Streaming: Use streaming for real-time monitoring instead of polling
- Caching: Cache log analysis results when appropriate
Security and Compliance
- Access Control: Implement proper access controls for sensitive logs
- Data Privacy: Ensure logs don’t contain sensitive personal information
- Audit Trails: Maintain audit trails for log access and modifications
- Encryption: Use encryption for sensitive log data
Monitoring and Alerting
- Proactive Monitoring: Set up proactive monitoring for critical issues
- Alert Thresholds: Set appropriate thresholds to avoid alert fatigue
- Escalation Procedures: Define clear escalation procedures for different alert types
- Integration: Integrate with existing monitoring and alerting systems
Logs are retained for 30 days by default. For longer retention, consider exporting logs to your own storage
system or upgrading to a plan with extended retention.
Log streaming consumes resources and should be used judiciously. Close streaming connections when not needed
to avoid unnecessary resource usage.
Use structured search patterns and metadata filtering to quickly find relevant logs. Consider setting up
automated log analysis pipelines for common debugging scenarios.
Authorizations
API key authentication. Use 'Bearer YOUR_API_KEY' format.