Overview
Cluster Logs provide comprehensive logging for GPU clusters, covering system, application, error, and audit logs. They are essential for debugging issues, monitoring application behavior, and maintaining audit compliance.

Endpoints
Get Logs
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs
Stream Logs
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/stream
Search Logs
POST https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/search
Download Logs
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/download
Get Logs
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `log_type` | string | No | Log type: `system`, `application`, `error`, `audit`, `docker`, `gpu` |
| `level` | string | No | Log level: `debug`, `info`, `warn`, `error`, `fatal` |
| `start_time` | string | No | Start time (ISO 8601 format) |
| `end_time` | string | No | End time (ISO 8601 format) |
| `limit` | integer | No | Maximum number of log entries (default: 100, max: 1000) |
| `offset` | integer | No | Number of entries to skip (for pagination) |
| `search` | string | No | Search term to filter logs |
| `source` | string | No | Log source: `container`, `host`, `gpu_driver`, `application` |
| `format` | string | No | Response format: `json` or `text` (default: `json`) |
| `tail` | boolean | No | Return the most recent entries first (default: `true`) |
Log Types
| Type | Description | Sources |
|---|---|---|
| `system` | System-level logs (kernel, services, drivers) | syslog, systemd, kernel |
| `application` | Application and user process logs | stdout, stderr, app logs |
| `error` | Error and exception logs | error handlers, crash reports |
| `audit` | Security and access audit logs | auth, API access, file access |
| `docker` | Container runtime logs | Docker daemon, container logs |
| `gpu` | GPU driver and CUDA logs | nvidia-smi, CUDA runtime |
Request Examples
```bash
# Get recent application logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?log_type=application&limit=50" \
-H "Authorization: Bearer YOUR_API_KEY"
# Get error logs from last hour
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?log_type=error&level=error&start_time=2024-01-15T15:00:00Z" \
-H "Authorization: Bearer YOUR_API_KEY"
# Search for specific terms in logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?search=CUDA&log_type=gpu" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
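The same queries can be issued from Python. The sketch below assumes the `requests` package and an API key exported as `TENSORONE_API_KEY`; the `get_cluster_logs` helper name is illustrative rather than part of an official SDK, and the later use-case examples assume a helper of this shape.

```python
import os

import requests

API_BASE = "https://api.tensorone.ai/v1"
# Assumes the API key is exported in the environment (illustrative convention)
API_KEY = os.environ["TENSORONE_API_KEY"]


def get_cluster_logs(cluster_id, **params):
    """Fetch log entries for a cluster; keyword arguments map to the query parameters above."""
    response = requests.get(
        f"{API_BASE}/clusters/{cluster_id}/logs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params=params,
        timeout=30,
    )
    return response.json()


# Example: the 50 most recent application log entries
logs = get_cluster_logs("cluster_abc123", log_type="application", limit=50)
if logs.get("success"):
    for entry in logs["data"]["logs"]:
        print(entry["timestamp"], entry["level"], entry["message"])
```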
Log Search
Advanced log search supports complex queries and filtering.

```bash
# Search logs with complex query
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "level:error AND (CUDA OR memory)",
"time_range": {
"start": "2024-01-15T00:00:00Z",
"end": "2024-01-15T23:59:59Z"
},
"limit": 100,
"sort": "timestamp desc"
  }'
```
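A minimal Python equivalent of the same search call, under the same assumptions as the helper above (the `search_cluster_logs` name is illustrative):

```python
import requests


def search_cluster_logs(cluster_id, query, start, end, limit=100, api_key="YOUR_API_KEY"):
    """POST a structured query to the log search endpoint and return the parsed response."""
    payload = {
        "query": query,
        "time_range": {"start": start, "end": end},
        "limit": limit,
        "sort": "timestamp desc",
    }
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=30,
    )
    return response.json()


# Example: error-level entries mentioning CUDA or memory over one day
results = search_cluster_logs(
    "cluster_abc123",
    "level:error AND (CUDA OR memory)",
    "2024-01-15T00:00:00Z",
    "2024-01-15T23:59:59Z",
)
```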
Response Schema
```json
{
"success": true,
"data": {
"cluster_id": "cluster_abc123",
"logs": [
{
"id": "log_entry_123",
"timestamp": "2024-01-15T16:45:23.456Z",
"level": "info",
"source": "application",
"log_type": "application",
"message": "Training step 1500: loss=0.234, accuracy=94.2%",
"process": {
"pid": 1234,
"name": "python",
"command": "python train.py --model gpt2"
},
"metadata": {
"host": "gpu-node-01",
"container_id": "abc123def456",
"thread_id": "MainThread",
"file": "train.py",
"line": 142
},
"tags": ["training", "metrics"]
},
{
"id": "log_entry_124",
"timestamp": "2024-01-15T16:45:20.123Z",
"level": "error",
"source": "gpu_driver",
"log_type": "gpu",
"message": "CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 79.20 GiB total capacity)",
"process": {
"pid": 1234,
"name": "python"
},
"metadata": {
"host": "gpu-node-01",
"gpu_id": 0,
"cuda_version": "12.1",
"driver_version": "530.30.02"
},
"tags": ["cuda", "memory", "error"],
"stack_trace": [
"File \"train.py\", line 89, in forward",
"return self.model(inputs)",
"RuntimeError: CUDA out of memory"
]
}
],
"pagination": {
"total_count": 15420,
"current_page": 1,
"total_pages": 155,
"has_next": true,
"has_previous": false
},
"query_info": {
"log_type": "application",
"time_range": {
"start": "2024-01-15T16:00:00Z",
"end": "2024-01-15T17:00:00Z"
},
"filters_applied": ["log_type", "time_range"],
"search_terms": null
}
},
"meta": {
"request_id": "req_logs_456",
"query_time_ms": 127,
"log_retention_days": 30
}
}
```
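Result sets larger than one page can be walked with `limit` and `offset`, stopping when `pagination.has_next` is false. A minimal sketch, reusing the hypothetical `get_cluster_logs` helper from the request examples:

```python
def iter_cluster_logs(cluster_id, page_size=500, **filters):
    """Yield log entries page by page using limit/offset pagination."""
    offset = 0
    while True:
        page = get_cluster_logs(cluster_id, limit=page_size, offset=offset, **filters)
        if not page.get("success"):
            break
        data = page["data"]
        for entry in data["logs"]:
            yield entry
        # Stop once the API reports no further pages
        if not data["pagination"]["has_next"]:
            break
        offset += page_size


# Example: count error-level entries for one cluster
error_count = sum(1 for _ in iter_cluster_logs("cluster_abc123", log_type="error"))
print(f"Error entries: {error_count}")
```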
Download Logs
Export logs for offline analysis or archival.

```bash
# Download logs as compressed archive
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/download?format=json&compression=gzip&start_time=2024-01-15T00:00:00Z&end_time=2024-01-15T23:59:59Z" \
-H "Authorization: Bearer YOUR_API_KEY" \
  --output cluster_logs.json.gz
```
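The same download can be scripted in Python and unpacked with the standard `gzip` module. This is a sketch using the query parameters from the curl call above; it assumes the archive decompresses to a single JSON document, which may not match the actual export layout.

```python
import gzip
import json

import requests

params = {
    "format": "json",
    "compression": "gzip",
    "start_time": "2024-01-15T00:00:00Z",
    "end_time": "2024-01-15T23:59:59Z",
}

# Stream the compressed archive to disk
with requests.get(
    "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/download",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params=params,
    stream=True,
    timeout=120,
) as response:
    response.raise_for_status()
    with open("cluster_logs.json.gz", "wb") as fh:
        for chunk in response.iter_content(chunk_size=1 << 20):
            fh.write(chunk)

# Decompress and parse locally (assumes a single JSON document inside the archive)
with gzip.open("cluster_logs.json.gz", "rt") as fh:
    archived_logs = json.load(fh)
```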
Use Cases
Training Progress Monitoring
Parse training metrics out of application logs to track progress and surface problems. The example assumes a `get_cluster_logs` helper like the sketch shown with the request examples above.

```python
import json
import re
import statistics
import threading
import time
from datetime import datetime


class TrainingProgressMonitor:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.progress_history = []
        self.monitoring = False
        self.alert_thresholds = {
            "loss_stagnation_steps": 100,
            "error_rate_threshold": 0.1,
            "memory_usage_threshold": 0.95
        }

    def monitor_training_session(self, duration_minutes=60, check_interval=30):
        """Monitor a training session with periodic checks in a background thread"""
        self.monitoring = True
        end_time = time.time() + (duration_minutes * 60)

        def monitoring_loop():
            while self.monitoring and time.time() < end_time:
                try:
                    self.check_training_progress()
                    time.sleep(check_interval)
                except Exception as e:
                    print(f"Monitoring error: {e}")
                    time.sleep(check_interval)

        monitor_thread = threading.Thread(target=monitoring_loop)
        monitor_thread.daemon = True
        monitor_thread.start()
        return monitor_thread

    def check_training_progress(self):
        """Check current training progress from logs"""
        # Get recent application logs mentioning "loss"
        logs = get_cluster_logs(
            self.cluster_id,
            log_type="application",
            limit=50,
            search="loss"
        )
        if not logs["success"]:
            return

        current_progress = {
            "timestamp": time.time(),
            "loss_values": [],
            "accuracy_values": [],
            "error_count": 0,
            "memory_warnings": 0
        }
        for log_entry in logs["data"]["logs"]:
            message = log_entry["message"]
            # Extract loss values
            loss_match = re.search(r'loss[:\s=]+([0-9.]+)', message, re.IGNORECASE)
            if loss_match:
                current_progress["loss_values"].append(float(loss_match.group(1)))
            # Extract accuracy values
            acc_match = re.search(r'acc(?:uracy)?[:\s=]+([0-9.]+)', message, re.IGNORECASE)
            if acc_match:
                current_progress["accuracy_values"].append(float(acc_match.group(1)))
            # Count errors
            if log_entry["level"] in ["error", "fatal"]:
                current_progress["error_count"] += 1
            # Check for memory warnings ("warn" is the documented level name)
            if "memory" in message.lower() and log_entry["level"] == "warn":
                current_progress["memory_warnings"] += 1

        self.progress_history.append(current_progress)
        # Check for alerts
        self.check_training_alerts(current_progress)
        # Print progress summary
        if current_progress["loss_values"]:
            avg_loss = sum(current_progress["loss_values"]) / len(current_progress["loss_values"])
            print(f"[{datetime.now().strftime('%H:%M:%S')}] Avg Loss: {avg_loss:.4f}, "
                  f"Errors: {current_progress['error_count']}")

    def check_training_alerts(self, current_progress):
        """Check for training issues and generate alerts"""
        alerts = []
        # Check loss stagnation across the last few progress checks
        if len(self.progress_history) >= 3:
            recent_losses = []
            for progress in self.progress_history[-3:]:
                if progress["loss_values"]:
                    recent_losses.extend(progress["loss_values"])
            if len(recent_losses) >= 5:
                loss_variance = statistics.variance(recent_losses)
                if loss_variance < 0.0001:  # Very low variance indicates stagnation
                    alerts.append("Loss appears to be stagnating")
        # Check error rate (assumes roughly 50 log entries per check)
        total_errors = sum(p["error_count"] for p in self.progress_history[-5:])
        if total_errors > self.alert_thresholds["error_rate_threshold"] * 50:
            alerts.append(f"High error rate detected: {total_errors} errors in recent logs")
        # Check memory warnings
        if current_progress["memory_warnings"] > 0:
            alerts.append(f"Memory warnings detected: {current_progress['memory_warnings']}")
        if alerts:
            print("🚨 TRAINING ALERTS:")
            for alert in alerts:
                print(f"  • {alert}")

    def stop_monitoring(self):
        """Stop the monitoring process"""
        self.monitoring = False

    def get_training_summary(self):
        """Get a comprehensive training summary"""
        if not self.progress_history:
            return {"error": "No progress data available"}

        all_losses = []
        all_accuracies = []
        total_errors = 0
        for progress in self.progress_history:
            # Log entries arrive newest-first, so reverse them into chronological order
            all_losses.extend(reversed(progress["loss_values"]))
            all_accuracies.extend(reversed(progress["accuracy_values"]))
            total_errors += progress["error_count"]

        summary = {
            "monitoring_duration_minutes": (time.time() - self.progress_history[0]["timestamp"]) / 60,
            "total_progress_checks": len(self.progress_history),
            "loss_analysis": {},
            "accuracy_analysis": {},
            "error_analysis": {
                "total_errors": total_errors,
                "error_rate": total_errors / len(self.progress_history)
            }
        }
        if all_losses:
            summary["loss_analysis"] = {
                "first_loss": all_losses[0],
                "last_loss": all_losses[-1],
                "min_loss": min(all_losses),
                "avg_loss": sum(all_losses) / len(all_losses),
                "loss_improvement": all_losses[0] - all_losses[-1],
                "loss_trend": "improving" if all_losses[-1] < all_losses[0] else "degrading"
            }
        if all_accuracies:
            summary["accuracy_analysis"] = {
                "first_accuracy": all_accuracies[0],
                "last_accuracy": all_accuracies[-1],
                "max_accuracy": max(all_accuracies),
                "avg_accuracy": sum(all_accuracies) / len(all_accuracies),
                "accuracy_improvement": all_accuracies[-1] - all_accuracies[0]
            }
        return summary


# Usage
monitor = TrainingProgressMonitor("cluster_training_001")
monitor_thread = monitor.monitor_training_session(duration_minutes=120, check_interval=60)

# Wait for the monitoring window to finish, then summarize
monitor_thread.join()
summary = monitor.get_training_summary()
print("Training Summary:", json.dumps(summary, indent=2))
```
Error Analysis and Debugging
Comprehensive error analysis for troubleshooting cluster issues. The example assumes a `getClusterLogs(clusterId, params)` helper that wraps the GET logs endpoint and returns the parsed JSON response (analogous to the Python helper sketched earlier).

```javascript
class ClusterDebugger {
constructor(clusterId) {
this.clusterId = clusterId;
}
async performComprehensiveDebug() {
console.log(`🔍 Starting comprehensive debug for cluster ${this.clusterId}`);
const debugReport = {
cluster_id: this.clusterId,
debug_timestamp: new Date().toISOString(),
analysis: {
system_health: await this.analyzeSystemHealth(),
gpu_health: await this.analyzeGPUHealth(),
application_health: await this.analyzeApplicationHealth(),
network_health: await this.analyzeNetworkHealth(),
resource_usage: await this.analyzeResourceUsage()
},
recommendations: []
};
// Generate recommendations based on findings
debugReport.recommendations = this.generateRecommendations(debugReport.analysis);
return debugReport;
}
async analyzeSystemHealth() {
const systemLogs = await getClusterLogs(this.clusterId, {
log_type: 'system',
level: 'error',
limit: 100,
start_time: new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString() // Last 2 hours
});
const analysis = {
error_count: 0,
critical_issues: [],
kernel_issues: [],
service_issues: [],
health_score: 100
};
if (systemLogs.success) {
analysis.error_count = systemLogs.data.logs.length;
systemLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('kernel') || message.includes('panic')) {
analysis.kernel_issues.push(log);
analysis.health_score -= 20;
} else if (message.includes('service') || message.includes('daemon')) {
analysis.service_issues.push(log);
analysis.health_score -= 10;
} else if (message.includes('critical') || message.includes('fatal')) {
analysis.critical_issues.push(log);
analysis.health_score -= 15;
}
});
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 80 ? 'healthy' :
analysis.health_score > 50 ? 'degraded' : 'critical';
return analysis;
}
async analyzeGPUHealth() {
const gpuLogs = await getClusterLogs(this.clusterId, {
log_type: 'gpu',
limit: 200,
start_time: new Date(Date.now() - 60 * 60 * 1000).toISOString() // Last hour
});
const analysis = {
cuda_errors: [],
memory_errors: [],
temperature_warnings: [],
driver_issues: [],
health_score: 100
};
if (gpuLogs.success) {
gpuLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('cuda') && message.includes('error')) {
analysis.cuda_errors.push(log);
analysis.health_score -= 15;
} else if (message.includes('out of memory') || message.includes('oom')) {
analysis.memory_errors.push(log);
analysis.health_score -= 20;
} else if (message.includes('temperature') && message.includes('high')) {
analysis.temperature_warnings.push(log);
analysis.health_score -= 10;
} else if (message.includes('driver')) {
analysis.driver_issues.push(log);
analysis.health_score -= 10;
}
});
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 80 ? 'healthy' :
analysis.health_score > 50 ? 'degraded' : 'critical';
return analysis;
}
async analyzeApplicationHealth() {
// Search for application errors and exceptions
const searchResults = await this.searchLogs(
'level:error AND source:application',
{
start: new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString(),
end: new Date().toISOString()
}
);
const analysis = {
total_errors: 0,
exception_types: {},
frequent_errors: {},
performance_issues: [],
health_score: 100
};
if (searchResults.success) {
analysis.total_errors = searchResults.data.total_matches;
searchResults.data.logs.forEach(log => {
const message = log.message;
// Categorize exceptions
const exceptionMatch = message.match(/(\w+Error|\w+Exception)/);
if (exceptionMatch) {
const exceptionType = exceptionMatch[1];
analysis.exception_types[exceptionType] =
(analysis.exception_types[exceptionType] || 0) + 1;
}
// Look for performance issues
if (message.includes('timeout') || message.includes('slow') ||
message.includes('performance')) {
analysis.performance_issues.push(log);
}
// Count frequent error patterns
const errorPattern = message.substring(0, 50);
analysis.frequent_errors[errorPattern] =
(analysis.frequent_errors[errorPattern] || 0) + 1;
});
// Adjust health score based on error frequency
if (analysis.total_errors > 50) {
analysis.health_score -= 30;
} else if (analysis.total_errors > 20) {
analysis.health_score -= 20;
} else if (analysis.total_errors > 10) {
analysis.health_score -= 10;
}
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 80 ? 'healthy' :
analysis.health_score > 50 ? 'degraded' : 'critical';
return analysis;
}
async analyzeNetworkHealth() {
const networkLogs = await getClusterLogs(this.clusterId, {
search: 'network',
level: 'error',
limit: 50,
start_time: new Date(Date.now() - 60 * 60 * 1000).toISOString()
});
const analysis = {
connection_errors: [],
timeout_errors: [],
dns_issues: [],
health_score: 100
};
if (networkLogs.success) {
networkLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('connection') && message.includes('error')) {
analysis.connection_errors.push(log);
analysis.health_score -= 10;
} else if (message.includes('timeout')) {
analysis.timeout_errors.push(log);
analysis.health_score -= 5;
} else if (message.includes('dns') || message.includes('resolve')) {
analysis.dns_issues.push(log);
analysis.health_score -= 5;
}
});
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 90 ? 'healthy' :
analysis.health_score > 70 ? 'degraded' : 'critical';
return analysis;
}
async analyzeResourceUsage() {
// Look for resource-related log messages
const resourceLogs = await getClusterLogs(this.clusterId, {
search: 'memory OR disk OR cpu',
limit: 100,
start_time: new Date(Date.now() - 30 * 60 * 1000).toISOString() // Last 30 minutes
});
const analysis = {
memory_warnings: [],
disk_warnings: [],
cpu_warnings: [],
resource_exhaustion: []
};
if (resourceLogs.success) {
resourceLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('memory') && (message.includes('low') || message.includes('warning'))) {
analysis.memory_warnings.push(log);
} else if (message.includes('disk') && (message.includes('full') || message.includes('space'))) {
analysis.disk_warnings.push(log);
} else if (message.includes('cpu') && message.includes('high')) {
analysis.cpu_warnings.push(log);
} else if (message.includes('exhausted') || message.includes('limit')) {
analysis.resource_exhaustion.push(log);
}
});
}
return analysis;
}
generateRecommendations(analysis) {
const recommendations = [];
// System health recommendations
if (analysis.system_health.status === 'critical') {
recommendations.push({
priority: 'high',
category: 'system',
title: 'Critical System Issues Detected',
description: `${analysis.system_health.critical_issues.length} critical system issues found`,
action: 'Restart cluster or contact support immediately'
});
}
// GPU health recommendations
if (analysis.gpu_health.cuda_errors.length > 0) {
recommendations.push({
priority: 'high',
category: 'gpu',
title: 'CUDA Errors Detected',
description: `${analysis.gpu_health.cuda_errors.length} CUDA errors found`,
action: 'Check CUDA version compatibility and driver installation'
});
}
if (analysis.gpu_health.memory_errors.length > 0) {
recommendations.push({
priority: 'medium',
category: 'gpu',
title: 'GPU Memory Issues',
description: `${analysis.gpu_health.memory_errors.length} memory errors found`,
action: 'Reduce batch size or model complexity, or upgrade to higher memory GPU'
});
}
// Application health recommendations
if (analysis.application_health.total_errors > 20) {
recommendations.push({
priority: 'medium',
category: 'application',
title: 'High Application Error Rate',
description: `${analysis.application_health.total_errors} application errors in last 2 hours`,
action: 'Review application logs and fix recurring issues'
});
}
return recommendations;
}
async searchLogs(query, timeRange) {
const response = await fetch(
`https://api.tensorone.ai/v1/clusters/${this.clusterId}/logs/search`,
{
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
query: query,
time_range: timeRange,
limit: 100
})
}
);
return await response.json();
}
async generateDebugReport() {
const report = await this.performComprehensiveDebug();
console.log('\n=== Cluster Debug Report ===');
console.log(`Cluster: ${report.cluster_id}`);
console.log(`Analysis Time: ${report.debug_timestamp}`);
// Overall health summary
const healthScores = [
report.analysis.system_health.health_score,
report.analysis.gpu_health.health_score,
report.analysis.application_health.health_score,
report.analysis.network_health.health_score
];
const overallHealth = healthScores.reduce((sum, score) => sum + score, 0) / healthScores.length;
console.log(`\n📊 Overall Health Score: ${overallHealth.toFixed(1)}/100`);
// Component health
console.log('\n🔧 Component Health:');
console.log(` System: ${report.analysis.system_health.status} (${report.analysis.system_health.health_score}/100)`);
console.log(` GPU: ${report.analysis.gpu_health.status} (${report.analysis.gpu_health.health_score}/100)`);
console.log(` Application: ${report.analysis.application_health.status} (${report.analysis.application_health.health_score}/100)`);
console.log(` Network: ${report.analysis.network_health.status} (${report.analysis.network_health.health_score}/100)`);
// Recommendations
if (report.recommendations.length > 0) {
console.log('\n💡 Recommendations:');
report.recommendations.forEach((rec, index) => {
console.log(` ${index + 1}. [${rec.priority.toUpperCase()}] ${rec.title}`);
console.log(` ${rec.description}`);
console.log(` Action: ${rec.action}`);
});
} else {
console.log('\n✅ No critical issues found');
}
return report;
}
}
// Usage ("debugger" is a reserved word in JavaScript, so use a different variable name)
const clusterDebugger = new ClusterDebugger('cluster_abc123');
await clusterDebugger.generateDebugReport();
```
Error Handling
```json
{
"success": false,
"error": {
"code": "LOGS_NOT_AVAILABLE",
"message": "Logs are not available for the specified time range",
"details": {
"requested_start": "2024-01-01T00:00:00Z",
"requested_end": "2024-01-02T00:00:00Z",
"available_start": "2024-01-14T00:00:00Z",
"log_retention_days": 30
}
}
}
```
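Clients should check the `success` flag and branch on `error.code`. The sketch below handles `LOGS_NOT_AVAILABLE` by clamping the requested window to the `available_start` reported in `details`; it reuses the hypothetical `get_cluster_logs` helper from the request examples.

```python
def get_logs_within_retention(cluster_id, start_time, end_time, **filters):
    """Retry a logs query with the start clamped to the available retention window."""
    result = get_cluster_logs(cluster_id, start_time=start_time, end_time=end_time, **filters)
    if result.get("success"):
        return result

    error = result.get("error", {})
    if error.get("code") == "LOGS_NOT_AVAILABLE":
        available_start = error.get("details", {}).get("available_start")
        if available_start:
            print(f"Requested range outside retention; retrying from {available_start}")
            return get_cluster_logs(
                cluster_id, start_time=available_start, end_time=end_time, **filters
            )
    return result
```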
Best Practices
- Log Level Management: Use appropriate log levels and level filters to reduce noise
- Time Range Optimization: Query specific time ranges for faster responses
- Search Efficiency: Use structured search queries for complex log analysis
- Regular Monitoring: Set up automated log monitoring for critical issues (a polling sketch follows this list)
- Log Retention: Understand the log retention policy (30 days in the examples above) before relying on historical queries
- Export Strategy: Export important logs regularly for compliance and offline analysis
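As a starting point for the monitoring practice above, the sketch below polls for new error-level entries on a fixed interval; the interval, the print-based alerting, and the hypothetical `get_cluster_logs` helper are all assumptions to adapt.

```python
import datetime
import time


def poll_for_errors(cluster_id, interval_seconds=300):
    """Poll the logs endpoint for new error-level entries and print a simple alert."""
    last_check = datetime.datetime.now(datetime.timezone.utc)
    while True:
        start = last_check.strftime("%Y-%m-%dT%H:%M:%SZ")
        last_check = datetime.datetime.now(datetime.timezone.utc)
        result = get_cluster_logs(cluster_id, level="error", start_time=start, limit=200)
        if result.get("success") and result["data"]["logs"]:
            print(f"{len(result['data']['logs'])} new error entries since {start}")
        time.sleep(interval_seconds)
```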
Authorizations
API key authentication. Send your key in the Authorization header using the Bearer YOUR_API_KEY format.