Get Cluster Logs
curl --request GET \
  --url https://api.tensorone.ai/v1/clusters/{cluster_id}/logs \
  --header 'Authorization: Bearer <api-key>'
{
  "logs": [
    "<string>"
  ]
}

Overview

Cluster Logs provide comprehensive logging for GPU clusters, covering system, application, error, and audit logs. They are essential for debugging issues, monitoring application behavior, and maintaining audit compliance.

Endpoints

Get Logs

GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs

Stream Logs

GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/stream

Search Logs

POST https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/search

Download Logs

GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/download

Get Logs

Query Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| log_type | string | No | Log type: system, application, error, audit, docker, gpu |
| level | string | No | Log level: debug, info, warn, error, fatal |
| start_time | string | No | Start time (ISO 8601 format) |
| end_time | string | No | End time (ISO 8601 format) |
| limit | integer | No | Maximum number of log entries (default: 100, max: 1000) |
| offset | integer | No | Number of entries to skip (for pagination) |
| search | string | No | Search term to filter logs |
| source | string | No | Log source: container, host, gpu_driver, application |
| format | string | No | Response format: json, text (default: json) |
| tail | boolean | No | Get most recent logs first (default: true) |

Log Types

| Type | Description | Sources |
|------|-------------|---------|
| system | System-level logs (kernel, services, drivers) | syslog, systemd, kernel |
| application | Application and user process logs | stdout, stderr, app logs |
| error | Error and exception logs | error handlers, crash reports |
| audit | Security and access audit logs | auth, API access, file access |
| docker | Container runtime logs | Docker daemon, container logs |
| gpu | GPU driver and CUDA logs | nvidia-smi, CUDA runtime |

Request Examples

# Get recent application logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?log_type=application&limit=50" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get error logs from last hour
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?log_type=error&level=error&start_time=2024-01-15T15:00:00Z" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Search for specific terms in logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?search=CUDA&log_type=gpu" \
  -H "Authorization: Bearer YOUR_API_KEY"
Search Logs

Advanced log search for complex queries and filtering.
# Search logs with complex query
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/search" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "level:error AND (CUDA OR memory)",
    "time_range": {
      "start": "2024-01-15T00:00:00Z",
      "end": "2024-01-15T23:59:59Z"
    },
    "limit": 100,
    "sort": "timestamp desc"
  }'
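
The search endpoint can also be called from Python; a minimal sketch, reusing the API_BASE and API_KEY setup from the get_cluster_logs helper above (the query string and time_range fields mirror the curl example):

def search_cluster_logs(cluster_id, query, start, end, limit=100):
    """POST a structured query to the log search endpoint and return the parsed JSON."""
    response = requests.post(
        f"{API_BASE}/clusters/{cluster_id}/logs/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "query": query,
            "time_range": {"start": start, "end": end},
            "limit": limit,
            "sort": "timestamp desc",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

results = search_cluster_logs(
    "cluster_abc123",
    "level:error AND (CUDA OR memory)",
    "2024-01-15T00:00:00Z",
    "2024-01-15T23:59:59Z",
)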

Response Schema

{
  "success": true,
  "data": {
    "cluster_id": "cluster_abc123",
    "logs": [
      {
        "id": "log_entry_123",
        "timestamp": "2024-01-15T16:45:23.456Z",
        "level": "info",
        "source": "application",
        "log_type": "application",
        "message": "Training step 1500: loss=0.234, accuracy=94.2%",
        "process": {
          "pid": 1234,
          "name": "python",
          "command": "python train.py --model gpt2"
        },
        "metadata": {
          "host": "gpu-node-01",
          "container_id": "abc123def456",
          "thread_id": "MainThread",
          "file": "train.py",
          "line": 142
        },
        "tags": ["training", "metrics"]
      },
      {
        "id": "log_entry_124",
        "timestamp": "2024-01-15T16:45:20.123Z",
        "level": "error",
        "source": "gpu_driver",
        "log_type": "gpu",
        "message": "CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 79.20 GiB total capacity)",
        "process": {
          "pid": 1234,
          "name": "python"
        },
        "metadata": {
          "host": "gpu-node-01",
          "gpu_id": 0,
          "cuda_version": "12.1",
          "driver_version": "530.30.02"
        },
        "tags": ["cuda", "memory", "error"],
        "stack_trace": [
          "File \"train.py\", line 89, in forward",
          "return self.model(inputs)",
          "RuntimeError: CUDA out of memory"
        ]
      }
    ],
    "pagination": {
      "total_count": 15420,
      "current_page": 1,
      "total_pages": 155,
      "has_next": true,
      "has_previous": false
    },
    "query_info": {
      "log_type": "application",
      "time_range": {
        "start": "2024-01-15T16:00:00Z",
        "end": "2024-01-15T17:00:00Z"
      },
      "filters_applied": ["log_type", "time_range"],
      "search_terms": null
    }
  },
  "meta": {
    "request_id": "req_logs_456",
    "query_time_ms": 127,
    "log_retention_days": 30
  }
}
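
The pagination block makes it possible to page through large result sets by combining the limit and offset query parameters with has_next. A minimal sketch, building on the get_cluster_logs helper above (field names taken from the schema; the page size is an arbitrary choice):

def iter_all_logs(cluster_id, page_size=1000, **filters):
    """Yield log entries page by page until the API reports no further pages."""
    offset = 0
    while True:
        page = get_cluster_logs(cluster_id, limit=page_size, offset=offset, **filters)
        entries = page["data"]["logs"]
        yield from entries
        if not page["data"]["pagination"]["has_next"]:
            break
        offset += len(entries)

# Example: count CUDA-related entries across all error logs
cuda_errors = sum(1 for entry in iter_all_logs("cluster_abc123", log_type="error") if "CUDA" in entry["message"])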

Download Logs

Export logs for offline analysis or archival purposes.
# Download logs as compressed archive
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/download?format=json&compression=gzip&start_time=2024-01-15T00:00:00Z&end_time=2024-01-15T23:59:59Z" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  --output cluster_logs.json.gz
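
The same archive can be fetched and read back from Python; a minimal sketch, again reusing the helper setup above (it assumes the downloaded archive is a single gzip-compressed JSON document, as requested by format=json and compression=gzip):

import gzip
import json

def download_logs(cluster_id, start_time, end_time, path="cluster_logs.json.gz"):
    """Stream the compressed log archive to disk and return the parsed entries."""
    response = requests.get(
        f"{API_BASE}/clusters/{cluster_id}/logs/download",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "format": "json",
            "compression": "gzip",
            "start_time": start_time,
            "end_time": end_time,
        },
        stream=True,
        timeout=300,
    )
    response.raise_for_status()
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
    with gzip.open(path, "rt") as f:
        return json.load(f)  # adjust if the archive turns out to be newline-delimited JSON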

Use Cases

Training Progress Monitoring

Monitor ML training progress through application logs.
import json
import re
import statistics
import threading
import time
from datetime import datetime

# Assumes the get_cluster_logs() helper defined earlier on this page.
class TrainingProgressMonitor:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.progress_history = []
        self.alert_thresholds = {
            "loss_stagnation_steps": 100,
            "error_rate_threshold": 0.1,
            "memory_usage_threshold": 0.95
        }
    
    def monitor_training_session(self, duration_minutes=60, check_interval=30):
        """Monitor training session with periodic checks"""
        
        self.monitoring = True
        end_time = time.time() + (duration_minutes * 60)
        
        def monitoring_loop():
            while self.monitoring and time.time() < end_time:
                try:
                    self.check_training_progress()
                    time.sleep(check_interval)
                except Exception as e:
                    print(f"Monitoring error: {e}")
                    time.sleep(check_interval)
        
        monitor_thread = threading.Thread(target=monitoring_loop)
        monitor_thread.daemon = True
        monitor_thread.start()
        
        return monitor_thread
    
    def check_training_progress(self):
        """Check current training progress from logs"""
        
        # Get recent application logs
        logs = get_cluster_logs(
            self.cluster_id,
            log_type="application",
            limit=50,
            search="loss"
        )
        
        if not logs["success"]:
            return
        
        current_progress = {
            "timestamp": time.time(),
            "loss_values": [],
            "accuracy_values": [],
            "error_count": 0,
            "memory_warnings": 0
        }
        
        # tail=true (the default) returns the most recent entries first; reverse for chronological order
        for log_entry in reversed(logs["data"]["logs"]):
            message = log_entry["message"]
            
            # Extract loss values
            loss_match = re.search(r'loss[:\s=]+([0-9.]+)', message, re.IGNORECASE)
            if loss_match:
                current_progress["loss_values"].append(float(loss_match.group(1)))
            
            # Extract accuracy values
            acc_match = re.search(r'acc(?:uracy)?[:\s=]+([0-9.]+)', message, re.IGNORECASE)
            if acc_match:
                current_progress["accuracy_values"].append(float(acc_match.group(1)))
            
            # Count errors
            if log_entry["level"] in ["error", "fatal"]:
                current_progress["error_count"] += 1
            
            # Check for memory warnings
            if "memory" in message.lower() and "warning" in log_entry["level"]:
                current_progress["memory_warnings"] += 1
        
        self.progress_history.append(current_progress)
        
        # Check for alerts
        self.check_training_alerts(current_progress)
        
        # Print progress summary
        if current_progress["loss_values"]:
            avg_loss = sum(current_progress["loss_values"]) / len(current_progress["loss_values"])
            print(f"[{datetime.now().strftime('%H:%M:%S')}] Avg Loss: {avg_loss:.4f}, Errors: {current_progress['error_count']}")
    
    def check_training_alerts(self, current_progress):
        """Check for training issues and generate alerts"""
        
        alerts = []
        
        # Check loss stagnation
        if len(self.progress_history) >= 3:
            recent_losses = []
            for progress in self.progress_history[-3:]:
                if progress["loss_values"]:
                    recent_losses.extend(progress["loss_values"])
            
            if len(recent_losses) >= 5:
                loss_variance = statistics.variance(recent_losses)
                if loss_variance < 0.0001:  # Very low variance indicates stagnation
                    alerts.append("Loss appears to be stagnating")
        
        # Check error rate
        total_errors = sum(p["error_count"] for p in self.progress_history[-5:])
        if total_errors > self.alert_thresholds["error_rate_threshold"] * 50:  # Assuming 50 logs checked
            alerts.append(f"High error rate detected: {total_errors} errors in recent logs")
        
        # Check memory warnings
        if current_progress["memory_warnings"] > 0:
            alerts.append(f"Memory warnings detected: {current_progress['memory_warnings']}")
        
        if alerts:
            print(f"🚨 TRAINING ALERTS:")
            for alert in alerts:
                print(f"   • {alert}")
    
    def stop_monitoring(self):
        """Stop the monitoring process"""
        self.monitoring = False
    
    def get_training_summary(self):
        """Get comprehensive training summary"""
        
        if not self.progress_history:
            return {"error": "No progress data available"}
        
        all_losses = []
        all_accuracies = []
        total_errors = 0
        
        for progress in self.progress_history:
            all_losses.extend(progress["loss_values"])
            all_accuracies.extend(progress["accuracy_values"])
            total_errors += progress["error_count"]
        
        summary = {
            "monitoring_duration_minutes": (time.time() - self.progress_history[0]["timestamp"]) / 60,
            "total_progress_checks": len(self.progress_history),
            "loss_analysis": {},
            "accuracy_analysis": {},
            "error_analysis": {
                "total_errors": total_errors,
                "error_rate": total_errors / len(self.progress_history)
            }
        }
        
        if all_losses:
            summary["loss_analysis"] = {
                "first_loss": all_losses[0],
                "last_loss": all_losses[-1],
                "min_loss": min(all_losses),
                "avg_loss": sum(all_losses) / len(all_losses),
                "loss_improvement": all_losses[0] - all_losses[-1],
                "loss_trend": "improving" if all_losses[-1] < all_losses[0] else "degrading"
            }
        
        if all_accuracies:
            summary["accuracy_analysis"] = {
                "first_accuracy": all_accuracies[0],
                "last_accuracy": all_accuracies[-1],
                "max_accuracy": max(all_accuracies),
                "avg_accuracy": sum(all_accuracies) / len(all_accuracies),
                "accuracy_improvement": all_accuracies[-1] - all_accuracies[0]
            }
        
        return summary

# Usage
monitor = TrainingProgressMonitor("cluster_training_001")
monitor_thread = monitor.monitor_training_session(duration_minutes=120, check_interval=60)

# Get summary after training
summary = monitor.get_training_summary()
print("Training Summary:", json.dumps(summary, indent=2))

Error Analysis and Debugging

Comprehensive error analysis for troubleshooting cluster issues. The example below assumes a getClusterLogs(clusterId, params) helper that wraps the Get Logs endpoint and returns the parsed JSON response, analogous to the Python get_cluster_logs helper shown earlier.
class ClusterDebugger {
  constructor(clusterId) {
    this.clusterId = clusterId;
  }
  
  async performComprehensiveDebug() {
    console.log(`🔍 Starting comprehensive debug for cluster ${this.clusterId}`);
    
    const debugReport = {
      cluster_id: this.clusterId,
      debug_timestamp: new Date().toISOString(),
      analysis: {
        system_health: await this.analyzeSystemHealth(),
        gpu_health: await this.analyzeGPUHealth(),
        application_health: await this.analyzeApplicationHealth(),
        network_health: await this.analyzeNetworkHealth(),
        resource_usage: await this.analyzeResourceUsage()
      },
      recommendations: []
    };
    
    // Generate recommendations based on findings
    debugReport.recommendations = this.generateRecommendations(debugReport.analysis);
    
    return debugReport;
  }
  
  async analyzeSystemHealth() {
    const systemLogs = await getClusterLogs(this.clusterId, {
      log_type: 'system',
      level: 'error',
      limit: 100,
      start_time: new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString() // Last 2 hours
    });
    
    const analysis = {
      error_count: 0,
      critical_issues: [],
      kernel_issues: [],
      service_issues: [],
      health_score: 100
    };
    
    if (systemLogs.success) {
      analysis.error_count = systemLogs.data.logs.length;
      
      systemLogs.data.logs.forEach(log => {
        const message = log.message.toLowerCase();
        
        if (message.includes('kernel') || message.includes('panic')) {
          analysis.kernel_issues.push(log);
          analysis.health_score -= 20;
        } else if (message.includes('service') || message.includes('daemon')) {
          analysis.service_issues.push(log);
          analysis.health_score -= 10;
        } else if (message.includes('critical') || message.includes('fatal')) {
          analysis.critical_issues.push(log);
          analysis.health_score -= 15;
        }
      });
    }
    
    analysis.health_score = Math.max(0, analysis.health_score);
    analysis.status = analysis.health_score > 80 ? 'healthy' : 
                     analysis.health_score > 50 ? 'degraded' : 'critical';
    
    return analysis;
  }
  
  async analyzeGPUHealth() {
    const gpuLogs = await getClusterLogs(this.clusterId, {
      log_type: 'gpu',
      limit: 200,
      start_time: new Date(Date.now() - 60 * 60 * 1000).toISOString() // Last hour
    });
    
    const analysis = {
      cuda_errors: [],
      memory_errors: [],
      temperature_warnings: [],
      driver_issues: [],
      health_score: 100
    };
    
    if (gpuLogs.success) {
      gpuLogs.data.logs.forEach(log => {
        const message = log.message.toLowerCase();
        
        if (message.includes('cuda') && message.includes('error')) {
          analysis.cuda_errors.push(log);
          analysis.health_score -= 15;
        } else if (message.includes('out of memory') || message.includes('oom')) {
          analysis.memory_errors.push(log);
          analysis.health_score -= 20;
        } else if (message.includes('temperature') && message.includes('high')) {
          analysis.temperature_warnings.push(log);
          analysis.health_score -= 10;
        } else if (message.includes('driver')) {
          analysis.driver_issues.push(log);
          analysis.health_score -= 10;
        }
      });
    }
    
    analysis.health_score = Math.max(0, analysis.health_score);
    analysis.status = analysis.health_score > 80 ? 'healthy' : 
                     analysis.health_score > 50 ? 'degraded' : 'critical';
    
    return analysis;
  }
  
  async analyzeApplicationHealth() {
    // Search for application errors and exceptions
    const searchResults = await this.searchLogs(
      'level:error AND source:application',
      {
        start: new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString(),
        end: new Date().toISOString()
      }
    );
    
    const analysis = {
      total_errors: 0,
      exception_types: {},
      frequent_errors: {},
      performance_issues: [],
      health_score: 100
    };
    
    if (searchResults.success) {
      analysis.total_errors = searchResults.data.total_matches;
      
      searchResults.data.logs.forEach(log => {
        const message = log.message;
        
        // Categorize exceptions
        const exceptionMatch = message.match(/(\w+Error|\w+Exception)/);
        if (exceptionMatch) {
          const exceptionType = exceptionMatch[1];
          analysis.exception_types[exceptionType] = 
            (analysis.exception_types[exceptionType] || 0) + 1;
        }
        
        // Look for performance issues
        if (message.includes('timeout') || message.includes('slow') || 
            message.includes('performance')) {
          analysis.performance_issues.push(log);
        }
        
        // Count frequent error patterns
        const errorPattern = message.substring(0, 50);
        analysis.frequent_errors[errorPattern] = 
          (analysis.frequent_errors[errorPattern] || 0) + 1;
      });
      
      // Adjust health score based on error frequency
      if (analysis.total_errors > 50) {
        analysis.health_score -= 30;
      } else if (analysis.total_errors > 20) {
        analysis.health_score -= 20;
      } else if (analysis.total_errors > 10) {
        analysis.health_score -= 10;
      }
    }
    
    analysis.health_score = Math.max(0, analysis.health_score);
    analysis.status = analysis.health_score > 80 ? 'healthy' : 
                     analysis.health_score > 50 ? 'degraded' : 'critical';
    
    return analysis;
  }
  
  async analyzeNetworkHealth() {
    const networkLogs = await getClusterLogs(this.clusterId, {
      search: 'network',
      level: 'error',
      limit: 50,
      start_time: new Date(Date.now() - 60 * 60 * 1000).toISOString()
    });
    
    const analysis = {
      connection_errors: [],
      timeout_errors: [],
      dns_issues: [],
      health_score: 100
    };
    
    if (networkLogs.success) {
      networkLogs.data.logs.forEach(log => {
        const message = log.message.toLowerCase();
        
        if (message.includes('connection') && message.includes('error')) {
          analysis.connection_errors.push(log);
          analysis.health_score -= 10;
        } else if (message.includes('timeout')) {
          analysis.timeout_errors.push(log);
          analysis.health_score -= 5;
        } else if (message.includes('dns') || message.includes('resolve')) {
          analysis.dns_issues.push(log);
          analysis.health_score -= 5;
        }
      });
    }
    
    analysis.health_score = Math.max(0, analysis.health_score);
    analysis.status = analysis.health_score > 90 ? 'healthy' : 
                     analysis.health_score > 70 ? 'degraded' : 'critical';
    
    return analysis;
  }
  
  async analyzeResourceUsage() {
    // Look for resource-related log messages
    const resourceLogs = await getClusterLogs(this.clusterId, {
      search: 'memory OR disk OR cpu',
      limit: 100,
      start_time: new Date(Date.now() - 30 * 60 * 1000).toISOString() // Last 30 minutes
    });
    
    const analysis = {
      memory_warnings: [],
      disk_warnings: [],
      cpu_warnings: [],
      resource_exhaustion: []
    };
    
    if (resourceLogs.success) {
      resourceLogs.data.logs.forEach(log => {
        const message = log.message.toLowerCase();
        
        if (message.includes('memory') && (message.includes('low') || message.includes('warning'))) {
          analysis.memory_warnings.push(log);
        } else if (message.includes('disk') && (message.includes('full') || message.includes('space'))) {
          analysis.disk_warnings.push(log);
        } else if (message.includes('cpu') && message.includes('high')) {
          analysis.cpu_warnings.push(log);
        } else if (message.includes('exhausted') || message.includes('limit')) {
          analysis.resource_exhaustion.push(log);
        }
      });
    }
    
    return analysis;
  }
  
  generateRecommendations(analysis) {
    const recommendations = [];
    
    // System health recommendations
    if (analysis.system_health.status === 'critical') {
      recommendations.push({
        priority: 'high',
        category: 'system',
        title: 'Critical System Issues Detected',
        description: `${analysis.system_health.critical_issues.length} critical system issues found`,
        action: 'Restart cluster or contact support immediately'
      });
    }
    
    // GPU health recommendations
    if (analysis.gpu_health.cuda_errors.length > 0) {
      recommendations.push({
        priority: 'high',
        category: 'gpu',
        title: 'CUDA Errors Detected',
        description: `${analysis.gpu_health.cuda_errors.length} CUDA errors found`,
        action: 'Check CUDA version compatibility and driver installation'
      });
    }
    
    if (analysis.gpu_health.memory_errors.length > 0) {
      recommendations.push({
        priority: 'medium',
        category: 'gpu',
        title: 'GPU Memory Issues',
        description: `${analysis.gpu_health.memory_errors.length} memory errors found`,
        action: 'Reduce batch size or model complexity, or upgrade to higher memory GPU'
      });
    }
    
    // Application health recommendations
    if (analysis.application_health.total_errors > 20) {
      recommendations.push({
        priority: 'medium',
        category: 'application',
        title: 'High Application Error Rate',
        description: `${analysis.application_health.total_errors} application errors in last 2 hours`,
        action: 'Review application logs and fix recurring issues'
      });
    }
    
    return recommendations;
  }
  
  async searchLogs(query, timeRange) {
    const response = await fetch(
      `https://api.tensorone.ai/v1/clusters/${this.clusterId}/logs/search`,
      {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          query: query,
          time_range: timeRange,
          limit: 100
        })
      }
    );
    
    return await response.json();
  }
  
  async generateDebugReport() {
    const report = await this.performComprehensiveDebug();
    
    console.log('\n=== Cluster Debug Report ===');
    console.log(`Cluster: ${report.cluster_id}`);
    console.log(`Analysis Time: ${report.debug_timestamp}`);
    
    // Overall health summary
    const healthScores = [
      report.analysis.system_health.health_score,
      report.analysis.gpu_health.health_score,
      report.analysis.application_health.health_score,
      report.analysis.network_health.health_score
    ];
    
    const overallHealth = healthScores.reduce((sum, score) => sum + score, 0) / healthScores.length;
    
    console.log(`\n📊 Overall Health Score: ${overallHealth.toFixed(1)}/100`);
    
    // Component health
    console.log('\n🔧 Component Health:');
    console.log(`  System: ${report.analysis.system_health.status} (${report.analysis.system_health.health_score}/100)`);
    console.log(`  GPU: ${report.analysis.gpu_health.status} (${report.analysis.gpu_health.health_score}/100)`);
    console.log(`  Application: ${report.analysis.application_health.status} (${report.analysis.application_health.health_score}/100)`);
    console.log(`  Network: ${report.analysis.network_health.status} (${report.analysis.network_health.health_score}/100)`);
    
    // Recommendations
    if (report.recommendations.length > 0) {
      console.log('\n💡 Recommendations:');
      report.recommendations.forEach((rec, index) => {
        console.log(`  ${index + 1}. [${rec.priority.toUpperCase()}] ${rec.title}`);
        console.log(`     ${rec.description}`);
        console.log(`     Action: ${rec.action}`);
      });
    } else {
      console.log('\n✅ No critical issues found');
    }
    
    return report;
  }
}

// Usage
// 'debugger' is a reserved word in JavaScript, so use a different variable name
const clusterDebugger = new ClusterDebugger('cluster_abc123');
await clusterDebugger.generateDebugReport();

Error Handling

Requests for a time range that falls outside the log retention window return an error response such as:

{
  "success": false,
  "error": {
    "code": "LOGS_NOT_AVAILABLE",
    "message": "Logs are not available for the specified time range",
    "details": {
      "requested_start": "2024-01-01T00:00:00Z",
      "requested_end": "2024-01-02T00:00:00Z",
      "available_start": "2024-01-14T00:00:00Z",
      "log_retention_days": 30
    }
  }
}
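
A minimal sketch of handling this error by clamping the requested window to the available retention period, building on the get_cluster_logs helper above (the field names come from the error payload; the retry behavior itself is an assumption, not something the API prescribes):

def get_logs_within_retention(cluster_id, start_time, end_time, **params):
    """Retry a logs request once, clamping start_time to the earliest available timestamp."""
    try:
        return get_cluster_logs(cluster_id, start_time=start_time, end_time=end_time, **params)
    except requests.HTTPError as exc:
        body = exc.response.json()
        error = body.get("error", {})
        if error.get("code") == "LOGS_NOT_AVAILABLE":
            clamped_start = error["details"]["available_start"]
            print(f"Requested range predates retention; clamping start to {clamped_start}")
            return get_cluster_logs(cluster_id, start_time=clamped_start, end_time=end_time, **params)
        raise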

Best Practices

  1. Log Level Management: Use appropriate log levels to reduce noise
  2. Time Range Optimization: Use specific time ranges for better performance
  3. Search Efficiency: Use structured queries for complex log analysis
  4. Regular Monitoring: Set up automated log monitoring for critical issues
  5. Log Retention: Understand log retention policies for your use case
  6. Export Strategy: Regularly export important logs for compliance and analysis

Authorizations

Authorization (string, header, required)

API key authentication. Use the 'Bearer YOUR_API_KEY' format.

Path Parameters

cluster_id (string, required)

Query Parameters

lines (integer, default: 100)

Number of log lines to retrieve. Required range: x <= 1000.

Response

200 - application/json

Cluster logs. The response is of type object.