Overview
Cluster Logs provide comprehensive logging for GPU clusters, covering system, application, error, and audit logs. They are essential for debugging issues, monitoring application behavior, and maintaining audit compliance.

Endpoints
Get Logs
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs
Stream Logs
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/stream
Search Logs
POST https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/search
Download Logs
GET https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/download
Get Logs
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `log_type` | string | No | Log type: `system`, `application`, `error`, `audit`, `docker`, `gpu` |
| `level` | string | No | Log level: `debug`, `info`, `warn`, `error`, `fatal` |
| `start_time` | string | No | Start time (ISO 8601 format) |
| `end_time` | string | No | End time (ISO 8601 format) |
| `limit` | integer | No | Maximum number of log entries (default: 100, max: 1000) |
| `offset` | integer | No | Number of entries to skip (for pagination) |
| `search` | string | No | Search term to filter logs |
| `source` | string | No | Log source: `container`, `host`, `gpu_driver`, `application` |
| `format` | string | No | Response format: `json` or `text` (default: `json`) |
| `tail` | boolean | No | Return the most recent entries first (default: `true`) |
Log Types
| Type | Description | Sources |
|---|---|---|
| `system` | System-level logs (kernel, services, drivers) | syslog, systemd, kernel |
| `application` | Application and user process logs | stdout, stderr, app logs |
| `error` | Error and exception logs | error handlers, crash reports |
| `audit` | Security and access audit logs | auth, API access, file access |
| `docker` | Container runtime logs | Docker daemon, container logs |
| `gpu` | GPU driver and CUDA logs | nvidia-smi, CUDA runtime |
Request Examples
```bash
# Get recent application logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?log_type=application&limit=50" \
-H "Authorization: Bearer YOUR_API_KEY"
# Get error logs from last hour
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?log_type=error&level=error&start_time=2024-01-15T15:00:00Z" \
-H "Authorization: Bearer YOUR_API_KEY"
# Search for specific terms in logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs?search=CUDA&log_type=gpu" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
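The same queries can be issued from Python. The sketch below assumes the `requests` package and an API key exported as `TENSORONE_API_KEY`; the `get_cluster_logs` helper name is illustrative rather than part of an official SDK, and the later use-case examples assume a helper of this shape.

```python
import os

import requests

API_BASE = "https://api.tensorone.ai/v1"
# Assumes the API key is exported in the environment (illustrative convention)
API_KEY = os.environ["TENSORONE_API_KEY"]


def get_cluster_logs(cluster_id, **params):
    """Fetch log entries for a cluster; keyword arguments map to the query parameters above."""
    response = requests.get(
        f"{API_BASE}/clusters/{cluster_id}/logs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params=params,
        timeout=30,
    )
    return response.json()


# Example: the 50 most recent application log entries
logs = get_cluster_logs("cluster_abc123", log_type="application", limit=50)
if logs.get("success"):
    for entry in logs["data"]["logs"]:
        print(entry["timestamp"], entry["level"], entry["message"])
```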
Log Search
Advanced log search supports complex queries and filtering.

```bash
# Search logs with complex query
curl -X POST "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "level:error AND (CUDA OR memory)",
"time_range": {
"start": "2024-01-15T00:00:00Z",
"end": "2024-01-15T23:59:59Z"
},
"limit": 100,
"sort": "timestamp desc"
  }'
```
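A minimal Python equivalent of the same search call, under the same assumptions as the helper above (the `search_cluster_logs` name is illustrative):

```python
import requests


def search_cluster_logs(cluster_id, query, start, end, limit=100, api_key="YOUR_API_KEY"):
    """POST a structured query to the log search endpoint and return the parsed response."""
    payload = {
        "query": query,
        "time_range": {"start": start, "end": end},
        "limit": limit,
        "sort": "timestamp desc",
    }
    response = requests.post(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}/logs/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=30,
    )
    return response.json()


# Example: error-level entries mentioning CUDA or memory over one day
results = search_cluster_logs(
    "cluster_abc123",
    "level:error AND (CUDA OR memory)",
    "2024-01-15T00:00:00Z",
    "2024-01-15T23:59:59Z",
)
```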
Response Schema
```json
{
"success": true,
"data": {
"cluster_id": "cluster_abc123",
"logs": [
{
"id": "log_entry_123",
"timestamp": "2024-01-15T16:45:23.456Z",
"level": "info",
"source": "application",
"log_type": "application",
"message": "Training step 1500: loss=0.234, accuracy=94.2%",
"process": {
"pid": 1234,
"name": "python",
"command": "python train.py --model gpt2"
},
"metadata": {
"host": "gpu-node-01",
"container_id": "abc123def456",
"thread_id": "MainThread",
"file": "train.py",
"line": 142
},
"tags": ["training", "metrics"]
},
{
"id": "log_entry_124",
"timestamp": "2024-01-15T16:45:20.123Z",
"level": "error",
"source": "gpu_driver",
"log_type": "gpu",
"message": "CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 79.20 GiB total capacity)",
"process": {
"pid": 1234,
"name": "python"
},
"metadata": {
"host": "gpu-node-01",
"gpu_id": 0,
"cuda_version": "12.1",
"driver_version": "530.30.02"
},
"tags": ["cuda", "memory", "error"],
"stack_trace": [
"File \"train.py\", line 89, in forward",
"return self.model(inputs)",
"RuntimeError: CUDA out of memory"
]
}
],
"pagination": {
"total_count": 15420,
"current_page": 1,
"total_pages": 155,
"has_next": true,
"has_previous": false
},
"query_info": {
"log_type": "application",
"time_range": {
"start": "2024-01-15T16:00:00Z",
"end": "2024-01-15T17:00:00Z"
},
"filters_applied": ["log_type", "time_range"],
"search_terms": null
}
},
"meta": {
"request_id": "req_logs_456",
"query_time_ms": 127,
"log_retention_days": 30
}
}
```
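Result sets larger than one page can be walked with `limit` and `offset`, stopping when `pagination.has_next` is false. A minimal sketch, reusing the hypothetical `get_cluster_logs` helper from the request examples:

```python
def iter_cluster_logs(cluster_id, page_size=500, **filters):
    """Yield log entries page by page using limit/offset pagination."""
    offset = 0
    while True:
        page = get_cluster_logs(cluster_id, limit=page_size, offset=offset, **filters)
        if not page.get("success"):
            break
        data = page["data"]
        for entry in data["logs"]:
            yield entry
        # Stop once the API reports no further pages
        if not data["pagination"]["has_next"]:
            break
        offset += page_size


# Example: count error-level entries for one cluster
error_count = sum(1 for _ in iter_cluster_logs("cluster_abc123", log_type="error"))
print(f"Error entries: {error_count}")
```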
Download Logs
Export logs for offline analysis or archival.

```bash
# Download logs as compressed archive
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/download?format=json&compression=gzip&start_time=2024-01-15T00:00:00Z&end_time=2024-01-15T23:59:59Z" \
-H "Authorization: Bearer YOUR_API_KEY" \
  --output cluster_logs.json.gz
```
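The same download can be scripted in Python and unpacked with the standard `gzip` module. This is a sketch using the query parameters from the curl call above; it assumes the archive decompresses to a single JSON document, which may not match the actual export layout.

```python
import gzip
import json

import requests

params = {
    "format": "json",
    "compression": "gzip",
    "start_time": "2024-01-15T00:00:00Z",
    "end_time": "2024-01-15T23:59:59Z",
}

# Stream the compressed archive to disk
with requests.get(
    "https://api.tensorone.ai/v1/clusters/cluster_abc123/logs/download",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params=params,
    stream=True,
    timeout=120,
) as response:
    response.raise_for_status()
    with open("cluster_logs.json.gz", "wb") as fh:
        for chunk in response.iter_content(chunk_size=1 << 20):
            fh.write(chunk)

# Decompress and parse locally (assumes a single JSON document inside the archive)
with gzip.open("cluster_logs.json.gz", "rt") as fh:
    archived_logs = json.load(fh)
```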
Use Cases
Training Progress Monitoring
Parse training metrics out of application logs to track progress and surface problems. The example assumes a `get_cluster_logs` helper like the sketch shown with the request examples above.

```python
import json
import re
import statistics
import threading
import time
from datetime import datetime


class TrainingProgressMonitor:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.progress_history = []
        self.monitoring = False
        self.alert_thresholds = {
            "loss_stagnation_steps": 100,
            "error_rate_threshold": 0.1,
            "memory_usage_threshold": 0.95
        }

    def monitor_training_session(self, duration_minutes=60, check_interval=30):
        """Monitor a training session with periodic checks in a background thread"""
        self.monitoring = True
        end_time = time.time() + (duration_minutes * 60)

        def monitoring_loop():
            while self.monitoring and time.time() < end_time:
                try:
                    self.check_training_progress()
                    time.sleep(check_interval)
                except Exception as e:
                    print(f"Monitoring error: {e}")
                    time.sleep(check_interval)

        monitor_thread = threading.Thread(target=monitoring_loop)
        monitor_thread.daemon = True
        monitor_thread.start()
        return monitor_thread

    def check_training_progress(self):
        """Check current training progress from logs"""
        # Get recent application logs mentioning "loss"
        logs = get_cluster_logs(
            self.cluster_id,
            log_type="application",
            limit=50,
            search="loss"
        )
        if not logs["success"]:
            return

        current_progress = {
            "timestamp": time.time(),
            "loss_values": [],
            "accuracy_values": [],
            "error_count": 0,
            "memory_warnings": 0
        }
        for log_entry in logs["data"]["logs"]:
            message = log_entry["message"]
            # Extract loss values
            loss_match = re.search(r'loss[:\s=]+([0-9.]+)', message, re.IGNORECASE)
            if loss_match:
                current_progress["loss_values"].append(float(loss_match.group(1)))
            # Extract accuracy values
            acc_match = re.search(r'acc(?:uracy)?[:\s=]+([0-9.]+)', message, re.IGNORECASE)
            if acc_match:
                current_progress["accuracy_values"].append(float(acc_match.group(1)))
            # Count errors
            if log_entry["level"] in ["error", "fatal"]:
                current_progress["error_count"] += 1
            # Check for memory warnings ("warn" is the documented level name)
            if "memory" in message.lower() and log_entry["level"] == "warn":
                current_progress["memory_warnings"] += 1

        self.progress_history.append(current_progress)
        # Check for alerts
        self.check_training_alerts(current_progress)
        # Print progress summary
        if current_progress["loss_values"]:
            avg_loss = sum(current_progress["loss_values"]) / len(current_progress["loss_values"])
            print(f"[{datetime.now().strftime('%H:%M:%S')}] Avg Loss: {avg_loss:.4f}, "
                  f"Errors: {current_progress['error_count']}")

    def check_training_alerts(self, current_progress):
        """Check for training issues and generate alerts"""
        alerts = []
        # Check loss stagnation across the last few progress checks
        if len(self.progress_history) >= 3:
            recent_losses = []
            for progress in self.progress_history[-3:]:
                if progress["loss_values"]:
                    recent_losses.extend(progress["loss_values"])
            if len(recent_losses) >= 5:
                loss_variance = statistics.variance(recent_losses)
                if loss_variance < 0.0001:  # Very low variance indicates stagnation
                    alerts.append("Loss appears to be stagnating")
        # Check error rate (assumes roughly 50 log entries per check)
        total_errors = sum(p["error_count"] for p in self.progress_history[-5:])
        if total_errors > self.alert_thresholds["error_rate_threshold"] * 50:
            alerts.append(f"High error rate detected: {total_errors} errors in recent logs")
        # Check memory warnings
        if current_progress["memory_warnings"] > 0:
            alerts.append(f"Memory warnings detected: {current_progress['memory_warnings']}")
        if alerts:
            print("🚨 TRAINING ALERTS:")
            for alert in alerts:
                print(f"  • {alert}")

    def stop_monitoring(self):
        """Stop the monitoring process"""
        self.monitoring = False

    def get_training_summary(self):
        """Get a comprehensive training summary"""
        if not self.progress_history:
            return {"error": "No progress data available"}

        all_losses = []
        all_accuracies = []
        total_errors = 0
        for progress in self.progress_history:
            # Log entries arrive newest-first, so reverse them into chronological order
            all_losses.extend(reversed(progress["loss_values"]))
            all_accuracies.extend(reversed(progress["accuracy_values"]))
            total_errors += progress["error_count"]

        summary = {
            "monitoring_duration_minutes": (time.time() - self.progress_history[0]["timestamp"]) / 60,
            "total_progress_checks": len(self.progress_history),
            "loss_analysis": {},
            "accuracy_analysis": {},
            "error_analysis": {
                "total_errors": total_errors,
                "error_rate": total_errors / len(self.progress_history)
            }
        }
        if all_losses:
            summary["loss_analysis"] = {
                "first_loss": all_losses[0],
                "last_loss": all_losses[-1],
                "min_loss": min(all_losses),
                "avg_loss": sum(all_losses) / len(all_losses),
                "loss_improvement": all_losses[0] - all_losses[-1],
                "loss_trend": "improving" if all_losses[-1] < all_losses[0] else "degrading"
            }
        if all_accuracies:
            summary["accuracy_analysis"] = {
                "first_accuracy": all_accuracies[0],
                "last_accuracy": all_accuracies[-1],
                "max_accuracy": max(all_accuracies),
                "avg_accuracy": sum(all_accuracies) / len(all_accuracies),
                "accuracy_improvement": all_accuracies[-1] - all_accuracies[0]
            }
        return summary


# Usage
monitor = TrainingProgressMonitor("cluster_training_001")
monitor_thread = monitor.monitor_training_session(duration_minutes=120, check_interval=60)

# Wait for the monitoring window to finish, then summarize
monitor_thread.join()
summary = monitor.get_training_summary()
print("Training Summary:", json.dumps(summary, indent=2))
```
Error Analysis and Debugging
Comprehensive error analysis for troubleshooting cluster issues. The example assumes a `getClusterLogs(clusterId, params)` helper that wraps the GET logs endpoint and returns the parsed JSON response (analogous to the Python helper sketched earlier).

```javascript
class ClusterDebugger {
constructor(clusterId) {
this.clusterId = clusterId;
}
async performComprehensiveDebug() {
console.log(`🔍 Starting comprehensive debug for cluster ${this.clusterId}`);
const debugReport = {
cluster_id: this.clusterId,
debug_timestamp: new Date().toISOString(),
analysis: {
system_health: await this.analyzeSystemHealth(),
gpu_health: await this.analyzeGPUHealth(),
application_health: await this.analyzeApplicationHealth(),
network_health: await this.analyzeNetworkHealth(),
resource_usage: await this.analyzeResourceUsage()
},
recommendations: []
};
// Generate recommendations based on findings
debugReport.recommendations = this.generateRecommendations(debugReport.analysis);
return debugReport;
}
async analyzeSystemHealth() {
const systemLogs = await getClusterLogs(this.clusterId, {
log_type: 'system',
level: 'error',
limit: 100,
start_time: new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString() // Last 2 hours
});
const analysis = {
error_count: 0,
critical_issues: [],
kernel_issues: [],
service_issues: [],
health_score: 100
};
if (systemLogs.success) {
analysis.error_count = systemLogs.data.logs.length;
systemLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('kernel') || message.includes('panic')) {
analysis.kernel_issues.push(log);
analysis.health_score -= 20;
} else if (message.includes('service') || message.includes('daemon')) {
analysis.service_issues.push(log);
analysis.health_score -= 10;
} else if (message.includes('critical') || message.includes('fatal')) {
analysis.critical_issues.push(log);
analysis.health_score -= 15;
}
});
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 80 ? 'healthy' :
analysis.health_score > 50 ? 'degraded' : 'critical';
return analysis;
}
async analyzeGPUHealth() {
const gpuLogs = await getClusterLogs(this.clusterId, {
log_type: 'gpu',
limit: 200,
start_time: new Date(Date.now() - 60 * 60 * 1000).toISOString() // Last hour
});
const analysis = {
cuda_errors: [],
memory_errors: [],
temperature_warnings: [],
driver_issues: [],
health_score: 100
};
if (gpuLogs.success) {
gpuLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('cuda') && message.includes('error')) {
analysis.cuda_errors.push(log);
analysis.health_score -= 15;
} else if (message.includes('out of memory') || message.includes('oom')) {
analysis.memory_errors.push(log);
analysis.health_score -= 20;
} else if (message.includes('temperature') && message.includes('high')) {
analysis.temperature_warnings.push(log);
analysis.health_score -= 10;
} else if (message.includes('driver')) {
analysis.driver_issues.push(log);
analysis.health_score -= 10;
}
});
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 80 ? 'healthy' :
analysis.health_score > 50 ? 'degraded' : 'critical';
return analysis;
}
async analyzeApplicationHealth() {
// Search for application errors and exceptions
const searchResults = await this.searchLogs(
'level:error AND source:application',
{
start: new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString(),
end: new Date().toISOString()
}
);
const analysis = {
total_errors: 0,
exception_types: {},
frequent_errors: {},
performance_issues: [],
health_score: 100
};
if (searchResults.success) {
analysis.total_errors = searchResults.data.total_matches;
searchResults.data.logs.forEach(log => {
const message = log.message;
// Categorize exceptions
const exceptionMatch = message.match(/(\w+Error|\w+Exception)/);
if (exceptionMatch) {
const exceptionType = exceptionMatch[1];
analysis.exception_types[exceptionType] =
(analysis.exception_types[exceptionType] || 0) + 1;
}
// Look for performance issues
if (message.includes('timeout') || message.includes('slow') ||
message.includes('performance')) {
analysis.performance_issues.push(log);
}
// Count frequent error patterns
const errorPattern = message.substring(0, 50);
analysis.frequent_errors[errorPattern] =
(analysis.frequent_errors[errorPattern] || 0) + 1;
});
// Adjust health score based on error frequency
if (analysis.total_errors > 50) {
analysis.health_score -= 30;
} else if (analysis.total_errors > 20) {
analysis.health_score -= 20;
} else if (analysis.total_errors > 10) {
analysis.health_score -= 10;
}
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 80 ? 'healthy' :
analysis.health_score > 50 ? 'degraded' : 'critical';
return analysis;
}
async analyzeNetworkHealth() {
const networkLogs = await getClusterLogs(this.clusterId, {
search: 'network',
level: 'error',
limit: 50,
start_time: new Date(Date.now() - 60 * 60 * 1000).toISOString()
});
const analysis = {
connection_errors: [],
timeout_errors: [],
dns_issues: [],
health_score: 100
};
if (networkLogs.success) {
networkLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('connection') && message.includes('error')) {
analysis.connection_errors.push(log);
analysis.health_score -= 10;
} else if (message.includes('timeout')) {
analysis.timeout_errors.push(log);
analysis.health_score -= 5;
} else if (message.includes('dns') || message.includes('resolve')) {
analysis.dns_issues.push(log);
analysis.health_score -= 5;
}
});
}
analysis.health_score = Math.max(0, analysis.health_score);
analysis.status = analysis.health_score > 90 ? 'healthy' :
analysis.health_score > 70 ? 'degraded' : 'critical';
return analysis;
}
async analyzeResourceUsage() {
// Look for resource-related log messages
const resourceLogs = await getClusterLogs(this.clusterId, {
search: 'memory OR disk OR cpu',
limit: 100,
start_time: new Date(Date.now() - 30 * 60 * 1000).toISOString() // Last 30 minutes
});
const analysis = {
memory_warnings: [],
disk_warnings: [],
cpu_warnings: [],
resource_exhaustion: []
};
if (resourceLogs.success) {
resourceLogs.data.logs.forEach(log => {
const message = log.message.toLowerCase();
if (message.includes('memory') && (message.includes('low') || message.includes('warning'))) {
analysis.memory_warnings.push(log);
} else if (message.includes('disk') && (message.includes('full') || message.includes('space'))) {
analysis.disk_warnings.push(log);
} else if (message.includes('cpu') && message.includes('high')) {
analysis.cpu_warnings.push(log);
} else if (message.includes('exhausted') || message.includes('limit')) {
analysis.resource_exhaustion.push(log);
}
});
}
return analysis;
}
generateRecommendations(analysis) {
const recommendations = [];
// System health recommendations
if (analysis.system_health.status === 'critical') {
recommendations.push({
priority: 'high',
category: 'system',
title: 'Critical System Issues Detected',
description: `${analysis.system_health.critical_issues.length} critical system issues found`,
action: 'Restart cluster or contact support immediately'
});
}
// GPU health recommendations
if (analysis.gpu_health.cuda_errors.length > 0) {
recommendations.push({
priority: 'high',
category: 'gpu',
title: 'CUDA Errors Detected',
description: `${analysis.gpu_health.cuda_errors.length} CUDA errors found`,
action: 'Check CUDA version compatibility and driver installation'
});
}
if (analysis.gpu_health.memory_errors.length > 0) {
recommendations.push({
priority: 'medium',
category: 'gpu',
title: 'GPU Memory Issues',
description: `${analysis.gpu_health.memory_errors.length} memory errors found`,
action: 'Reduce batch size or model complexity, or upgrade to higher memory GPU'
});
}
// Application health recommendations
if (analysis.application_health.total_errors > 20) {
recommendations.push({
priority: 'medium',
category: 'application',
title: 'High Application Error Rate',
description: `${analysis.application_health.total_errors} application errors in last 2 hours`,
action: 'Review application logs and fix recurring issues'
});
}
return recommendations;
}
async searchLogs(query, timeRange) {
const response = await fetch(
`https://api.tensorone.ai/v1/clusters/${this.clusterId}/logs/search`,
{
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
query: query,
time_range: timeRange,
limit: 100
})
}
);
return await response.json();
}
async generateDebugReport() {
const report = await this.performComprehensiveDebug();
console.log('\n=== Cluster Debug Report ===');
console.log(`Cluster: ${report.cluster_id}`);
console.log(`Analysis Time: ${report.debug_timestamp}`);
// Overall health summary
const healthScores = [
report.analysis.system_health.health_score,
report.analysis.gpu_health.health_score,
report.analysis.application_health.health_score,
report.analysis.network_health.health_score
];
const overallHealth = healthScores.reduce((sum, score) => sum + score, 0) / healthScores.length;
console.log(`\n📊 Overall Health Score: ${overallHealth.toFixed(1)}/100`);
// Component health
console.log('\n🔧 Component Health:');
console.log(` System: ${report.analysis.system_health.status} (${report.analysis.system_health.health_score}/100)`);
console.log(` GPU: ${report.analysis.gpu_health.status} (${report.analysis.gpu_health.health_score}/100)`);
console.log(` Application: ${report.analysis.application_health.status} (${report.analysis.application_health.health_score}/100)`);
console.log(` Network: ${report.analysis.network_health.status} (${report.analysis.network_health.health_score}/100)`);
// Recommendations
if (report.recommendations.length > 0) {
console.log('\n💡 Recommendations:');
report.recommendations.forEach((rec, index) => {
console.log(` ${index + 1}. [${rec.priority.toUpperCase()}] ${rec.title}`);
console.log(` ${rec.description}`);
console.log(` Action: ${rec.action}`);
});
} else {
console.log('\n✅ No critical issues found');
}
return report;
}
}
// Usage ("debugger" is a reserved word in JavaScript, so use a different variable name)
const clusterDebugger = new ClusterDebugger('cluster_abc123');
await clusterDebugger.generateDebugReport();
```
Error Handling
```json
{
"success": false,
"error": {
"code": "LOGS_NOT_AVAILABLE",
"message": "Logs are not available for the specified time range",
"details": {
"requested_start": "2024-01-01T00:00:00Z",
"requested_end": "2024-01-02T00:00:00Z",
"available_start": "2024-01-14T00:00:00Z",
"log_retention_days": 30
}
}
}
```
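Clients should check the `success` flag and branch on `error.code`. The sketch below handles `LOGS_NOT_AVAILABLE` by clamping the requested window to the `available_start` reported in `details`; it reuses the hypothetical `get_cluster_logs` helper from the request examples.

```python
def get_logs_within_retention(cluster_id, start_time, end_time, **filters):
    """Retry a logs query with the start clamped to the available retention window."""
    result = get_cluster_logs(cluster_id, start_time=start_time, end_time=end_time, **filters)
    if result.get("success"):
        return result

    error = result.get("error", {})
    if error.get("code") == "LOGS_NOT_AVAILABLE":
        available_start = error.get("details", {}).get("available_start")
        if available_start:
            print(f"Requested range outside retention; retrying from {available_start}")
            return get_cluster_logs(
                cluster_id, start_time=available_start, end_time=end_time, **filters
            )
    return result
```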
Best Practices
- Log Level Management: Use appropriate log levels and level filters to reduce noise
- Time Range Optimization: Query specific time ranges for faster responses
- Search Efficiency: Use structured search queries for complex log analysis
- Regular Monitoring: Set up automated log monitoring for critical issues (a polling sketch follows this list)
- Log Retention: Understand the log retention policy (30 days in the examples above) before relying on historical queries
- Export Strategy: Export important logs regularly for compliance and offline analysis
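As a starting point for the monitoring practice above, the sketch below polls for new error-level entries on a fixed interval; the interval, the print-based alerting, and the hypothetical `get_cluster_logs` helper are all assumptions to adapt.

```python
import datetime
import time


def poll_for_errors(cluster_id, interval_seconds=300):
    """Poll the logs endpoint for new error-level entries and print a simple alert."""
    last_check = datetime.datetime.now(datetime.timezone.utc)
    while True:
        start = last_check.strftime("%Y-%m-%dT%H:%M:%SZ")
        last_check = datetime.datetime.now(datetime.timezone.utc)
        result = get_cluster_logs(cluster_id, level="error", start_time=start, limit=200)
        if result.get("success") and result["data"]["logs"]:
            print(f"{len(result['data']['logs'])} new error entries since {start}")
        time.sleep(interval_seconds)
```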
Authorizations
API key authentication. Send your key in the Authorization header using the Bearer YOUR_API_KEY format.