Track comprehensive performance metrics for GPU clusters, serverless endpoints, and platform services. These metrics support optimization, capacity planning, and day-to-day health monitoring.

Request Parameters

resource
string
default:"all"
Resource type to monitor:
  • all - All resources and services
  • clusters - GPU clusters only
  • endpoints - Serverless endpoints only
  • training - Training jobs and services
  • ai-services - AI generation services
  • storage - Storage systems
  • network - Network infrastructure
timeRange
string
default:"1h"
Time range for metrics:
  • 5m - Last 5 minutes
  • 15m - Last 15 minutes
  • 1h - Last hour
  • 6h - Last 6 hours
  • 24h - Last 24 hours
  • 7d - Last 7 days
  • 30d - Last 30 days
granularity
string
default:"1m"
Data point granularity:
  • 10s - 10-second intervals
  • 1m - 1-minute intervals
  • 5m - 5-minute intervals
  • 15m - 15-minute intervals
  • 1h - 1-hour intervals
  • 1d - Daily aggregation
metrics
array
Specific metrics to include:
  • cpu_utilization - CPU usage percentage
  • memory_utilization - Memory usage percentage
  • gpu_utilization - GPU usage percentage
  • disk_io - Disk I/O operations and throughput
  • network_io - Network I/O operations and throughput
  • response_time - API response times
  • throughput - Request throughput
  • error_rate - Error rates and failure counts
  • queue_depth - Job queue depths
resourceIds
array
Specific resource IDs to monitor (cluster IDs, endpoint IDs, etc.)
regions
array
Specific regions to include:
  • us-east-1 - US East (Virginia)
  • us-west-2 - US West (Oregon)
  • eu-west-1 - Europe (Ireland)
  • ap-southeast-1 - Asia Pacific (Singapore)
aggregation
string
default:"avg"
Aggregation method for data points:
  • avg - Average values
  • max - Maximum values
  • min - Minimum values
  • sum - Sum of values
  • p95 - 95th percentile
  • p99 - 99th percentile

Response

timeRange
object
Time range information
overall
object
Overall platform performance summary
resources
array
Performance metrics by resource type
aggregatedMetrics
object
Platform-wide aggregated metrics
alerts
array
Performance-related alerts and anomalies

Example

cURL
curl -X GET "https://api.tensorone.ai/v2/monitoring/performance" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -G \
  -d "resource=clusters" \
  -d "timeRange=6h" \
  -d "granularity=15m" \
  -d "metrics[]=cpu_utilization" \
  -d "metrics[]=gpu_utilization" \
  -d "metrics[]=memory_utilization" \
  -d "aggregation=avg"
Python
import requests
import matplotlib.pyplot as plt
from datetime import datetime

def get_performance_metrics(resource="all", time_range="1h", metrics=None, granularity="1m"):
    params = {
        'resource': resource,
        'timeRange': time_range,
        'granularity': granularity,
        'aggregation': 'avg'
    }
    
    if metrics:
        params['metrics[]'] = metrics  # requests repeats the key for each list item
    
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/performance",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params=params
    )
    
    return response.json()

# Get comprehensive performance data
performance_data = get_performance_metrics(
    resource="clusters",
    time_range="6h",
    metrics=["cpu_utilization", "gpu_utilization", "memory_utilization", "response_time"]
)

print("Overall Performance:")
overall = performance_data['overall']
print(f"Health Score: {overall['healthScore']}/100")
print(f"Performance Score: {overall['performanceScore']}/100")
print(f"Average Response Time: {overall['averageResponseTime']}ms")
print(f"Error Rate: {overall['errorRate']}%")

# Analyze resource performance
print("\nTop Resource Performance Issues:")
for resource in performance_data['resources']:
    metrics = resource['metrics']
    if metrics.get('cpu', {}).get('current', 0) > 80:
        print(f"HIGH CPU: {resource['resourceName']} - {metrics['cpu']['current']}%")
    if metrics.get('memory', {}).get('current', 0) > 85:
        print(f"HIGH MEMORY: {resource['resourceName']} - {metrics['memory']['current']}%")
    if metrics.get('gpu', {}).get('current', 0) > 95:
        print(f"HIGH GPU: {resource['resourceName']} - {metrics['gpu']['current']}%")

# Check for performance alerts
if performance_data['alerts']:
    print(f"\nActive Performance Alerts ({len(performance_data['alerts'])}):")
    for alert in performance_data['alerts']:
        print(f"  {alert['severity'].upper()}: {alert['metric']} on {alert['resource']}")
        print(f"    Current: {alert['currentValue']}, Threshold: {alert['threshold']}")

# Plot performance trends
def plot_performance_trends(data):
    """Plot CPU and GPU utilization trends"""
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    
    for resource in data['resources'][:3]:  # Plot top 3 resources
        time_series = resource['timeSeries']
        timestamps = [datetime.fromisoformat(point['timestamp'].replace('Z', '+00:00')) for point in time_series]
        cpu_values = [point['values'].get('cpu_utilization', 0) for point in time_series]
        gpu_values = [point['values'].get('gpu_utilization', 0) for point in time_series]
        
        ax1.plot(timestamps, cpu_values, label=f"{resource['resourceName']} CPU")
        ax2.plot(timestamps, gpu_values, label=f"{resource['resourceName']} GPU")
    
    ax1.set_title('CPU Utilization Over Time')
    ax1.set_ylabel('CPU Usage (%)')
    ax1.legend()
    ax1.grid(True)
    
    ax2.set_title('GPU Utilization Over Time')
    ax2.set_ylabel('GPU Usage (%)')
    ax2.set_xlabel('Time')
    ax2.legend()
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()

# Uncomment to plot trends
# plot_performance_trends(performance_data)
JavaScript
const getPerformanceMetrics = async (options = {}) => {
  const params = new URLSearchParams({
    resource: options.resource || 'all',
    timeRange: options.timeRange || '1h',
    granularity: options.granularity || '1m',
    aggregation: options.aggregation || 'avg'
  });

  if (options.metrics) {
    options.metrics.forEach(metric => params.append('metrics[]', metric));
  }

  const response = await fetch(`https://api.tensorone.ai/v2/monitoring/performance?${params}`, {
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY'
    }
  });

  return await response.json();
};

// Monitor cluster performance
getPerformanceMetrics({
  resource: 'clusters',
  timeRange: '6h',
  metrics: ['cpu_utilization', 'gpu_utilization', 'memory_utilization'],
  granularity: '15m'
}).then(data => {
  console.log('Overall Performance:', data.overall);
  
  // Find performance bottlenecks
  data.resources.forEach(resource => {
    const metrics = resource.metrics;
    const issues = [];
    
    if (metrics.cpu?.current > 80) issues.push(`High CPU: ${metrics.cpu.current}%`);
    if (metrics.memory?.current > 85) issues.push(`High Memory: ${metrics.memory.current}%`);
    if (metrics.gpu?.current > 95) issues.push(`High GPU: ${metrics.gpu.current}%`);
    
    if (issues.length > 0) {
      console.log(`${resource.resourceName} Issues:`, issues);
    }
  });
  
  // Check alerts
  if (data.alerts.length > 0) {
    console.log('Performance Alerts:', data.alerts);
  }
});
Response
{
  "timeRange": {
    "start": "2024-01-16T12:00:00Z",
    "end": "2024-01-16T18:00:00Z",
    "granularity": "15m"
  },
  "overall": {
    "healthScore": 87,
    "performanceScore": 82,
    "availabilityScore": 99.2,
    "averageResponseTime": 245,
    "totalThroughput": 1847.3,
    "errorRate": 0.8
  },
  "resources": [
    {
      "resourceType": "cluster",
      "resourceId": "cluster-gpu-a100-001",
      "resourceName": "GPU Cluster A100-001",
      "region": "us-east-1",
      "metrics": {
        "cpu": {
          "current": 78.5,
          "average": 72.3,
          "peak": 94.2,
          "trend": "increasing"
        },
        "memory": {
          "current": 84.2,
          "average": 79.8,
          "peak": 91.5,
          "trend": "stable"
        },
        "gpu": {
          "current": 92.8,
          "average": 88.4,
          "peak": 98.1,
          "trend": "increasing"
        },
        "network": {
          "inbound": "2.3 Gbps",
          "outbound": "4.7 Gbps",
          "latency": 12.3
        },
        "storage": {
          "readThroughput": "850 MB/s",
          "writeThroughput": "420 MB/s",
          "iops": 12500
        }
      },
      "timeSeries": [
        {
          "timestamp": "2024-01-16T17:45:00Z",
          "values": {
            "cpu_utilization": 78.5,
            "memory_utilization": 84.2,
            "gpu_utilization": 92.8,
            "response_time": 234
          }
        }
      ]
    }
  ],
  "aggregatedMetrics": {
    "cpuUtilization": {
      "average": 68.4,
      "p95": 87.2,
      "p99": 94.8,
      "peakTime": "2024-01-16T15:30:00Z"
    },
    "memoryUtilization": {
      "average": 71.8,
      "p95": 89.3,
      "p99": 95.1,
      "peakTime": "2024-01-16T16:15:00Z"
    },
    "gpuUtilization": {
      "average": 83.2,
      "p95": 96.7,
      "p99": 98.9,
      "peakTime": "2024-01-16T17:00:00Z"
    },
    "networkThroughput": {
      "totalInbound": "45.8 Gbps",
      "totalOutbound": "78.2 Gbps",
      "averageLatency": 15.7
    },
    "storageThroughput": {
      "totalRead": "12.4 GB/s",
      "totalWrite": "6.8 GB/s",
      "averageIOPS": 89500
    },
    "apiMetrics": {
      "averageResponseTime": 245,
      "throughput": 1847.3,
      "successRate": 99.2,
      "errorRate": 0.8
    }
  },
  "alerts": [
    {
      "alertId": "perf-alert-001",
      "severity": "medium",
      "metric": "gpu_utilization",
      "threshold": 90,
      "currentValue": 92.8,
      "resource": "cluster-gpu-a100-001",
      "timestamp": "2024-01-16T17:45:00Z"
    }
  ]
}
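A response in this shape can be scanned for threshold breaches in a few lines. The sketch below uses the `gpu_utilization` alert threshold from the example (90%) plus illustrative CPU and memory limits; none of these are platform defaults:

```python
# Illustrative helper: walk a parsed response and flag resources whose
# "current" utilization exceeds a per-metric threshold.
def find_breaches(data, thresholds=None):
    thresholds = thresholds or {"cpu": 80, "memory": 85, "gpu": 90}
    breaches = []
    for resource in data.get("resources", []):
        for metric, limit in thresholds.items():
            current = resource["metrics"].get(metric, {}).get("current", 0)
            if current > limit:
                breaches.append((resource["resourceId"], metric, current))
    return breaches

# Trimmed-down version of the example response above
sample = {"resources": [{"resourceId": "cluster-gpu-a100-001",
                         "metrics": {"cpu": {"current": 78.5},
                                     "memory": {"current": 84.2},
                                     "gpu": {"current": 92.8}}}]}
print(find_breaches(sample))  # [('cluster-gpu-a100-001', 'gpu', 92.8)]
```

With these limits only the GPU breaches, matching the `perf-alert-001` alert in the example response.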

Advanced Monitoring

Real-time Performance Dashboard

Create a real-time monitoring dashboard:
Python
import time
import threading
import requests
from collections import deque
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

class PerformanceMonitor:
    def __init__(self, api_key, update_interval=30):
        self.api_key = api_key
        self.update_interval = update_interval
        self.running = False
        self.data_history = {
            'timestamps': deque(maxlen=100),
            'cpu': deque(maxlen=100),
            'memory': deque(maxlen=100),
            'gpu': deque(maxlen=100),
            'response_time': deque(maxlen=100)
        }
    
    def fetch_metrics(self):
        """Fetch current performance metrics"""
        response = requests.get(
            "https://api.tensorone.ai/v2/monitoring/performance",
            headers={"Authorization": f"Bearer {self.api_key}"},
            params={
                "resource": "all",
                "timeRange": "5m",
                "granularity": "1m",
                "aggregation": "avg"
            }
        )
        return response.json()
    
    def update_data(self):
        """Update data history with latest metrics"""
        try:
            metrics = self.fetch_metrics()
            now = time.time()
            
            self.data_history['timestamps'].append(now)
            self.data_history['cpu'].append(metrics['aggregatedMetrics']['cpuUtilization']['average'])
            self.data_history['memory'].append(metrics['aggregatedMetrics']['memoryUtilization']['average'])
            self.data_history['gpu'].append(metrics['aggregatedMetrics']['gpuUtilization']['average'])
            self.data_history['response_time'].append(metrics['overall']['averageResponseTime'])
            
            # Print current status
            overall = metrics['overall']
            print(f"[{time.strftime('%H:%M:%S')}] Health: {overall['healthScore']}/100, "
                  f"CPU: {self.data_history['cpu'][-1]:.1f}%, "
                  f"Memory: {self.data_history['memory'][-1]:.1f}%, "
                  f"GPU: {self.data_history['gpu'][-1]:.1f}%")
            
            # Check for alerts
            if metrics['alerts']:
                for alert in metrics['alerts']:
                    print(f"⚠️  ALERT: {alert['metric']} on {alert['resource']} - {alert['currentValue']}")
                    
        except Exception as e:
            print(f"Error fetching metrics: {e}")
    
    def start_monitoring(self):
        """Start continuous monitoring"""
        self.running = True
        print("Starting performance monitoring...")
        
        while self.running:
            self.update_data()
            time.sleep(self.update_interval)
    
    def stop_monitoring(self):
        """Stop monitoring"""
        self.running = False
        print("Monitoring stopped")
    
    def create_dashboard(self):
        """Create real-time dashboard"""
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
        
        def animate(frame):
            if len(self.data_history['timestamps']) > 1:
                times = list(self.data_history['timestamps'])
                
                # Clear and plot
                ax1.clear()
                ax1.plot(times, list(self.data_history['cpu']), 'b-', label='CPU')
                ax1.set_title('CPU Utilization (%)')
                ax1.set_ylim(0, 100)
                
                ax2.clear()
                ax2.plot(times, list(self.data_history['memory']), 'g-', label='Memory')
                ax2.set_title('Memory Utilization (%)')
                ax2.set_ylim(0, 100)
                
                ax3.clear()
                ax3.plot(times, list(self.data_history['gpu']), 'r-', label='GPU')
                ax3.set_title('GPU Utilization (%)')
                ax3.set_ylim(0, 100)
                
                ax4.clear()
                ax4.plot(times, list(self.data_history['response_time']), 'm-', label='Response Time')
                ax4.set_title('Response Time (ms)')
        
        ani = FuncAnimation(fig, animate, interval=1000, cache_frame_data=False)
        plt.tight_layout()
        plt.show()

# Usage
monitor = PerformanceMonitor("YOUR_API_KEY", update_interval=30)

# Start monitoring in a background daemon thread
monitor_thread = threading.Thread(target=monitor.start_monitoring, daemon=True)
monitor_thread.start()

# Show dashboard (uncomment to run; blocks until the window is closed)
# monitor.create_dashboard()
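The boolean start/stop flag above can also be expressed with a `threading.Event`, so the sleep between polls wakes up immediately when monitoring is stopped instead of waiting out the full interval. A minimal sketch, with `fetch` standing in for `PerformanceMonitor.fetch_metrics`:

```python
import threading

def poll_until_stopped(fetch, sink, stop_event, interval=30):
    """Call fetch() every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        sink.append(fetch())
        stop_event.wait(interval)  # returns early the moment the event is set

stop = threading.Event()
samples = []

def fake_fetch():
    # Stand-in for an API call; request a stop once two samples are collected,
    # so the third fetch is the last.
    if len(samples) >= 2:
        stop.set()
    return {"healthScore": 87}

poll_until_stopped(fake_fetch, samples, stop, interval=0.01)
print(len(samples))  # 3
```

Calling `stop.set()` from any thread (for example, a signal handler) shuts the loop down promptly.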

Performance Anomaly Detection

Detect unusual performance patterns:
Python
import numpy as np
import requests

class AnomalyDetector:
    def __init__(self, api_key):
        self.api_key = api_key
        self.baseline_data = {}
    
    def establish_baseline(self, days=7):
        """Establish performance baseline over specified days"""
        response = requests.get(
            "https://api.tensorone.ai/v2/monitoring/performance",
            headers={"Authorization": f"Bearer {self.api_key}"},
            params={
                "resource": "all",
                "timeRange": f"{days}d",
                "granularity": "1h",
                "aggregation": "avg"
            }
        )
        
        data = response.json()
        
        # Extract baseline metrics
        cpu_values = []
        memory_values = []
        gpu_values = []
        response_times = []
        
        for resource in data['resources']:
            for point in resource['timeSeries']:
                values = point['values']
                cpu_values.append(values.get('cpu_utilization', 0))
                memory_values.append(values.get('memory_utilization', 0))
                gpu_values.append(values.get('gpu_utilization', 0))
                response_times.append(values.get('response_time', 0))
        
        self.baseline_data = {
            'cpu': {'mean': np.mean(cpu_values), 'std': np.std(cpu_values)},
            'memory': {'mean': np.mean(memory_values), 'std': np.std(memory_values)},
            'gpu': {'mean': np.mean(gpu_values), 'std': np.std(gpu_values)},
            'response_time': {'mean': np.mean(response_times), 'std': np.std(response_times)}
        }
        
        print("Baseline established:")
        for metric, baseline in self.baseline_data.items():
            print(f"  {metric}: μ={baseline['mean']:.1f}, σ={baseline['std']:.1f}")
    
    def detect_anomalies(self, threshold=2.0):
        """Detect performance anomalies using z-score"""
        response = requests.get(
            "https://api.tensorone.ai/v2/monitoring/performance",
            headers={"Authorization": f"Bearer {self.api_key}"},
            params={
                "resource": "all",
                "timeRange": "1h",
                "granularity": "5m",
                "aggregation": "avg"
            }
        )
        
        data = response.json()
        anomalies = []
        
        for resource in data['resources']:
            resource_anomalies = []
            
            for point in resource['timeSeries']:
                values = point['values']
                timestamp = point['timestamp']
                
                for metric in ['cpu_utilization', 'memory_utilization', 'gpu_utilization', 'response_time']:
                    baseline_key = metric.replace('_utilization', '')
                    if baseline_key in self.baseline_data:
                        current_value = values.get(metric, 0)
                        baseline = self.baseline_data[baseline_key]
                        
                        # Calculate z-score
                        if baseline['std'] > 0:
                            z_score = abs(current_value - baseline['mean']) / baseline['std']
                            
                            if z_score > threshold:
                                resource_anomalies.append({
                                    'timestamp': timestamp,
                                    'metric': metric,
                                    'value': current_value,
                                    'baseline_mean': baseline['mean'],
                                    'z_score': z_score,
                                    'severity': 'high' if z_score > 3 else 'medium'
                                })
            
            if resource_anomalies:
                anomalies.append({
                    'resource': resource['resourceName'],
                    'resource_id': resource['resourceId'],
                    'anomalies': resource_anomalies
                })
        
        return anomalies
    
    def generate_report(self, anomalies):
        """Generate anomaly detection report"""
        if not anomalies:
            print("✅ No performance anomalies detected")
            return
        
        print(f"🚨 Detected {len(anomalies)} resources with performance anomalies:")
        
        for resource_anomaly in anomalies:
            print(f"\n📊 {resource_anomaly['resource']} ({resource_anomaly['resource_id']}):")
            
            for anomaly in resource_anomaly['anomalies']:
                severity_icon = "🔴" if anomaly['severity'] == 'high' else "🟡"
                print(f"  {severity_icon} {anomaly['metric']}: {anomaly['value']:.1f} "
                      f"(baseline: {anomaly['baseline_mean']:.1f}, z-score: {anomaly['z_score']:.2f})")
                print(f"     Time: {anomaly['timestamp']}")

# Usage
detector = AnomalyDetector("YOUR_API_KEY")

# Establish baseline
detector.establish_baseline(days=7)

# Detect current anomalies
anomalies = detector.detect_anomalies(threshold=2.0)
detector.generate_report(anomalies)
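The z-score rule inside `detect_anomalies` can be isolated as a pure function, which makes the severity boundaries (flagged above the threshold, "high" above z = 3) easy to verify without any API calls:

```python
# The same z-score classification used by AnomalyDetector.detect_anomalies,
# extracted as a standalone function.
def classify_anomaly(value, mean, std, threshold=2.0):
    """Return None (normal), 'medium', or 'high' for one observation."""
    if std <= 0:
        return None  # degenerate baseline; cannot compute a z-score
    z = abs(value - mean) / std
    if z <= threshold:
        return None
    return "high" if z > 3 else "medium"

print(classify_anomaly(92.8, 72.0, 8.0))  # z = 2.6  -> 'medium'
print(classify_anomaly(99.0, 72.0, 8.0))  # z ≈ 3.4  -> 'high'
print(classify_anomaly(75.0, 72.0, 8.0))  # z ≈ 0.4  -> None
```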

Performance Optimization Recommendations

Get AI-powered optimization suggestions:
Python
def get_optimization_recommendations(resource_id=None):
    """Get performance optimization recommendations"""
    params = {
        'analyzeBottlenecks': 'true',   # send as strings; requests would encode Python True as "True"
        'includeRecommendations': 'true',
        'timeRange': '24h'
    }
    
    if resource_id:
        params['resourceIds'] = [resource_id]
    
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/performance/optimization",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params=params
    )
    
    return response.json()

def analyze_performance_bottlenecks():
    """Analyze and provide optimization recommendations"""
    recommendations = get_optimization_recommendations()
    
    print("🔧 Performance Optimization Recommendations:")
    print("=" * 50)
    
    for resource in recommendations['resources']:
        print(f"\n📈 {resource['resourceName']} ({resource['resourceType']}):")
        
        # Resource-specific recommendations
        if 'recommendations' in resource:
            for rec in resource['recommendations']:
                priority_icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(rec['priority'], "⚪")
                print(f"  {priority_icon} {rec['title']}")
                print(f"     {rec['description']}")
                if 'estimatedImprovement' in rec:
                    print(f"     💡 Expected improvement: {rec['estimatedImprovement']}")
                print()
        
        # Bottleneck analysis
        if 'bottlenecks' in resource:
            print("  🚧 Identified Bottlenecks:")
            for bottleneck in resource['bottlenecks']:
                print(f"    - {bottleneck['component']}: {bottleneck['description']}")
                print(f"      Impact: {bottleneck['impact']}")
    
    # Platform-wide recommendations
    if 'platformRecommendations' in recommendations:
        print("\n🌐 Platform-wide Recommendations:")
        for rec in recommendations['platformRecommendations']:
            print(f"  • {rec['title']}")
            print(f"    {rec['description']}")
            if 'costImpact' in rec:
                print(f"    💰 Cost impact: {rec['costImpact']}")

# Run optimization analysis
analyze_performance_bottlenecks()

Best Practices

Monitoring Strategy

  • Granularity: Use appropriate time granularity for your monitoring needs
  • Baseline Establishment: Establish performance baselines for anomaly detection
  • Alert Thresholds: Set meaningful thresholds based on historical data
  • Resource-Specific Monitoring: Monitor different resource types with appropriate metrics
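The alert-threshold bullet can be made concrete: rather than picking a fixed number, derive the threshold from historical samples. A minimal sketch using a nearest-rank percentile plus a small margin (both the percentile and the margin are illustrative choices, not platform defaults):

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def suggest_threshold(history, pct=95, margin=2.0):
    """Alert when a metric exceeds its historical p95 by more than `margin`."""
    return percentile(history, pct) + margin

history = [62, 65, 70, 68, 72, 74, 71, 69, 88, 66]  # hourly CPU averages (%)
print(suggest_threshold(history))  # 90.0
```

Recompute the threshold periodically (for example, weekly) so it tracks gradual workload changes.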

Performance Optimization

  • Regular Analysis: Review performance metrics regularly for optimization opportunities
  • Bottleneck Identification: Focus on the most constrained resources first
  • Capacity Planning: Use trends to predict future resource needs
  • Cost Optimization: Balance performance with cost considerations
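For the capacity-planning bullet, a least-squares line over recent averages gives a first-order projection of future utilization. A dependency-free sketch (real capacity planning would also account for seasonality and growth spurts):

```python
def linear_forecast(values, steps_ahead):
    """Fit y = a*x + b to evenly spaced samples and extrapolate forward."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var if var else 0.0
    intercept = mean_y - slope * mean_x
    return slope * (n - 1 + steps_ahead) + intercept

daily_gpu = [70, 72, 74, 76, 78, 80, 82]  # last 7 days, average GPU %
print(linear_forecast(daily_gpu, 7))  # 96.0 -> nearing capacity within a week
```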

Data Retention

  • Historical Data: Keep sufficient historical data for trend analysis
  • Aggregation: Use appropriate aggregation for long-term storage
  • Archive Strategy: Archive old detailed metrics while keeping summaries
  • Compliance: Ensure data retention meets compliance requirements
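The aggregation and archive bullets above can be sketched as a simple downsampling step: collapse fine-grained points into hourly averages before archiving. The points use the `timestamp`/`values` shape of the `timeSeries` entries in this endpoint's response:

```python
from datetime import datetime

def hourly_averages(points, metric):
    """Average one metric per UTC hour; points are {'timestamp', 'values'} dicts."""
    buckets = {}
    for point in points:
        ts = datetime.fromisoformat(point["timestamp"].replace("Z", "+00:00"))
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, []).append(point["values"][metric])
    return {h.isoformat(): sum(v) / len(v) for h, v in sorted(buckets.items())}

points = [
    {"timestamp": "2024-01-16T17:15:00Z", "values": {"gpu_utilization": 90.0}},
    {"timestamp": "2024-01-16T17:45:00Z", "values": {"gpu_utilization": 92.8}},
    {"timestamp": "2024-01-16T18:00:00Z", "values": {"gpu_utilization": 94.0}},
]
print(hourly_averages(points, "gpu_utilization"))
```

Archiving the hourly summaries while expiring the raw points keeps long-term trend analysis possible at a fraction of the storage cost.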
Performance metrics are updated every 30 seconds for real-time monitoring. Historical data is available for up to 90 days at full granularity.
Use performance baselines and anomaly detection to proactively identify issues before they impact users. Set up automated alerts for critical performance thresholds.