Create, manage, and respond to intelligent alerts that notify you of issues, anomalies, and performance degradation across your TensorOne infrastructure.

Request Parameters

status
string
Default: "active"
Alert status filter:
  • active - Currently active alerts
  • resolved - Recently resolved alerts
  • all - All alerts regardless of status
  • acknowledged - Alerts that have been acknowledged
  • suppressed - Temporarily suppressed alerts
severity
array
Filter by alert severity levels:
  • critical - System down or severe impact
  • high - High impact on performance or availability
  • medium - Noticeable impact, requires attention
  • low - Minor issues or early warnings
  • info - Informational alerts
category
array
Alert categories to include:
  • performance - Performance degradation alerts
  • availability - Service availability issues
  • capacity - Resource capacity warnings
  • security - Security-related alerts
  • cost - Cost threshold alerts
  • maintenance - Maintenance and update alerts
resource
string
Filter alerts by resource type:
  • clusters - GPU cluster alerts
  • endpoints - Serverless endpoint alerts
  • training - Training job alerts
  • ai-services - AI service alerts
  • infrastructure - Platform infrastructure alerts
timeRange
string
Default: "24h"
Time range for alert history:
  • 1h - Last hour
  • 6h - Last 6 hours
  • 24h - Last 24 hours
  • 7d - Last 7 days
  • 30d - Last 30 days
resourceIds
array
Specific resource IDs to filter alerts for
tags
array
Filter alerts by custom tags
limit
integer
Default: 50
Maximum number of alerts to return (1-500)
offset
integer
Default: 0
Number of alerts to skip for pagination
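
All of these filters can be combined in a single request. The following Python sketch queries the endpoint with several of the parameters above; the URL and parameter names match this page, and YOUR_API_KEY is a placeholder.
Python
import requests

# Fetch active critical and high alerts from the last 24 hours (minimal sketch)
response = requests.get(
    "https://api.tensorone.ai/v2/monitoring/alerts",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={
        "status": "active",
        "severity[]": ["critical", "high"],  # a list value repeats the key in the query string
        "timeRange": "24h",
        "limit": 20,
    },
)
response.raise_for_status()
print(f"Fetched {len(response.json()['alerts'])} alerts")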

Response

alerts
array
Array of alert objects
summary
object
Alert summary statistics
pagination
object
Pagination information

Example

curl -G "https://api.tensorone.ai/v2/monitoring/alerts" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "status=active" \
  -d "severity[]=critical" \
  -d "severity[]=high" \
  -d "timeRange=24h" \
  -d "limit=20"
{
  "alerts": [
    {
      "alertId": "alert-critical-001",
      "title": "GPU Cluster High Memory Utilization",
      "description": "Memory utilization has exceeded 95% for more than 10 minutes on GPU cluster",
      "severity": "critical",
      "category": "performance",
      "status": "active",
      "source": {
        "resourceType": "cluster",
        "resourceId": "cluster-gpu-a100-001",
        "resourceName": "GPU Cluster A100-001",
        "region": "us-east-1"
      },
      "trigger": {
        "metric": "memory_utilization",
        "condition": "greater_than",
        "threshold": 90,
        "currentValue": 97.3,
        "duration": "12m 34s"
      },
      "timestamps": {
        "triggered": "2024-01-16T17:45:00Z",
        "lastUpdated": "2024-01-16T17:57:34Z",
        "acknowledged": null,
        "resolved": null
      },
      "impact": {
        "affectedUsers": 42,
        "serviceImpact": "significant",
        "estimatedCost": 125.50,
        "slaImpact": true
      },
      "recommendations": [
        "Scale up cluster to add more memory capacity",
        "Identify and terminate memory-intensive processes",
        "Enable automatic scaling if not already configured"
      ],
      "tags": ["production", "gpu", "memory"],
      "assignee": "ops-team"
    },
    {
      "alertId": "alert-high-002",
      "title": "API Response Time Degradation",
      "description": "Average API response time has increased by 150% over the last 30 minutes",
      "severity": "high",
      "category": "performance",
      "status": "active",
      "source": {
        "resourceType": "api",
        "resourceId": "api-gateway-main",
        "resourceName": "Main API Gateway",
        "region": "global"
      },
      "trigger": {
        "metric": "average_response_time",
        "condition": "greater_than",
        "threshold": 500,
        "currentValue": 847,
        "duration": "32m 18s"
      },
      "timestamps": {
        "triggered": "2024-01-16T17:30:00Z",
        "lastUpdated": "2024-01-16T18:02:18Z",
        "acknowledged": "2024-01-16T17:35:00Z",
        "resolved": null
      },
      "impact": {
        "affectedUsers": 156,
        "serviceImpact": "moderate",
        "estimatedCost": 75.25,
        "slaImpact": false
      },
      "recommendations": [
        "Check for database connection issues",
        "Review recent deployments for performance regressions",
        "Consider enabling API caching for frequent requests"
      ],
      "tags": ["api", "performance", "response-time"],
      "assignee": "backend-team"
    }
  ],
  "summary": {
    "total": 8,
    "bySeverity": {
      "critical": 1,
      "high": 2,
      "medium": 3,
      "low": 2,
      "info": 0
    },
    "byCategory": {
      "performance": 5,
      "availability": 1,
      "capacity": 1,
      "security": 1
    },
    "byStatus": {
      "active": 6,
      "acknowledged": 2,
      "resolved": 0
    },
    "trends": {
      "last24h": 8,
      "previousDay": 12,
      "weeklyAverage": 15.3
    }
  },
  "pagination": {
    "limit": 20,
    "offset": 0,
    "total": 8,
    "hasMore": false
  }
}
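
When more alerts exist than the requested limit, the pagination object drives paging. Below is a minimal sketch that walks through all pages using the limit, offset, and hasMore fields shown in the response above; the helper name is illustrative.
Python
import requests

def iter_alerts(api_key, page_size=50, **filters):
    """Yield alerts page by page using the limit/offset pagination shown above."""
    offset = 0
    while True:
        response = requests.get(
            "https://api.tensorone.ai/v2/monitoring/alerts",
            headers={"Authorization": f"Bearer {api_key}"},
            params={"limit": page_size, "offset": offset, **filters},
        )
        response.raise_for_status()
        payload = response.json()
        yield from payload["alerts"]
        if not payload["pagination"]["hasMore"]:
            break
        offset += page_size

# Example: count active alerts by severity across all pages
counts = {}
for alert in iter_alerts("YOUR_API_KEY", status="active"):
    counts[alert["severity"]] = counts.get(alert["severity"], 0) + 1
print(counts)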

Alert Management Operations

Create Custom Alert Rules

Python
import requests

def create_alert_rule(name, description, conditions, actions):
    """Create a custom alert rule"""
    rule_data = {
        "name": name,
        "description": description,
        "enabled": True,
        "conditions": conditions,
        "actions": actions,
        "severity": "medium",
        "category": "custom"
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/rules",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=rule_data
    )
    
    return response.json()

# Create GPU utilization alert
gpu_alert_rule = create_alert_rule(
    name="High GPU Utilization",
    description="Alert when GPU utilization exceeds 95% for more than 5 minutes",
    conditions=[
        {
            "metric": "gpu_utilization",
            "condition": "greater_than",
            "threshold": 95,
            "duration": "5m",
            "resourceType": "cluster"
        }
    ],
    actions=[
        {
            "type": "email",
            "recipients": ["ops@company.com"],
            "template": "gpu_high_utilization"
        },
        {
            "type": "webhook",
            "url": "https://hooks.slack.com/your/webhook/url",
            "payload": {"channel": "#alerts"}
        }
    ]
)

print(f"Created alert rule: {gpu_alert_rule['ruleId']}")

Acknowledge and Resolve Alerts

Python
from datetime import datetime
import requests

def acknowledge_alert(alert_id, assignee=None, notes=None):
    """Acknowledge an alert"""
    data = {
        "action": "acknowledge",
        "assignee": assignee,
        "notes": notes,
        "timestamp": datetime.utcnow().isoformat() + "Z"
    }
    
    response = requests.post(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/action",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    
    return response.json()

def resolve_alert(alert_id, resolution_notes, resolution_time=None):
    """Resolve an alert"""
    data = {
        "action": "resolve",
        "resolutionNotes": resolution_notes,
        "resolutionTime": resolution_time or datetime.utcnow().isoformat() + "Z"
    }
    
    response = requests.post(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/action",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    
    return response.json()

# Example usage
acknowledge_result = acknowledge_alert(
    "alert-critical-001",
    assignee="john.doe@company.com",
    notes="Investigating memory usage patterns. Scaling up cluster resources."
)

resolve_result = resolve_alert(
    "alert-critical-001",
    resolution_notes="Added additional memory to cluster. Utilization now at 78%."
)

print(f"Alert acknowledged: {acknowledge_result['success']}")
print(f"Alert resolved: {resolve_result['success']}")

Bulk Alert Operations

Python
import requests

def bulk_alert_operations(alert_ids, action, **kwargs):
    """Perform bulk operations on multiple alerts"""
    data = {
        "alertIds": alert_ids,
        "action": action,
        **kwargs
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/bulk",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    
    return response.json()

# Bulk acknowledge alerts
alert_ids = ["alert-001", "alert-002", "alert-003"]
bulk_result = bulk_alert_operations(
    alert_ids,
    action="acknowledge",
    assignee="incident-team@company.com",
    notes="Bulk acknowledged for incident response"
)

print("Bulk operation results:")
for result in bulk_result['results']:
    print(f"  {result['alertId']}: {result['status']}")

Advanced Alert Features

Smart Alert Grouping

Group related alerts to reduce noise:
Python
import requests

def get_grouped_alerts(grouping_criteria="resource"):
    """Get alerts grouped by specified criteria"""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts/grouped",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={
            "groupBy": grouping_criteria,
            "status": "active",
            "timeRange": "24h"
        }
    )
    
    return response.json()

# Get alerts grouped by resource
grouped_alerts = get_grouped_alerts("resource")

print("📊 Grouped Alert Summary:")
print("=" * 30)

for group in grouped_alerts['groups']:
    alert_count = len(group['alerts'])
    severity_counts = {}
    
    for alert in group['alerts']:
        severity = alert['severity']
        severity_counts[severity] = severity_counts.get(severity, 0) + 1
    
    print(f"\n🔧 {group['groupKey']} ({alert_count} alerts)")
    
    for severity, count in severity_counts.items():
        severity_icon = {
            'critical': '🔴',
            'high': '🟠',
            'medium': '🟡',
            'low': '🟢'
        }.get(severity, '⚪')
        print(f"  {severity_icon} {severity}: {count}")
    
    # Show most critical alert in group
    critical_alerts = [a for a in group['alerts'] if a['severity'] == 'critical']
    if critical_alerts:
        alert = critical_alerts[0]
        print(f"  📍 Most critical: {alert['title']}")

Alert Correlation and Root Cause Analysis

Python
import requests

def get_alert_correlations(alert_id):
    """Get correlated alerts and potential root causes"""
    response = requests.get(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/correlations",
        headers={"Authorization": "Bearer YOUR_API_KEY"}
    )
    
    return response.json()

def analyze_root_cause(alert_id):
    """Analyze potential root causes for an alert"""
    correlations = get_alert_correlations(alert_id)
    
    print(f"🔍 Root Cause Analysis for Alert: {alert_id}")
    print("=" * 50)
    
    if 'rootCauses' in correlations:
        print("\n🎯 Potential Root Causes:")
        for cause in correlations['rootCauses'][:3]:  # Top 3
            confidence_bar = "█" * int(cause['confidence'] * 10)
            print(f"  {confidence_bar} {cause['confidence']:.0%} - {cause['description']}")
            if cause.get('evidence'):
                print(f"    Evidence: {cause['evidence']}")
    
    if 'correlatedAlerts' in correlations:
        print(f"\n🔗 Correlated Alerts ({len(correlations['correlatedAlerts'])}):")
        for corr_alert in correlations['correlatedAlerts'][:5]:
            print(f"  • {corr_alert['title']} (correlation: {corr_alert['correlationScore']:.2f})")
    
    if 'timeline' in correlations:
        print("\n⏰ Event Timeline:")
        for event in correlations['timeline']:
            print(f"  {event['timestamp']} - {event['description']}")

# Example usage
analyze_root_cause("alert-critical-001")

Predictive Alerting

Python
import requests

def get_predictive_alerts():
    """Get predictive alerts based on trends and anomalies"""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts/predictive",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={
            "predictionWindow": "2h",
            "confidenceThreshold": 0.7
        }
    )
    
    return response.json()

def display_predictive_alerts():
    """Display predictive alerts dashboard"""
    predictions = get_predictive_alerts()
    
    print("🔮 Predictive Alert Dashboard")
    print("=" * 35)
    
    if predictions['predictions']:
        print(f"\n⚠️  {len(predictions['predictions'])} Potential Issues Detected:")
        
        for prediction in predictions['predictions']:
            confidence_str = f"{prediction['confidence']:.0%}"
            eta = prediction['estimatedTimeToIssue']
            
            risk_icon = "🔴" if prediction['confidence'] > 0.9 else "🟡"
            
            print(f"\n{risk_icon} {prediction['title']} ({confidence_str} confidence)")
            print(f"   Resource: {prediction['resource']['name']}")
            print(f"   Estimated time to issue: {eta}")
            print(f"   Predicted impact: {prediction['predictedSeverity']}")
            
            if prediction['recommendations']:
                print(f"   💡 Preventive action: {prediction['recommendations'][0]}")
    else:
        print("\n✅ No potential issues detected in the next 2 hours")
    
    # Trend analysis
    if 'trends' in predictions:
        print(f"\n📈 Trend Analysis:")
        for trend in predictions['trends']:
            direction = "📈" if trend['direction'] == 'increasing' else "📉"
            print(f"  {direction} {trend['metric']}: {trend['description']}")

display_predictive_alerts()

Integration and Automation

Webhook Integration

Python
import requests

def setup_webhook_integration(webhook_url, events=None):
    """Set up webhook integration for alerts"""
    integration_data = {
        "type": "webhook",
        "name": "Alert Webhook Integration",
        "config": {
            "url": webhook_url,
            "method": "POST",
            "headers": {
                "Content-Type": "application/json",
                "Authorization": "Bearer YOUR_WEBHOOK_TOKEN"
            }
        },
        "events": events or [
            "alert.triggered",
            "alert.resolved",
            "alert.acknowledged"
        ],
        "filters": {
            "severity": ["critical", "high"],
            "category": ["performance", "availability"]
        }
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/integrations",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=integration_data
    )
    
    return response.json()

# Set up Slack webhook integration
slack_integration = setup_webhook_integration(
    "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
    events=["alert.triggered", "alert.resolved"]
)

print(f"Webhook integration created: {slack_integration['integrationId']}")

Automated Response Actions

Python
import requests

def create_automated_response(trigger_conditions, actions):
    """Create automated response for specific alert conditions"""
    automation_data = {
        "name": "Auto Scale on High CPU",
        "description": "Automatically scale clusters when CPU utilization is high",
        "enabled": True,
        "triggerConditions": trigger_conditions,
        "actions": actions,
        "cooldownPeriod": "10m"  # Prevent rapid successive triggers
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/automations",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=automation_data
    )
    
    return response.json()

# Create auto-scaling automation
auto_scale_response = create_automated_response(
    trigger_conditions=[
        {
            "alertCategory": "performance",
            "metric": "cpu_utilization",
            "threshold": 85,
            "duration": "5m",
            "resourceType": "cluster"
        }
    ],
    actions=[
        {
            "type": "scale_cluster",
            "parameters": {
                "scaleDirection": "up",
                "scaleAmount": 1
            }
        },
        {
            "type": "notify",
            "parameters": {
                "channel": "slack",
                "message": "Auto-scaled cluster due to high CPU utilization"
            }
        }
    ]
)

print(f"Automation created: {auto_scale_response['automationId']}")

Best Practices

Alert Configuration

  • Meaningful Thresholds: Set thresholds based on actual impact, not arbitrary numbers
  • Appropriate Severity: Match severity to business impact
  • Clear Descriptions: Write clear, actionable alert descriptions
  • Proper Categorization: Use consistent categories for easy filtering

Alert Management

  • Timely Response: Acknowledge critical alerts within minutes (see the escalation sketch after this list)
  • Documentation: Document resolution steps for common issues
  • Post-Incident Reviews: Analyze alerts after incidents to improve detection
  • Regular Tuning: Regularly review and adjust alert thresholds
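
To keep response times honest, you can periodically look for critical alerts that remain unacknowledged and escalate them. The sketch below combines the list endpoint and the acknowledge action documented above; the 15-minute threshold and the escalation assignee are assumptions.
Python
from datetime import datetime, timezone
import requests

ALERTS_API = "https://api.tensorone.ai/v2/monitoring/alerts"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def escalate_stale_critical_alerts(max_unacked_minutes=15):
    """Acknowledge and reassign critical alerts that nobody has picked up."""
    response = requests.get(
        ALERTS_API,
        headers=HEADERS,
        params={"status": "active", "severity[]": "critical"},
    )
    response.raise_for_status()
    now = datetime.now(timezone.utc)

    for alert in response.json()["alerts"]:
        if alert["timestamps"]["acknowledged"] is not None:
            continue
        triggered = datetime.fromisoformat(
            alert["timestamps"]["triggered"].replace("Z", "+00:00")
        )
        age_minutes = (now - triggered).total_seconds() / 60
        if age_minutes > max_unacked_minutes:
            requests.post(
                f"{ALERTS_API}/{alert['alertId']}/action",
                headers={**HEADERS, "Content-Type": "application/json"},
                json={
                    "action": "acknowledge",
                    "assignee": "escalation-team@company.com",  # assumed escalation contact
                    "notes": f"Auto-escalated after {int(age_minutes)} minutes unacknowledged",
                },
            )

escalate_stale_critical_alerts()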

Noise Reduction

  • Alert Grouping: Group related alerts to reduce noise
  • Intelligent Suppression: Suppress redundant alerts during maintenance windows (see the sketch after this list)
  • Escalation Policies: Implement proper escalation for unhandled alerts
  • Regular Cleanup: Remove obsolete or ineffective alert rules
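
For planned maintenance, suppressing alerts up front is cleaner than acknowledging them afterwards. The sketch below assumes a hypothetical /monitoring/alerts/suppressions endpoint (this page documents the suppressed status but not the endpoint itself), so treat the URL and field names as placeholders for whatever suppression mechanism your account exposes.
Python
import requests

def suppress_during_maintenance(resource_id, start_iso, end_iso, reason):
    """Suppress alerts for a resource during a maintenance window (hypothetical endpoint)."""
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/suppressions",  # assumed endpoint
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json",
        },
        json={
            "resourceIds": [resource_id],
            "startTime": start_iso,
            "endTime": end_iso,
            "reason": reason,
        },
    )
    response.raise_for_status()
    return response.json()

suppress_during_maintenance(
    "cluster-gpu-a100-001",
    "2024-01-20T02:00:00Z",
    "2024-01-20T04:00:00Z",
    "Planned kernel upgrade",
)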
Alert data is retained for 90 days. Configure webhook integrations to maintain longer historical records in your external systems.
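
If you need history beyond that window and prefer pull over push, a small export job can archive alerts on a schedule. Here is a minimal sketch that appends the last day's resolved alerts to a local JSON Lines file, using only the list endpoint documented above; the file path and helper name are illustrative.
Python
import json
import requests

def archive_resolved_alerts(path="alerts-archive.jsonl"):
    """Append the last 24 hours of resolved alerts to a JSON Lines archive."""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={"status": "resolved", "timeRange": "24h", "limit": 500},
    )
    response.raise_for_status()
    with open(path, "a") as archive:
        for alert in response.json()["alerts"]:
            archive.write(json.dumps(alert) + "\n")

archive_resolved_alerts()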
Use alert correlation and predictive features to move from reactive to proactive monitoring. Focus on alerts that indicate real business impact rather than just technical metrics.