Create, manage, and respond to intelligent alerts that notify you of issues, anomalies, and performance degradation across your TensorOne infrastructure.

Request Parameters

status
string
Default: "active"
Alert status filter:
  • active - Currently active alerts
  • resolved - Recently resolved alerts
  • all - All alerts regardless of status
  • acknowledged - Alerts that have been acknowledged
  • suppressed - Temporarily suppressed alerts
severity
array
Filter by alert severity levels:
  • critical - System down or severe impact
  • high - High impact on performance or availability
  • medium - Noticeable impact, requires attention
  • low - Minor issues or early warnings
  • info - Informational alerts
category
array
Alert categories to include:
  • performance - Performance degradation alerts
  • availability - Service availability issues
  • capacity - Resource capacity warnings
  • security - Security-related alerts
  • cost - Cost threshold alerts
  • maintenance - Maintenance and update alerts
resource
string
Filter alerts by resource type:
  • clusters - GPU cluster alerts
  • endpoints - Serverless endpoint alerts
  • training - Training job alerts
  • ai-services - AI service alerts
  • infrastructure - Platform infrastructure alerts
timeRange
string
Default: "24h"
Time range for alert history:
  • 1h - Last hour
  • 6h - Last 6 hours
  • 24h - Last 24 hours
  • 7d - Last 7 days
  • 30d - Last 30 days
resourceIds
array
Specific resource IDs to filter alerts for
tags
array
Filter alerts by custom tags
limit
integer
Default: 50
Maximum number of alerts to return (1-500)
offset
integer
Default: 0
Number of alerts to skip for pagination
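
All of these filters can be combined in a single request. The following Python sketch queries the endpoint with several of the parameters above; the URL and parameter names match this page, and YOUR_API_KEY is a placeholder.
Python
import requests

# Fetch active critical and high alerts from the last 24 hours (minimal sketch)
response = requests.get(
    "https://api.tensorone.ai/v2/monitoring/alerts",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={
        "status": "active",
        "severity[]": ["critical", "high"],  # a list value repeats the key in the query string
        "timeRange": "24h",
        "limit": 20,
    },
)
response.raise_for_status()
print(f"Fetched {len(response.json()['alerts'])} alerts")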

Response

alerts
array
Array of alert objects
summary
object
Alert summary statistics
pagination
object
Pagination information

Example

curl -G "https://api.tensorone.ai/v2/monitoring/alerts" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "status=active" \
  -d "severity[]=critical" \
  -d "severity[]=high" \
  -d "timeRange=24h" \
  -d "limit=20"
{
  "alerts": [
    {
      "alertId": "alert-critical-001",
      "title": "GPU Cluster High Memory Utilization",
      "description": "Memory utilization has exceeded 95% for more than 10 minutes on GPU cluster",
      "severity": "critical",
      "category": "performance",
      "status": "active",
      "source": {
        "resourceType": "cluster",
        "resourceId": "cluster-gpu-a100-001",
        "resourceName": "GPU Cluster A100-001",
        "region": "us-east-1"
      },
      "trigger": {
        "metric": "memory_utilization",
        "condition": "greater_than",
        "threshold": 90,
        "currentValue": 97.3,
        "duration": "12m 34s"
      },
      "timestamps": {
        "triggered": "2024-01-16T17:45:00Z",
        "lastUpdated": "2024-01-16T17:57:34Z",
        "acknowledged": null,
        "resolved": null
      },
      "impact": {
        "affectedUsers": 42,
        "serviceImpact": "significant",
        "estimatedCost": 125.50,
        "slaImpact": true
      },
      "recommendations": [
        "Scale up cluster to add more memory capacity",
        "Identify and terminate memory-intensive processes",
        "Enable automatic scaling if not already configured"
      ],
      "tags": ["production", "gpu", "memory"],
      "assignee": "ops-team"
    },
    {
      "alertId": "alert-high-002",
      "title": "API Response Time Degradation",
      "description": "Average API response time has increased by 150% over the last 30 minutes",
      "severity": "high",
      "category": "performance",
      "status": "active",
      "source": {
        "resourceType": "api",
        "resourceId": "api-gateway-main",
        "resourceName": "Main API Gateway",
        "region": "global"
      },
      "trigger": {
        "metric": "average_response_time",
        "condition": "greater_than",
        "threshold": 500,
        "currentValue": 847,
        "duration": "32m 18s"
      },
      "timestamps": {
        "triggered": "2024-01-16T17:30:00Z",
        "lastUpdated": "2024-01-16T18:02:18Z",
        "acknowledged": "2024-01-16T17:35:00Z",
        "resolved": null
      },
      "impact": {
        "affectedUsers": 156,
        "serviceImpact": "moderate",
        "estimatedCost": 75.25,
        "slaImpact": false
      },
      "recommendations": [
        "Check for database connection issues",
        "Review recent deployments for performance regressions",
        "Consider enabling API caching for frequent requests"
      ],
      "tags": ["api", "performance", "response-time"],
      "assignee": "backend-team"
    }
  ],
  "summary": {
    "total": 8,
    "bySeverity": {
      "critical": 1,
      "high": 2,
      "medium": 3,
      "low": 2,
      "info": 0
    },
    "byCategory": {
      "performance": 5,
      "availability": 1,
      "capacity": 1,
      "security": 1
    },
    "byStatus": {
      "active": 6,
      "acknowledged": 2,
      "resolved": 0
    },
    "trends": {
      "last24h": 8,
      "previousDay": 12,
      "weeklyAverage": 15.3
    }
  },
  "pagination": {
    "limit": 20,
    "offset": 0,
    "total": 8,
    "hasMore": false
  }
}
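
When more alerts exist than the requested limit, the pagination object drives paging. Below is a minimal sketch that walks through all pages using the limit, offset, and hasMore fields shown in the response above; the helper name is illustrative.
Python
import requests

def iter_alerts(api_key, page_size=50, **filters):
    """Yield alerts page by page using the limit/offset pagination shown above."""
    offset = 0
    while True:
        response = requests.get(
            "https://api.tensorone.ai/v2/monitoring/alerts",
            headers={"Authorization": f"Bearer {api_key}"},
            params={"limit": page_size, "offset": offset, **filters},
        )
        response.raise_for_status()
        payload = response.json()
        yield from payload["alerts"]
        if not payload["pagination"]["hasMore"]:
            break
        offset += page_size

# Example: count active alerts by severity across all pages
counts = {}
for alert in iter_alerts("YOUR_API_KEY", status="active"):
    counts[alert["severity"]] = counts.get(alert["severity"], 0) + 1
print(counts)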

Alert Management Operations

Create Custom Alert Rules

Python
import requests

def create_alert_rule(name, description, conditions, actions):
    """Create a custom alert rule"""
    rule_data = {
        "name": name,
        "description": description,
        "enabled": True,
        "conditions": conditions,
        "actions": actions,
        "severity": "medium",
        "category": "custom"
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/rules",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=rule_data
    )
    
    return response.json()

# Create GPU utilization alert
gpu_alert_rule = create_alert_rule(
    name="High GPU Utilization",
    description="Alert when GPU utilization exceeds 95% for more than 5 minutes",
    conditions=[
        {
            "metric": "gpu_utilization",
            "condition": "greater_than",
            "threshold": 95,
            "duration": "5m",
            "resourceType": "cluster"
        }
    ],
    actions=[
        {
            "type": "email",
            "recipients": ["ops@company.com"],
            "template": "gpu_high_utilization"
        },
        {
            "type": "webhook",
            "url": "https://hooks.slack.com/your/webhook/url",
            "payload": {"channel": "#alerts"}
        }
    ]
)

print(f"Created alert rule: {gpu_alert_rule['ruleId']}")

Acknowledge and Resolve Alerts

Python
from datetime import datetime
import requests

def acknowledge_alert(alert_id, assignee=None, notes=None):
    """Acknowledge an alert"""
    data = {
        "action": "acknowledge",
        "assignee": assignee,
        "notes": notes,
        "timestamp": datetime.utcnow().isoformat() + "Z"
    }
    
    response = requests.post(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/action",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    
    return response.json()

def resolve_alert(alert_id, resolution_notes, resolution_time=None):
    """Resolve an alert"""
    data = {
        "action": "resolve",
        "resolutionNotes": resolution_notes,
        "resolutionTime": resolution_time or datetime.utcnow().isoformat() + "Z"
    }
    
    response = requests.post(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/action",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    
    return response.json()

# Example usage
acknowledge_result = acknowledge_alert(
    "alert-critical-001",
    assignee="john.doe@company.com",
    notes="Investigating memory usage patterns. Scaling up cluster resources."
)

resolve_result = resolve_alert(
    "alert-critical-001",
    resolution_notes="Added additional memory to cluster. Utilization now at 78%."
)

print(f"Alert acknowledged: {acknowledge_result['success']}")
print(f"Alert resolved: {resolve_result['success']}")

Bulk Alert Operations

Python
import requests

def bulk_alert_operations(alert_ids, action, **kwargs):
    """Perform bulk operations on multiple alerts"""
    data = {
        "alertIds": alert_ids,
        "action": action,
        **kwargs
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/bulk",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    
    return response.json()

# Bulk acknowledge alerts
alert_ids = ["alert-001", "alert-002", "alert-003"]
bulk_result = bulk_alert_operations(
    alert_ids,
    action="acknowledge",
    assignee="incident-team@company.com",
    notes="Bulk acknowledged for incident response"
)

print("Bulk operation results:")
for result in bulk_result['results']:
    print(f"  {result['alertId']}: {result['status']}")

Advanced Alert Features

Smart Alert Grouping

Group related alerts to reduce noise:
Python
import requests

def get_grouped_alerts(grouping_criteria="resource"):
    """Get alerts grouped by specified criteria"""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts/grouped",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={
            "groupBy": grouping_criteria,
            "status": "active",
            "timeRange": "24h"
        }
    )
    
    return response.json()

# Get alerts grouped by resource
grouped_alerts = get_grouped_alerts("resource")

print("📊 Grouped Alert Summary:")
print("=" * 30)

for group in grouped_alerts['groups']:
    alert_count = len(group['alerts'])
    severity_counts = {}
    
    for alert in group['alerts']:
        severity = alert['severity']
        severity_counts[severity] = severity_counts.get(severity, 0) + 1
    
    print(f"\n🔧 {group['groupKey']} ({alert_count} alerts)")
    
    for severity, count in severity_counts.items():
        severity_icon = {
            'critical': '🔴',
            'high': '🟠',
            'medium': '🟡',
            'low': '🟢'
        }.get(severity, '⚪')
        print(f"  {severity_icon} {severity}: {count}")
    
    # Show most critical alert in group
    critical_alerts = [a for a in group['alerts'] if a['severity'] == 'critical']
    if critical_alerts:
        alert = critical_alerts[0]
        print(f"  📍 Most critical: {alert['title']}")

Alert Correlation and Root Cause Analysis

Python
import requests

def get_alert_correlations(alert_id):
    """Get correlated alerts and potential root causes"""
    response = requests.get(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/correlations",
        headers={"Authorization": "Bearer YOUR_API_KEY"}
    )
    
    return response.json()

def analyze_root_cause(alert_id):
    """Analyze potential root causes for an alert"""
    correlations = get_alert_correlations(alert_id)
    
    print(f"🔍 Root Cause Analysis for Alert: {alert_id}")
    print("=" * 50)
    
    if 'rootCauses' in correlations:
        print("\n🎯 Potential Root Causes:")
        for cause in correlations['rootCauses'][:3]:  # Top 3
            confidence_bar = "█" * int(cause['confidence'] * 10)
            print(f"  {confidence_bar} {cause['confidence']:.0%} - {cause['description']}")
            if cause.get('evidence'):
                print(f"    Evidence: {cause['evidence']}")
    
    if 'correlatedAlerts' in correlations:
        print(f"\n🔗 Correlated Alerts ({len(correlations['correlatedAlerts'])}):")
        for corr_alert in correlations['correlatedAlerts'][:5]:
            print(f"  • {corr_alert['title']} (correlation: {corr_alert['correlationScore']:.2f})")
    
    if 'timeline' in correlations:
        print("\n⏰ Event Timeline:")
        for event in correlations['timeline']:
            print(f"  {event['timestamp']} - {event['description']}")

# Example usage
analyze_root_cause("alert-critical-001")

Predictive Alerting

Python
import requests

def get_predictive_alerts():
    """Get predictive alerts based on trends and anomalies"""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts/predictive",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={
            "predictionWindow": "2h",
            "confidenceThreshold": 0.7
        }
    )
    
    return response.json()

def display_predictive_alerts():
    """Display predictive alerts dashboard"""
    predictions = get_predictive_alerts()
    
    print("🔮 Predictive Alert Dashboard")
    print("=" * 35)
    
    if predictions['predictions']:
        print(f"\n⚠️  {len(predictions['predictions'])} Potential Issues Detected:")
        
        for prediction in predictions['predictions']:
            confidence_str = f"{prediction['confidence']:.0%}"
            eta = prediction['estimatedTimeToIssue']
            
            risk_icon = "🔴" if prediction['confidence'] > 0.9 else "🟡"
            
            print(f"\n{risk_icon} {prediction['title']} ({confidence_str} confidence)")
            print(f"   Resource: {prediction['resource']['name']}")
            print(f"   Estimated time to issue: {eta}")
            print(f"   Predicted impact: {prediction['predictedSeverity']}")
            
            if prediction['recommendations']:
                print(f"   💡 Preventive action: {prediction['recommendations'][0]}")
    else:
        print("\n✅ No potential issues detected in the next 2 hours")
    
    # Trend analysis
    if 'trends' in predictions:
        print(f"\n📈 Trend Analysis:")
        for trend in predictions['trends']:
            direction = "📈" if trend['direction'] == 'increasing' else "📉"
            print(f"  {direction} {trend['metric']}: {trend['description']}")

display_predictive_alerts()

Integration and Automation

Webhook Integration

Python
import requests

def setup_webhook_integration(webhook_url, events=None):
    """Set up webhook integration for alerts"""
    integration_data = {
        "type": "webhook",
        "name": "Alert Webhook Integration",
        "config": {
            "url": webhook_url,
            "method": "POST",
            "headers": {
                "Content-Type": "application/json",
                "Authorization": "Bearer YOUR_WEBHOOK_TOKEN"
            }
        },
        "events": events or [
            "alert.triggered",
            "alert.resolved",
            "alert.acknowledged"
        ],
        "filters": {
            "severity": ["critical", "high"],
            "category": ["performance", "availability"]
        }
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/integrations",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=integration_data
    )
    
    return response.json()

# Set up Slack webhook integration
slack_integration = setup_webhook_integration(
    "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
    events=["alert.triggered", "alert.resolved"]
)

print(f"Webhook integration created: {slack_integration['integrationId']}")

Automated Response Actions

Python
import requests

def create_automated_response(trigger_conditions, actions):
    """Create automated response for specific alert conditions"""
    automation_data = {
        "name": "Auto Scale on High CPU",
        "description": "Automatically scale clusters when CPU utilization is high",
        "enabled": True,
        "triggerConditions": trigger_conditions,
        "actions": actions,
        "cooldownPeriod": "10m"  # Prevent rapid successive triggers
    }
    
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/automations",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=automation_data
    )
    
    return response.json()

# Create auto-scaling automation
auto_scale_response = create_automated_response(
    trigger_conditions=[
        {
            "alertCategory": "performance",
            "metric": "cpu_utilization",
            "threshold": 85,
            "duration": "5m",
            "resourceType": "cluster"
        }
    ],
    actions=[
        {
            "type": "scale_cluster",
            "parameters": {
                "scaleDirection": "up",
                "scaleAmount": 1
            }
        },
        {
            "type": "notify",
            "parameters": {
                "channel": "slack",
                "message": "Auto-scaled cluster due to high CPU utilization"
            }
        }
    ]
)

print(f"Automation created: {auto_scale_response['automationId']}")

Best Practices

Alert Configuration

  • Meaningful Thresholds: Set thresholds based on actual impact, not arbitrary numbers
  • Appropriate Severity: Match severity to business impact
  • Clear Descriptions: Write clear, actionable alert descriptions
  • Proper Categorization: Use consistent categories for easy filtering

Alert Management

  • Timely Response: Acknowledge critical alerts within minutes (see the escalation sketch after this list)
  • Documentation: Document resolution steps for common issues
  • Post-Incident Reviews: Analyze alerts after incidents to improve detection
  • Regular Tuning: Regularly review and adjust alert thresholds
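
To keep response times honest, you can periodically look for critical alerts that remain unacknowledged and escalate them. The sketch below combines the list endpoint and the acknowledge action documented above; the 15-minute threshold and the escalation assignee are assumptions.
Python
from datetime import datetime, timezone
import requests

ALERTS_API = "https://api.tensorone.ai/v2/monitoring/alerts"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def escalate_stale_critical_alerts(max_unacked_minutes=15):
    """Acknowledge and reassign critical alerts that nobody has picked up."""
    response = requests.get(
        ALERTS_API,
        headers=HEADERS,
        params={"status": "active", "severity[]": "critical"},
    )
    response.raise_for_status()
    now = datetime.now(timezone.utc)

    for alert in response.json()["alerts"]:
        if alert["timestamps"]["acknowledged"] is not None:
            continue
        triggered = datetime.fromisoformat(
            alert["timestamps"]["triggered"].replace("Z", "+00:00")
        )
        age_minutes = (now - triggered).total_seconds() / 60
        if age_minutes > max_unacked_minutes:
            requests.post(
                f"{ALERTS_API}/{alert['alertId']}/action",
                headers={**HEADERS, "Content-Type": "application/json"},
                json={
                    "action": "acknowledge",
                    "assignee": "escalation-team@company.com",  # assumed escalation contact
                    "notes": f"Auto-escalated after {int(age_minutes)} minutes unacknowledged",
                },
            )

escalate_stale_critical_alerts()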

Noise Reduction

  • Alert Grouping: Group related alerts to reduce noise
  • Intelligent Suppression: Suppress redundant alerts during maintenance windows (see the sketch after this list)
  • Escalation Policies: Implement proper escalation for unhandled alerts
  • Regular Cleanup: Remove obsolete or ineffective alert rules
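
For planned maintenance, suppressing alerts up front is cleaner than acknowledging them afterwards. The sketch below assumes a hypothetical /monitoring/alerts/suppressions endpoint (this page documents the suppressed status but not the endpoint itself), so treat the URL and field names as placeholders for whatever suppression mechanism your account exposes.
Python
import requests

def suppress_during_maintenance(resource_id, start_iso, end_iso, reason):
    """Suppress alerts for a resource during a maintenance window (hypothetical endpoint)."""
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/suppressions",  # assumed endpoint
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json",
        },
        json={
            "resourceIds": [resource_id],
            "startTime": start_iso,
            "endTime": end_iso,
            "reason": reason,
        },
    )
    response.raise_for_status()
    return response.json()

suppress_during_maintenance(
    "cluster-gpu-a100-001",
    "2024-01-20T02:00:00Z",
    "2024-01-20T04:00:00Z",
    "Planned kernel upgrade",
)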
Alert data is retained for 90 days. Configure webhook integrations to maintain longer historical records in your external systems.
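
If you need history beyond that window and prefer pull over push, a small export job can archive alerts on a schedule. Here is a minimal sketch that appends the last day's resolved alerts to a local JSON Lines file, using only the list endpoint documented above; the file path and helper name are illustrative.
Python
import json
import requests

def archive_resolved_alerts(path="alerts-archive.jsonl"):
    """Append the last 24 hours of resolved alerts to a JSON Lines archive."""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={"status": "resolved", "timeRange": "24h", "limit": 500},
    )
    response.raise_for_status()
    with open(path, "a") as archive:
        for alert in response.json()["alerts"]:
            archive.write(json.dumps(alert) + "\n")

archive_resolved_alerts()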
Use alert correlation and predictive features to move from reactive to proactive monitoring. Focus on alerts that indicate real business impact rather than just technical metrics.