Alerts & Notifications Setup
Complete guide to setting up intelligent alerts and notifications in Nodewarden for proactive monitoring.
Proactive monitoring means being notified of issues before they impact your users. This comprehensive guide covers everything from basic alert setup to advanced notification strategies.
Alert Fundamentals
Understanding Alert Types
Threshold Alerts
Monitor when metrics cross predefined values (sample rules for each style follow this list):
- Static Thresholds: Fixed values (e.g., CPU > 80%)
- Dynamic Thresholds: Based on historical patterns
- Relative Thresholds: Compared to baseline (e.g., 50% above normal)
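For example, the three styles might be expressed like this, using the same illustrative YAML style as the rest of this guide (field names such as `Baseline` are assumptions, not confirmed Nodewarden syntax):
```yaml
# Static threshold: a fixed value
- Name: "High CPU"
  Metric: system.cpu.usage
  Condition: "> 80% for 5 minutes"

# Dynamic threshold: derived from recent history
- Name: "Unusual response time"
  Metric: app.response_time
  Baseline: Last 14 days, same hour of day
  Condition: "> 95th percentile of baseline"

# Relative threshold: compared to a baseline value
- Name: "Traffic spike"
  Metric: http.requests_per_second
  Condition: "> 150% of 7-day average"
```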
Anomaly Detection
AI-powered alerts that learn your normal patterns (a sample rule follows this list):
- Statistical Anomalies: Unusual deviations from normal behavior
- Seasonal Patterns: Account for daily/weekly/monthly cycles
- Multi-metric Correlations: Consider relationships between metrics
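A sketch of an anomaly-detection rule in the same illustrative style (the `Detection`, `Sensitivity`, and `Seasonality` fields are assumptions, not confirmed Nodewarden syntax):
```yaml
# Anomaly detection rule (field names are illustrative)
Name: "Login rate anomaly"
Metric: auth.logins_per_minute
Detection: anomaly
Sensitivity: medium              # a wider tolerance band means fewer alerts
Seasonality: [daily, weekly]     # account for normal cycles
Correlate With:
  - app.error_rate               # only alert when errors rise as well
```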
Composite Alerts
Complex conditions combining multiple factors (a sketch follows this list):
- AND Conditions: All conditions must be true
- OR Conditions: Any condition triggers the alert
- Time-based Logic: Consider duration and frequency
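A minimal composite-rule sketch, assuming an illustrative `Logic` field for AND/OR behavior:
```yaml
# Composite alert (illustrative syntax)
Name: "Checkout degradation"
Logic: AND                        # all conditions must be true
Conditions:
  - app.response_time 95th percentile > 2000ms for 5 minutes
  - app.error_rate > 2% for 5 minutes
Evaluation Window: 15 minutes     # conditions must overlap within this window
```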
Service Health Alerts
Monitor application and service availability (example checks follow this list):
- HTTP/HTTPS Monitoring: Website and API endpoint checks
- Port Monitoring: Service availability checks
- SSL Certificate Monitoring: Expiration warnings
- DNS Monitoring: Domain resolution checks
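The four check types might be declared together like this (an illustrative sketch; field names are not confirmed Nodewarden syntax):
```yaml
# Service health checks (illustrative)
Checks:
  - Type: https
    URL: https://app.yourcompany.com/health
    Expect: HTTP 200 within 2 seconds
  - Type: port
    Host: db-server-01
    Port: 5432
  - Type: ssl
    URL: https://app.yourcompany.com
    Warn Before Expiry: 30 days
  - Type: dns
    Record: app.yourcompany.com
    Expect: resolves to the known IP list
```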
Creating Your First Alert
Quick Setup Wizard
- Navigate to Alerts → Create New Alert
- Choose Alert Type: Start with "Threshold Alert"
- Select Metric: Choose what to monitor
- Set Conditions: Define when to trigger
- Configure Notifications: Choose how to be notified
- Test & Activate: Verify the alert works
Step-by-Step Configuration
Step 1: Choose What to Monitor
System Metrics:
```yaml
# CPU Usage Alert
Metric: system.cpu.usage
Host: web-server-01
Condition: Average over 5 minutes > 85%
```
Application Metrics:
```yaml
# Database Connection Alert
Metric: mysql.connections.active
Service: mysql
Condition: Current value > 90% of max_connections
```
Custom Metrics:
```yaml
# Application Response Time
Metric: app.response_time
Tag: environment=production
Condition: 95th percentile > 2000ms
```
Step 2: Define Alert Conditions
Basic Threshold:
- Metric: CPU Usage
- Condition: Greater than 80%
- Duration: For at least 5 minutes
- Evaluation: Every 1 minute
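Written in the illustrative YAML style used throughout this guide, the same rule looks like this:
```yaml
# Basic CPU threshold
Metric: system.cpu.usage
Condition: "> 80%"
Duration: 5 minutes
Evaluation Interval: 1 minute
```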
Advanced Conditions:
```yaml
# Multi-condition alert
Conditions:
  - CPU usage > 80% for 5 minutes
  - Memory usage > 90% for 3 minutes
  - Disk I/O wait > 50% for 2 minutes
Logic: Any condition (OR)
```
Step 3: Set Alert Severity
Severity Levels:
- Critical: Immediate action required (page on-call)
- Warning: Attention needed (email/Slack)
- Info: Informational (dashboard notification)
Severity Guidelines:
- Critical: Service down, data loss risk, security breach
- Warning: Performance degradation, capacity approaching limits
- Info: Maintenance events, configuration changes
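These severity levels usually map directly to notification channels. A routing sketch, with illustrative field names:
```yaml
# Severity-to-channel routing (illustrative)
Routing:
  Critical:
    - PagerDuty: production-oncall
    - SMS: primary-oncall
  Warning:
    - Slack: "#alerts-production"
    - Email: ops-team@yourcompany.com
  Info:
    - Dashboard: notification only
```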
Notification Channels
Email Notifications
Basic Email Setup
- Go to Settings → Notification Channels
- Click Add Email Channel
- Configure recipient settings:
yamlName: "Production Team" Recipients: - admin@yourcompany.com - oncall@yourcompany.com Format: HTML Include: Charts and context Frequency: Immediate
Email Templates
Customize email appearance and content (a template sketch follows this list):
- Subject Line: Include severity and host information
- Body Format: HTML with charts and links to dashboard
- Attachments: Include relevant metric screenshots
- Footer: Add runbook links and contact information
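A template sketch using the `{{...}}` variables shown in later examples; the exact variables available (such as `{{severity}}`) are assumptions:
```yaml
# Email template customization (variable names are illustrative)
Subject: "[{{severity}}] {{host}}: {{alert.title}}"
Body:
  Format: HTML
  Include: [metric chart, dashboard link]
Footer:
  Runbook: "https://wiki.yourcompany.com/runbooks/high-cpu"
  Contact: ops-team@yourcompany.com
```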
Slack Integration
Slack Setup
- Install Slack App: Add Nodewarden to your workspace
- Configure Webhook: Get webhook URL from Slack
- Create Channel: Add channel configuration in Nodewarden
yamlName: "alerts-production" Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK/URL Channel: #alerts-production Mention: @oncall Include: Host details and chart thumbnails Threading: Group related alerts
Slack Alert Features
- Rich Formatting: Color-coded messages by severity
- Chart Thumbnails: Visual context in notifications
- Quick Actions: Acknowledge or snooze from Slack
- Thread Grouping: Related alerts in threads
- User Mentions: Tag specific team members
Advanced Notification Channels
PagerDuty Integration
For critical alerts requiring immediate response:
- Create PagerDuty Service: Set up in PagerDuty console
- Get Integration Key: Copy from PagerDuty service
- Configure in Nodewarden:
```yaml
Service Name: "Production Infrastructure"
Integration Key: your-pagerduty-integration-key
Severity Mapping:
  Critical: P1 (High)
  Warning: P3 (Low)
Auto-resolve: Yes
```
Webhook Notifications
Integrate with any system via HTTP webhooks:
yamlName: "JIRA Ticket Creation" URL: https://your-company.atlassian.net/webhook Method: POST Headers: Authorization: "Bearer YOUR_TOKEN" Content-Type: "application/json" Payload: project: "INFRA" issuetype: "Bug" summary: "{{alert.title}}" description: "{{alert.description}}"
SMS Notifications
For critical alerts when email/Slack isn't enough:
```yaml
Provider: Twilio
Phone Numbers:
  - +1-555-123-4567 (Primary On-call)
  - +1-555-765-4321 (Secondary On-call)
Message Format: "CRITICAL: {{host}} - {{metric}} is {{value}}"
Rate Limiting: Max 5 per hour
```
Alert Rules Best Practices
Setting Effective Thresholds
Baseline Analysis
Before setting thresholds, understand normal behavior (a worked example follows these steps):
- Collect Historical Data: At least 2 weeks of normal operation
- Identify Patterns: Daily, weekly, seasonal variations
- Calculate Statistics: Mean, median, 95th percentile
- Set Appropriate Buffers: Account for normal variance
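For example, if two weeks of data show CPU averaging about 45% with a 95th percentile near 68%, a warning at 80% leaves headroom for normal variance. A baseline-informed rule might be sketched like this (the observed values and field names are illustrative):
```yaml
# Threshold derived from a two-week baseline (values are illustrative)
Metric: system.cpu.usage
Baseline Window: 14 days
Observed:
  Mean: 45%
  95th Percentile: 68%
Thresholds:
  Warning: 80%      # 95th percentile plus a buffer for normal variance
  Critical: 90%
```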
Threshold Guidelines
CPU Usage:
- Warning: 70-80% sustained for 5+ minutes
- Critical: 90%+ sustained for 2+ minutes
- Consider: Number of cores, expected load patterns
Memory Usage:
- Warning: 80-85% of available memory
- Critical: 95%+ or swap usage increasing
- Consider: Application memory patterns, caching
Disk Space:
- Warning: 80% of partition capacity
- Critical: 90%+ capacity or projected full in 24h
- Consider: Growth rate, cleanup processes
Response Time:
- Warning: 2x normal response time
- Critical: 5x normal or SLA violation
- Consider: User experience impact, business requirements
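Several of these guidelines can be combined in a single rule. The disk-space sketch below pairs a static capacity threshold with a growth projection (field names are illustrative):
```yaml
# Disk space rule combining capacity and growth rate (illustrative)
Metric: system.disk.used_percent
Partition: /var/lib/mysql
Conditions:
  Warning: "> 80%"
  Critical: "> 90% OR projected full within 24 hours"
Context: Include 7-day growth trend
```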
Preventing Alert Fatigue
Smart Alert Design
- Correlation: Group related alerts to avoid noise
- Dependencies: Don't alert on downstream effects
- Escalation: Start with warnings before criticals
- Auto-resolution: Resolve alerts when conditions clear
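A sketch of dependency suppression, grouping, and auto-resolution, assuming illustrative field names such as `Depends On` and `Group By`:
```yaml
# Dependency-aware alert with auto-resolution (illustrative)
Alert: "API error rate high"
Depends On:
  - "Database primary unreachable"   # suppress while the upstream alert is firing
Group By: [service, datacenter]      # one notification per group, not per host
Auto Resolve:
  After: conditions clear for 10 minutes
```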
Alert Suppression
```yaml
# Maintenance Window Suppression
Rules:
  - Name: "Weekly Maintenance"
    Schedule: "Sunday 02:00-04:00 UTC"
    Hosts: ["web-*", "db-*"]
    Alerts: All except Critical
  - Name: "Deployment Window"
    Trigger: Manual activation
    Duration: 30 minutes
    Hosts: ["production-app-*"]
    Alerts: ["High CPU", "Service Restart"]
```
Alert Tuning
Regular review and adjustment:
- Weekly Reviews: Check false positive rates
- Threshold Adjustments: Based on actual incidents
- Rule Optimization: Simplify complex conditions
- Feedback Loop: Include team input on alert quality
Advanced Alert Scenarios
Multi-Stage Alerting
Escalation Chains
Progressive notification strategy:
yamlAlert: "Database Connection Pool Exhausted" Stage 1: (Immediate) - Slack: #database-team - Email: dba-team@company.com Stage 2: (After 10 minutes if not acknowledged) - PagerDuty: Primary DBA - SMS: On-call engineer Stage 3: (After 30 minutes if not resolved) - Email: Engineering Manager - Slack: @channel in #incidents - Call: Director of Engineering
Business Hours vs After Hours
Different response strategies based on time:
```yaml
Business Hours (Mon-Fri 9AM-6PM):
  - Warning: Slack notification
  - Critical: Slack + Email
After Hours & Weekends:
  - Warning: Email only
  - Critical: PagerDuty + SMS
```
Application-Specific Alerts
Web Application Monitoring
```yaml
# HTTP Response Time Alert
Metric: http.response_time
URL: https://app.yourcompany.com/api/health
Locations: ["us-east", "eu-west", "asia-pacific"]
Conditions:
  Warning: "> 2 seconds from any location"
  Critical: "> 5 seconds or 404/500 errors"
Notifications:
  - Slack: "#web-team"
  - Email: frontend-devs@company.com
```
Database Performance
```yaml
# Database Slow Query Alert
Metric: mysql.slow_queries_per_second
Condition: "> 10 queries per second for 5 minutes"
Additional Context:
  - Include top 5 slow queries
  - Show current connection count
  - Display buffer pool hit ratio
Runbook: "https://wiki.company.com/db-performance"
```
Queue Monitoring
```yaml
# Background Job Queue Alert
Metric: queue.size
Queue: "email_notifications"
Conditions:
  Warning: "> 1000 jobs for 10 minutes"
  Critical: "> 5000 jobs or processing stopped"
Auto-scaling: Trigger additional workers at warning level
```
Infrastructure Alerts
Docker Container Monitoring
```yaml
# Container Health Alert
Conditions:
  - Container restart count > 3 in 1 hour
  - Container memory usage > 90%
  - Container not responding to health checks
Context: Include container logs and resource usage
Actions:
  - Attempt automatic restart
  - Scale up if resource constrained
  - Page DevOps team if automation fails
```
Kubernetes Cluster Alerts
```yaml
# Pod Crash Loop Alert
Metric: kubernetes.pod.restart_count
Condition: "> 5 restarts in 15 minutes"
Scope: namespace=production
Actions:
  - Collect pod logs and events
  - Check resource quotas
  - Verify configuration changes
  - Alert platform team
```
Alert Testing & Validation
Testing Your Alerts
Manual Testing
```bash
# Simulate high CPU usage
stress --cpu 8 --timeout 300s

# Fill disk space temporarily
dd if=/dev/zero of=/tmp/bigfile bs=1M count=1000

# Generate application errors
curl -X POST https://your-app.com/test/error-500
```
Automated Testing
```yaml
# Weekly Alert Test Schedule
Tests:
  - Name: "CPU Threshold Test"
    Schedule: "Sunday 3AM"
    Action: Simulate 85% CPU for 6 minutes
    Expected: Warning alert within 5 minutes
  - Name: "Memory Leak Simulation"
    Schedule: "Sunday 3:15AM"
    Action: Allocate memory gradually
    Expected: Warning at 80%, Critical at 95%
```
Alert Metrics & Monitoring
Alert Performance KPIs
Track these metrics to improve your alerting (illustrative targets follow this list):
- Mean Time to Detection (MTTD): How quickly alerts fire
- Mean Time to Acknowledgment (MTTA): Response time
- Mean Time to Resolution (MTTR): Time to fix issues
- False Positive Rate: Percentage of non-actionable alerts
- Coverage: Percentage of real issues that generate alerts
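If you want starting targets to review against, the values below are illustrative only; tune them to your own environment and SLAs:
```yaml
# Illustrative alert-quality targets (adjust to your environment)
Targets:
  MTTD: "< 5 minutes"
  MTTA: "< 15 minutes for Critical alerts"
  MTTR: "< 1 hour for Critical alerts"
  False Positive Rate: "< 10% of fired alerts"
  Coverage: "> 95% of real issues generate an alert"
```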
Alert Dashboard
Create a dashboard to monitor your monitoring:
```yaml
Widgets:
  - Alert Volume: Daily/weekly trends
  - Response Times: MTTA and MTTR trends
  - False Positive Rate: Track improvement over time
  - Top Alerting Hosts: Identify problematic systems
  - Notification Delivery: Success rates by channel
```
Troubleshooting Alerts
Common Issues
Alerts Not Firing
Symptoms: Expected alerts don't trigger
Solutions:
- Check alert rule syntax and conditions
- Verify metric data is being collected
- Confirm host/service filters are correct
- Test with simulated conditions
Too Many False Positives
Symptoms: Alerts for non-issues
Solutions:
- Analyze historical data to adjust thresholds
- Add duration requirements to rules
- Implement smart baselines instead of static values
- Use anomaly detection for complex patterns
Notification Delivery Issues
Symptoms: Alerts fire but notifications aren't received
Solutions:
- Verify notification channel configuration
- Check email/Slack delivery logs
- Test webhook endpoints manually
- Confirm user permissions and settings
Alert Storm
Symptoms: Hundreds of alerts in a short time
Solutions:
- Implement alert correlation and grouping
- Set up alert dependencies
- Use circuit breakers for cascade failures
- Enable maintenance mode during known issues
Getting Help
Support Resources
- Alert Logs: Check Alerts → History for delivery status
- Test Tools: Use built-in alert testing features
- Documentation: Comprehensive examples and templates
- Community: Share experiences and get advice
Professional Services
For complex alerting setups:
- Alert Strategy Consultation: Best practices for your environment
- Custom Integration Development: Specialized notification channels
- Alert Optimization: Reduce noise and improve effectiveness
- Training Services: Team education on effective alerting
Next Steps
Now that you have alerts configured:
- Custom Dashboards - Visualize your alert data
- Integration Guide - Connect alerts with your workflow
- Incident Response - Handle alerts effectively
- Advanced Monitoring - Sophisticated monitoring strategies
- Automation - Automate responses to common issues
Ready for advanced alerting strategies? Check out our Advanced Alerting Guide for machine learning-based detection and complex scenarios.