
Alerts & Notifications Setup

Complete guide to setting up intelligent alerts and notifications in Nodewarden for proactive monitoring.

Last updated: December 5, 2024

Proactive monitoring means being notified of issues before they impact your users. This comprehensive guide covers everything from basic alert setup to advanced notification strategies.

Alert Fundamentals

Understanding Alert Types

Threshold Alerts

Monitor when metrics cross predefined values:

  • Static Thresholds: Fixed values (e.g., CPU > 80%)
  • Dynamic Thresholds: Based on historical patterns
  • Relative Thresholds: Compared to baseline (e.g., 50% above normal)

Anomaly Detection

AI-powered alerts that learn your normal patterns:

  • Statistical Anomalies: Unusual deviations from normal behavior
  • Seasonal Patterns: Account for daily/weekly/monthly cycles
  • Multi-metric Correlations: Consider relationships between metrics

Composite Alerts

Complex conditions combining multiple factors:

  • AND Conditions: All conditions must be true
  • OR Conditions: Any condition triggers the alert
  • Time-based Logic: Consider duration and frequency

Service Health Alerts

Monitor application and service availability:

  • HTTP/HTTPS Monitoring: Website and API endpoint checks
  • Port Monitoring: Service availability checks
  • SSL Certificate Monitoring: Expiration warnings
  • DNS Monitoring: Domain resolution checks

Creating Your First Alert

Quick Setup Wizard

  1. Navigate to Alerts → Create New Alert
  2. Choose Alert Type: Start with "Threshold Alert"
  3. Select Metric: Choose what to monitor
  4. Set Conditions: Define when to trigger
  5. Configure Notifications: Choose how to be notified
  6. Test & Activate: Verify the alert works

Step-by-Step Configuration

Step 1: Choose What to Monitor

System Metrics:

yaml
# CPU Usage Alert
Metric: system.cpu.usage
Host: web-server-01
Condition: Average over 5 minutes > 85%

Application Metrics:

yaml
# Database Connection Alert
Metric: mysql.connections.active
Service: mysql
Condition: Current value > 90% of max_connections

Custom Metrics:

yaml
# Application Response Time
Metric: app.response_time
Tag: environment=production
Condition: 95th percentile > 2000ms

Step 2: Define Alert Conditions

Basic Threshold:

  • Metric: CPU Usage
  • Condition: Greater than 80%
  • Duration: For at least 5 minutes
  • Evaluation: Every 1 minute

Advanced Conditions:

yaml
# Multi-condition alert
Conditions:
  - CPU usage > 80% for 5 minutes
  - Memory usage > 90% for 3 minutes
  - Disk I/O wait > 50% for 2 minutes
Logic: Any condition (OR)

Step 3: Set Alert Severity

Severity Levels:

  • Critical: Immediate action required (page on-call)
  • Warning: Attention needed (email/Slack)
  • Info: Informational (dashboard notification)

Severity Guidelines:

  • Critical: Service down, data loss risk, security breach
  • Warning: Performance degradation, capacity approaching limits
  • Info: Maintenance events, configuration changes

Notification Channels

Email Notifications

Basic Email Setup

  1. Go to Settings → Notification Channels
  2. Click Add Email Channel
  3. Configure recipient settings:
yaml
Name: "Production Team"
Recipients:
  - admin@yourcompany.com
  - oncall@yourcompany.com
Format: HTML
Include: Charts and context
Frequency: Immediate

Email Templates

Customize email appearance and content:

  • Subject Line: Include severity and host information
  • Body Format: HTML with charts and links to dashboard
  • Attachments: Include relevant metric screenshots
  • Footer: Add runbook links and contact information

Slack Integration

Slack Setup

  1. Install Slack App: Add Nodewarden to your workspace
  2. Configure Webhook: Get webhook URL from Slack
  3. Create Channel: Add channel configuration in Nodewarden
yaml
Name: "alerts-production"
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
Channel: #alerts-production
Mention: @oncall
Include: Host details and chart thumbnails
Threading: Group related alerts
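
Once the channel is saved, you can confirm the webhook URL is reachable by posting a test message to it by hand. A minimal sketch using a standard Slack incoming-webhook payload (substitute your own webhook URL):

bash
# Send a test message straight to the Slack webhook (replace the URL with yours)
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"text": "Test message from Nodewarden alert channel setup"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/URL
# Slack replies with a plain "ok" when the payload is accepted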

Slack Alert Features

  • Rich Formatting: Color-coded messages by severity
  • Chart Thumbnails: Visual context in notifications
  • Quick Actions: Acknowledge or snooze from Slack
  • Thread Grouping: Related alerts in threads
  • User Mentions: Tag specific team members

Advanced Notification Channels

PagerDuty Integration

For critical alerts requiring immediate response:

  1. Create PagerDuty Service: Set up in PagerDuty console
  2. Get Integration Key: Copy from PagerDuty service
  3. Configure in Nodewarden:
yaml
Service Name: "Production Infrastructure"
Integration Key: your-pagerduty-integration-key
Severity Mapping:
  Critical: P1 (High)
  Warning: P3 (Low)
Auto-resolve: Yes

Webhook Notifications

Integrate with any system via HTTP webhooks:

yaml
Name: "JIRA Ticket Creation"
URL: https://your-company.atlassian.net/webhook
Method: POST
Headers:
  Authorization: "Bearer YOUR_TOKEN"
  Content-Type: "application/json"
Payload:
  project: "INFRA"
  issuetype: "Bug"
  summary: "{{alert.title}}"
  description: "{{alert.description}}"
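
Before relying on this channel for real alerts, it helps to post a hand-rendered payload to the endpoint and confirm it is accepted. The JSON below is a hypothetical expansion of the {{alert.title}} and {{alert.description}} placeholders, not an actual Nodewarden-generated payload:

bash
# Manually exercise the webhook endpoint with a sample payload
curl -s -X POST "https://your-company.atlassian.net/webhook" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"project": "INFRA", "issuetype": "Bug", "summary": "High CPU on web-server-01", "description": "CPU usage exceeded 85% for 5 minutes"}'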

SMS Notifications

For critical alerts when email/Slack isn't enough:

yaml
Provider: Twilio
Phone Numbers:
  - +1-555-123-4567 (Primary On-call)
  - +1-555-765-4321 (Secondary On-call)
Message Format: "CRITICAL: {{host}} - {{metric}} is {{value}}"
Rate Limiting: Max 5 per hour

Alert Rules Best Practices

Setting Effective Thresholds

Baseline Analysis

Before setting thresholds, understand normal behavior:

  1. Collect Historical Data: At least 2 weeks of normal operation
  2. Identify Patterns: Daily, weekly, seasonal variations
  3. Calculate Statistics: Mean, median, 95th percentile (see the sketch after this list)
  4. Set Appropriate Buffers: Account for normal variance
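
As a rough illustration of step 3, the pipeline below derives the mean and 95th percentile from a plain text file containing one metric sample per line. The file name and export format are assumptions; use whatever your metrics pipeline can produce.

bash
# cpu_samples.txt: one CPU-usage percentage per line (hypothetical export)
sort -n cpu_samples.txt | awk '{ a[NR] = $1; sum += $1 }
  END { printf "mean=%.1f  p95=%.1f\n", sum / NR, a[int(NR * 0.95)] }'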

Threshold Guidelines

CPU Usage:

  • Warning: 70-80% sustained for 5+ minutes
  • Critical: 90%+ sustained for 2+ minutes
  • Consider: Number of cores, expected load patterns

Memory Usage:

  • Warning: 80-85% of available memory
  • Critical: 95%+ or swap usage increasing
  • Consider: Application memory patterns, caching

Disk Space:

  • Warning: 80% of partition capacity
  • Critical: 90%+ capacity or projected full in 24h (projection sketch below)
  • Consider: Growth rate, cleanup processes
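
The "projected full in 24h" condition boils down to available space divided by growth rate. A minimal sketch, assuming you have already measured the partition's growth rate from your own metrics:

bash
# Hours until /var fills, assuming an observed growth rate of 512 MB/hour (hypothetical value)
AVAIL_KB=$(df --output=avail /var | tail -1)
GROWTH_KB_PER_HOUR=$((512 * 1024))
echo "Hours until full: $((AVAIL_KB / GROWTH_KB_PER_HOUR))"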

Response Time:

  • Warning: 2x normal response time
  • Critical: 5x normal or SLA violation
  • Consider: User experience impact, business requirements

Preventing Alert Fatigue

Smart Alert Design

  • Correlation: Group related alerts to avoid noise
  • Dependencies: Don't alert on downstream effects
  • Escalation: Start with warnings before criticals
  • Auto-resolution: Resolve alerts when conditions clear

Alert Suppression

yaml
# Maintenance Window
Suppression Rules:
  - Name: "Weekly Maintenance"
    Schedule: "Sunday 02:00-04:00 UTC"
    Hosts: ["web-*", "db-*"]
    Alerts: All except Critical

  - Name: "Deployment Window"
    Trigger: Manual activation
    Duration: 30 minutes
    Hosts: ["production-app-*"]
    Alerts: ["High CPU", "Service Restart"]

Alert Tuning

Regular review and adjustment:

  • Weekly Reviews: Check false positive rates
  • Threshold Adjustments: Based on actual incidents
  • Rule Optimization: Simplify complex conditions
  • Feedback Loop: Include team input on alert quality

Advanced Alert Scenarios

Multi-Stage Alerting

Escalation Chains

Progressive notification strategy:

yaml
Alert: "Database Connection Pool Exhausted"
Stage 1: (Immediate)
  - Slack: #database-team
  - Email: dba-team@company.com

Stage 2: (After 10 minutes if not acknowledged)
  - PagerDuty: Primary DBA
  - SMS: On-call engineer

Stage 3: (After 30 minutes if not resolved)
  - Email: Engineering Manager
  - Slack: @channel in #incidents
  - Call: Director of Engineering

Business Hours vs After Hours

Different response strategies based on time:

yaml
Business Hours: (Mon-Fri 9AM-6PM)
  - Warning: Slack notification
  - Critical: Slack + Email

After Hours & Weekends:
  - Warning: Email only
  - Critical: PagerDuty + SMS

Application-Specific Alerts

Web Application Monitoring

yaml
# HTTP Response Time Alert
Metric: http.response_time
URL: https://app.yourcompany.com/api/health
Locations: ["us-east", "eu-west", "asia-pacific"]
Conditions:
  Warning: > 2 seconds from any location
  Critical: > 5 seconds or 404/500 errors
Notifications:
  - Slack: #web-team
  - Email: frontend-devs@company.com
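
When tuning these thresholds, it can help to spot-check the endpoint from a shell using curl's built-in timing variables:

bash
# Print the HTTP status and total response time for the health endpoint
curl -s -o /dev/null -w "status=%{http_code} time=%{time_total}s\n" \
  https://app.yourcompany.com/api/health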

Database Performance

yaml
# Database Slow Query Alert
Metric: mysql.slow_queries_per_second
Condition: > 10 queries per second for 5 minutes
Additional Context:
  - Include top 5 slow queries
  - Show current connection count
  - Display buffer pool hit ratio
Runbook: "https://wiki.company.com/db-performance"

Queue Monitoring

yaml
# Background Job Queue Alert
Metric: queue.size
Queue: "email_notifications"
Conditions:
  Warning: > 1000 jobs for 10 minutes
  Critical: > 5000 jobs or processing stopped
Auto-scaling: Trigger additional workers at warning level

Infrastructure Alerts

Docker Container Monitoring

yaml
# Container Health Alert
Conditions:
  - Container restart count > 3 in 1 hour
  - Container memory usage > 90%
  - Container not responding to health checks
Context: Include container logs and resource usage
Actions:
  - Attempt automatic restart
  - Scale up if resource constrained
  - Page DevOps team if automation fails
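
While tuning this alert, the same signals can be checked by hand with the standard Docker CLI (the container name below is a placeholder):

bash
# Restart count recorded by the Docker engine for a container
docker inspect --format '{{.RestartCount}}' my-app-container

# One-shot snapshot of the container's CPU and memory usage
docker stats --no-stream my-app-container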

Kubernetes Cluster Alerts

yaml
# Pod Crash Loop Alert
Metric: kubernetes.pod.restart_count
Condition: > 5 restarts in 15 minutes
Scope: namespace=production
Actions:
  - Collect pod logs and events
  - Check resource quotas
  - Verify configuration changes
  - Alert platform team
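
When this alert fires, the equivalent manual checks with kubectl look roughly like this (the pod name is a placeholder):

bash
# Pods in the production namespace, sorted by restart count
kubectl get pods -n production \
  --sort-by='.status.containerStatuses[0].restartCount'

# Events and logs from the previous (crashed) container instance
kubectl describe pod my-app-pod -n production
kubectl logs my-app-pod -n production --previous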

Alert Testing & Validation

Testing Your Alerts

Manual Testing

bash
# Simulate high CPU usage (requires the "stress" utility)
stress --cpu 8 --timeout 300s

# Fill disk space temporarily; remove /tmp/bigfile afterwards
dd if=/dev/zero of=/tmp/bigfile bs=1M count=1000

# Generate application errors
curl -X POST https://your-app.com/test/error-500

Automated Testing

yaml
# Weekly Alert Test Schedule
Tests:
  - Name: "CPU Threshold Test"
    Schedule: "Sunday 3AM"
    Action: Simulate 85% CPU for 6 minutes
    Expected: Warning alert within 5 minutes

  - Name: "Memory Leak Simulation"
    Schedule: "Sunday 3:15AM"
    Action: Allocate memory gradually
    Expected: Warning at 80%, Critical at 95%

Alert Metrics & Monitoring

Alert Performance KPIs

Track these metrics to improve your alerting:

  • Mean Time to Detection (MTTD): How quickly alerts fire
  • Mean Time to Acknowledgment (MTTA): Response time (sample calculation after this list)
  • Mean Time to Resolution (MTTR): Time to fix issues
  • False Positive Rate: Percentage of non-actionable alerts
  • Coverage: Percentage of real issues that generate alerts
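
For example, if you export alert history as a CSV with epoch-second fired and acknowledged timestamps (this column layout is an assumption for illustration, not a documented Nodewarden export), MTTA reduces to a one-liner:

bash
# alert_history.csv (hypothetical): header row, then fired_at,acked_at in epoch seconds
awk -F, 'NR > 1 { sum += $2 - $1; n++ }
  END { if (n) printf "MTTA: %.1f minutes\n", sum / n / 60 }' alert_history.csv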

Alert Dashboard

Create a dashboard to monitor your monitoring:

yaml
Widgets:
  - Alert Volume: Daily/weekly trends
  - Response Times: MTTA and MTTR trends
  - False Positive Rate: Track improvement over time
  - Top Alerting Hosts: Identify problematic systems
  - Notification Delivery: Success rates by channel

Troubleshooting Alerts

Common Issues

Alerts Not Firing

Symptoms: Expected alerts don't trigger

Solutions:

  1. Check alert rule syntax and conditions
  2. Verify metric data is being collected
  3. Confirm host/service filters are correct
  4. Test with simulated conditions

Too Many False Positives

Symptoms: Alerts for non-issues

Solutions:

  1. Analyze historical data to adjust thresholds
  2. Add duration requirements to rules
  3. Implement smart baselines instead of static values
  4. Use anomaly detection for complex patterns

Notification Delivery Issues

Symptoms: Alerts fire but notifications aren't received

Solutions:

  1. Verify notification channel configuration
  2. Check email/Slack delivery logs
  3. Test webhook endpoints manually
  4. Confirm user permissions and settings

Alert Storm

Symptoms: Hundreds of alerts in a short time

Solutions:

  1. Implement alert correlation and grouping
  2. Set up alert dependencies
  3. Use circuit breakers for cascade failures
  4. Enable maintenance mode during known issues

Getting Help

Support Resources

  • Alert Logs: Check Alerts → History for delivery status
  • Test Tools: Use built-in alert testing features
  • Documentation: Comprehensive examples and templates
  • Community: Share experiences and get advice

Professional Services

For complex alerting setups:

  • Alert Strategy Consultation: Best practices for your environment
  • Custom Integration Development: Specialized notification channels
  • Alert Optimization: Reduce noise and improve effectiveness
  • Training Services: Team education on effective alerting

Next Steps

Now that you have alerts configured:

  1. Custom Dashboards - Visualize your alert data
  2. Integration Guide - Connect alerts with your workflow
  3. Incident Response - Handle alerts effectively
  4. Advanced Monitoring - Sophisticated monitoring strategies
  5. Automation - Automate responses to common issues

Ready for advanced alerting strategies? Check out our Advanced Alerting Guide for machine learning-based detection and complex scenarios.
