Alerts & Notifications Setup
Complete guide to setting up intelligent alerts and notifications in Nodewarden for proactive monitoring.
Proactive monitoring means being notified of issues before they impact your users. This comprehensive guide covers everything from basic alert setup to advanced notification strategies.
Alert Fundamentals
Understanding Alert Types
Threshold Alerts
Monitor when metrics cross predefined values (sample rules for each style follow this list):
- Static Thresholds: Fixed values (e.g., CPU > 80%)
- Dynamic Thresholds: Based on historical patterns
- Relative Thresholds: Compared to baseline (e.g., 50% above normal)
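For example, the three styles might be expressed like this, using the same illustrative YAML style as the rest of this guide (field names such as `Baseline` are assumptions, not confirmed Nodewarden syntax):
```yaml
# Static threshold: a fixed value
- Name: "High CPU"
  Metric: system.cpu.usage
  Condition: "> 80% for 5 minutes"

# Dynamic threshold: derived from recent history
- Name: "Unusual response time"
  Metric: app.response_time
  Baseline: Last 14 days, same hour of day
  Condition: "> 95th percentile of baseline"

# Relative threshold: compared to a baseline value
- Name: "Traffic spike"
  Metric: http.requests_per_second
  Condition: "> 150% of 7-day average"
```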
Anomaly Detection
AI-powered alerts that learn your normal patterns (a sample rule follows this list):
- Statistical Anomalies: Unusual deviations from normal behavior
- Seasonal Patterns: Account for daily/weekly/monthly cycles
- Multi-metric Correlations: Consider relationships between metrics
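A sketch of an anomaly-detection rule in the same illustrative style (the `Detection`, `Sensitivity`, and `Seasonality` fields are assumptions, not confirmed Nodewarden syntax):
```yaml
# Anomaly detection rule (field names are illustrative)
Name: "Login rate anomaly"
Metric: auth.logins_per_minute
Detection: anomaly
Sensitivity: medium              # a wider tolerance band means fewer alerts
Seasonality: [daily, weekly]     # account for normal cycles
Correlate With:
  - app.error_rate               # only alert when errors rise as well
```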
Composite Alerts
Complex conditions combining multiple factors (a sketch follows this list):
- AND Conditions: All conditions must be true
- OR Conditions: Any condition triggers the alert
- Time-based Logic: Consider duration and frequency
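A minimal composite-rule sketch, assuming an illustrative `Logic` field for AND/OR behavior:
```yaml
# Composite alert (illustrative syntax)
Name: "Checkout degradation"
Logic: AND                        # all conditions must be true
Conditions:
  - app.response_time 95th percentile > 2000ms for 5 minutes
  - app.error_rate > 2% for 5 minutes
Evaluation Window: 15 minutes     # conditions must overlap within this window
```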
Service Health Alerts
Monitor application and service availability (example checks follow this list):
- HTTP/HTTPS Monitoring: Website and API endpoint checks
- Port Monitoring: Service availability checks
- SSL Certificate Monitoring: Expiration warnings
- DNS Monitoring: Domain resolution checks
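The four check types might be declared together like this (an illustrative sketch; field names are not confirmed Nodewarden syntax):
```yaml
# Service health checks (illustrative)
Checks:
  - Type: https
    URL: https://app.yourcompany.com/health
    Expect: HTTP 200 within 2 seconds
  - Type: port
    Host: db-server-01
    Port: 5432
  - Type: ssl
    URL: https://app.yourcompany.com
    Warn Before Expiry: 30 days
  - Type: dns
    Record: app.yourcompany.com
    Expect: resolves to the known IP list
```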
Creating Your First Alert
Quick Setup Wizard
- Navigate to Alerts → Create New Alert
- Choose Alert Type: Start with "Threshold Alert"
- Select Metric: Choose what to monitor
- Set Conditions: Define when to trigger
- Configure Notifications: Choose how to be notified
- Test & Activate: Verify the alert works
Step-by-Step Configuration
Step 1: Choose What to Monitor
System Metrics:
```yaml
# CPU Usage Alert
Metric: system.cpu.usage
Host: web-server-01
Condition: Average over 5 minutes > 85%
```
Application Metrics:
```yaml
# Database Connection Alert
Metric: mysql.connections.active
Service: mysql
Condition: Current value > 90% of max_connections
```
Custom Metrics:
```yaml
# Application Response Time
Metric: app.response_time
Tag: environment=production
Condition: 95th percentile > 2000ms
```
Step 2: Define Alert Conditions
Basic Threshold:
- Metric: CPU Usage
- Condition: Greater than 80%
- Duration: For at least 5 minutes
- Evaluation: Every 1 minute
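Written in the illustrative YAML style used throughout this guide, the same rule looks like this:
```yaml
# Basic CPU threshold
Metric: system.cpu.usage
Condition: "> 80%"
Duration: 5 minutes
Evaluation Interval: 1 minute
```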
Advanced Conditions:
```yaml
# Multi-condition alert
Conditions:
  - CPU usage > 80% for 5 minutes
  - Memory usage > 90% for 3 minutes
  - Disk I/O wait > 50% for 2 minutes
Logic: Any condition (OR)
```
Step 3: Set Alert Severity
Severity Levels:
- Critical: Immediate action required (page on-call)
- Warning: Attention needed (email/Slack)
- Info: Informational (dashboard notification)
Severity Guidelines:
- Critical: Service down, data loss risk, security breach
- Warning: Performance degradation, capacity approaching limits
- Info: Maintenance events, configuration changes
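These severity levels usually map directly to notification channels. A routing sketch, with illustrative field names:
```yaml
# Severity-to-channel routing (illustrative)
Routing:
  Critical:
    - PagerDuty: production-oncall
    - SMS: primary-oncall
  Warning:
    - Slack: "#alerts-production"
    - Email: ops-team@yourcompany.com
  Info:
    - Dashboard: notification only
```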
Notification Channels
Email Notifications
Basic Email Setup
- Go to Settings → Notification Channels
- Click Add Email Channel
- Configure recipient settings:
yamlName: "Production Team" Recipients: - admin@yourcompany.com - oncall@yourcompany.com Format: HTML Include: Charts and context Frequency: Immediate
Email Templates
Customize email appearance and content (a template sketch follows this list):
- Subject Line: Include severity and host information
- Body Format: HTML with charts and links to dashboard
- Attachments: Include relevant metric screenshots
- Footer: Add runbook links and contact information
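A template sketch using the `{{...}}` variables shown in later examples; the exact variables available (such as `{{severity}}`) are assumptions:
```yaml
# Email template customization (variable names are illustrative)
Subject: "[{{severity}}] {{host}}: {{alert.title}}"
Body:
  Format: HTML
  Include: [metric chart, dashboard link]
Footer:
  Runbook: "https://wiki.yourcompany.com/runbooks/high-cpu"
  Contact: ops-team@yourcompany.com
```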
Slack Integration
Slack Setup
- Install Slack App: Add Nodewarden to your workspace
- Configure Webhook: Get webhook URL from Slack
- Create Channel: Add channel configuration in Nodewarden
yamlName: "alerts-production" Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK/URL Channel: #alerts-production Mention: @oncall Include: Host details and chart thumbnails Threading: Group related alerts
Slack Alert Features
- Rich Formatting: Color-coded messages by severity
- Chart Thumbnails: Visual context in notifications
- Quick Actions: Acknowledge or snooze from Slack
- Thread Grouping: Related alerts in threads
- User Mentions: Tag specific team members
Advanced Notification Channels
PagerDuty Integration
For critical alerts requiring immediate response:
- Create PagerDuty Service: Set up in PagerDuty console
- Get Integration Key: Copy from PagerDuty service
- Configure in Nodewarden:
```yaml
Service Name: "Production Infrastructure"
Integration Key: your-pagerduty-integration-key
Severity Mapping:
  Critical: P1 (High)
  Warning: P3 (Low)
Auto-resolve: Yes
```
Webhook Notifications
Integrate with any system via HTTP webhooks:
yamlName: "JIRA Ticket Creation" URL: https://your-company.atlassian.net/webhook Method: POST Headers: Authorization: "Bearer YOUR_TOKEN" Content-Type: "application/json" Payload: project: "INFRA" issuetype: "Bug" summary: "{{alert.title}}" description: "{{alert.description}}"
SMS Notifications
For critical alerts when email/Slack isn't enough:
```yaml
Provider: Twilio
Phone Numbers:
  - +1-555-123-4567 (Primary On-call)
  - +1-555-765-4321 (Secondary On-call)
Message Format: "CRITICAL: {{host}} - {{metric}} is {{value}}"
Rate Limiting: Max 5 per hour
```
Alert Rules Best Practices
Setting Effective Thresholds
Baseline Analysis
Before setting thresholds, understand normal behavior (a worked example follows these steps):
- Collect Historical Data: At least 2 weeks of normal operation
- Identify Patterns: Daily, weekly, seasonal variations
- Calculate Statistics: Mean, median, 95th percentile
- Set Appropriate Buffers: Account for normal variance
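For example, if two weeks of data show CPU averaging about 45% with a 95th percentile near 68%, a warning at 80% leaves headroom for normal variance. A baseline-informed rule might be sketched like this (the observed values and field names are illustrative):
```yaml
# Threshold derived from a two-week baseline (values are illustrative)
Metric: system.cpu.usage
Baseline Window: 14 days
Observed:
  Mean: 45%
  95th Percentile: 68%
Thresholds:
  Warning: 80%      # 95th percentile plus a buffer for normal variance
  Critical: 90%
```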
Threshold Guidelines
CPU Usage:
- Warning: 70-80% sustained for 5+ minutes
- Critical: 90%+ sustained for 2+ minutes
- Consider: Number of cores, expected load patterns
Memory Usage:
- Warning: 80-85% of available memory
- Critical: 95%+ or swap usage increasing
- Consider: Application memory patterns, caching
Disk Space:
- Warning: 80% of partition capacity
- Critical: 90%+ capacity or projected full in 24h
- Consider: Growth rate, cleanup processes
Response Time:
- Warning: 2x normal response time
- Critical: 5x normal or SLA violation
- Consider: User experience impact, business requirements
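Several of these guidelines can be combined in a single rule. The disk-space sketch below pairs a static capacity threshold with a growth projection (field names are illustrative):
```yaml
# Disk space rule combining capacity and growth rate (illustrative)
Metric: system.disk.used_percent
Partition: /var/lib/mysql
Conditions:
  Warning: "> 80%"
  Critical: "> 90% OR projected full within 24 hours"
Context: Include 7-day growth trend
```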
Preventing Alert Fatigue
Smart Alert Design
- Correlation: Group related alerts to avoid noise
- Dependencies: Don't alert on downstream effects
- Escalation: Start with warnings before criticals
- Auto-resolution: Resolve alerts when conditions clear
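A sketch of dependency suppression, grouping, and auto-resolution, assuming illustrative field names such as `Depends On` and `Group By`:
```yaml
# Dependency-aware alert with auto-resolution (illustrative)
Alert: "API error rate high"
Depends On:
  - "Database primary unreachable"   # suppress while the upstream alert is firing
Group By: [service, datacenter]      # one notification per group, not per host
Auto Resolve:
  After: conditions clear for 10 minutes
```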
Alert Suppression
```yaml
# Maintenance Window Suppression
Rules:
  - Name: "Weekly Maintenance"
    Schedule: "Sunday 02:00-04:00 UTC"
    Hosts: ["web-*", "db-*"]
    Alerts: All except Critical
  - Name: "Deployment Window"
    Trigger: Manual activation
    Duration: 30 minutes
    Hosts: ["production-app-*"]
    Alerts: ["High CPU", "Service Restart"]
```
Alert Tuning
Regular review and adjustment:
- Weekly Reviews: Check false positive rates
- Threshold Adjustments: Based on actual incidents
- Rule Optimization: Simplify complex conditions
- Feedback Loop: Include team input on alert quality
Advanced Alert Scenarios
Multi-Stage Alerting
Escalation Chains
Progressive notification strategy:
yamlAlert: "Database Connection Pool Exhausted" Stage 1: (Immediate) - Slack: #database-team - Email: dba-team@company.com Stage 2: (After 10 minutes if not acknowledged) - PagerDuty: Primary DBA - SMS: On-call engineer Stage 3: (After 30 minutes if not resolved) - Email: Engineering Manager - Slack: @channel in #incidents - Call: Director of Engineering
Business Hours vs After Hours
Different response strategies based on time:
```yaml
Business Hours (Mon-Fri 9AM-6PM):
  - Warning: Slack notification
  - Critical: Slack + Email
After Hours & Weekends:
  - Warning: Email only
  - Critical: PagerDuty + SMS
```
Application-Specific Alerts
Web Application Monitoring
```yaml
# HTTP Response Time Alert
Metric: http.response_time
URL: https://app.yourcompany.com/api/health
Locations: ["us-east", "eu-west", "asia-pacific"]
Conditions:
  Warning: "> 2 seconds from any location"
  Critical: "> 5 seconds or 404/500 errors"
Notifications:
  - Slack: "#web-team"
  - Email: frontend-devs@company.com
```
Database Performance
```yaml
# Database Slow Query Alert
Metric: mysql.slow_queries_per_second
Condition: "> 10 queries per second for 5 minutes"
Additional Context:
  - Include top 5 slow queries
  - Show current connection count
  - Display buffer pool hit ratio
Runbook: "https://wiki.company.com/db-performance"
```
Queue Monitoring
```yaml
# Background Job Queue Alert
Metric: queue.size
Queue: "email_notifications"
Conditions:
  Warning: "> 1000 jobs for 10 minutes"
  Critical: "> 5000 jobs or processing stopped"
Auto-scaling: Trigger additional workers at warning level
```
Infrastructure Alerts
Docker Container Monitoring
```yaml
# Container Health Alert
Conditions:
  - Container restart count > 3 in 1 hour
  - Container memory usage > 90%
  - Container not responding to health checks
Context: Include container logs and resource usage
Actions:
  - Attempt automatic restart
  - Scale up if resource constrained
  - Page DevOps team if automation fails
```
Kubernetes Cluster Alerts
```yaml
# Pod Crash Loop Alert
Metric: kubernetes.pod.restart_count
Condition: "> 5 restarts in 15 minutes"
Scope: namespace=production
Actions:
  - Collect pod logs and events
  - Check resource quotas
  - Verify configuration changes
  - Alert platform team
```
Alert Testing & Validation
Testing Your Alerts
Manual Testing
```bash
# Simulate high CPU usage
stress --cpu 8 --timeout 300s

# Fill disk space temporarily
dd if=/dev/zero of=/tmp/bigfile bs=1M count=1000

# Generate application errors
curl -X POST https://your-app.com/test/error-500
```
Automated Testing
```yaml
# Weekly Alert Test Schedule
Tests:
  - Name: "CPU Threshold Test"
    Schedule: "Sunday 3AM"
    Action: Simulate 85% CPU for 6 minutes
    Expected: Warning alert within 5 minutes
  - Name: "Memory Leak Simulation"
    Schedule: "Sunday 3:15AM"
    Action: Allocate memory gradually
    Expected: Warning at 80%, Critical at 95%
```
Alert Metrics & Monitoring
Alert Performance KPIs
Track these metrics to improve your alerting (illustrative targets follow this list):
- Mean Time to Detection (MTTD): How quickly alerts fire
- Mean Time to Acknowledgment (MTTA): Response time
- Mean Time to Resolution (MTTR): Time to fix issues
- False Positive Rate: Percentage of non-actionable alerts
- Coverage: Percentage of real issues that generate alerts
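If you want starting targets to review against, the values below are illustrative only; tune them to your own environment and SLAs:
```yaml
# Illustrative alert-quality targets (adjust to your environment)
Targets:
  MTTD: "< 5 minutes"
  MTTA: "< 15 minutes for Critical alerts"
  MTTR: "< 1 hour for Critical alerts"
  False Positive Rate: "< 10% of fired alerts"
  Coverage: "> 95% of real issues generate an alert"
```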
Alert Dashboard
Create a dashboard to monitor your monitoring:
```yaml
Widgets:
  - Alert Volume: Daily/weekly trends
  - Response Times: MTTA and MTTR trends
  - False Positive Rate: Track improvement over time
  - Top Alerting Hosts: Identify problematic systems
  - Notification Delivery: Success rates by channel
```
Troubleshooting Alerts
Common Issues
Alerts Not Firing
Symptoms: Expected alerts don't trigger
Solutions:
- Check alert rule syntax and conditions
- Verify metric data is being collected
- Confirm host/service filters are correct
- Test with simulated conditions
Too Many False Positives
Symptoms: Alerts for non-issues
Solutions:
- Analyze historical data to adjust thresholds
- Add duration requirements to rules
- Implement smart baselines instead of static values
- Use anomaly detection for complex patterns
Notification Delivery Issues
Symptoms: Alerts fire but notifications aren't received
Solutions:
- Verify notification channel configuration
- Check email/Slack delivery logs
- Test webhook endpoints manually
- Confirm user permissions and settings
Alert Storm
Symptoms: Hundreds of alerts in a short time
Solutions:
- Implement alert correlation and grouping
- Set up alert dependencies
- Use circuit breakers for cascade failures
- Enable maintenance mode during known issues
Getting Help
Support Resources
- Alert Logs: Check Alerts → History for delivery status
- Test Tools: Use built-in alert testing features
- Documentation: Comprehensive examples and templates
- Community: Share experiences and get advice
Professional Services
For complex alerting setups:
- Alert Strategy Consultation: Best practices for your environment
- Custom Integration Development: Specialized notification channels
- Alert Optimization: Reduce noise and improve effectiveness
- Training Services: Team education on effective alerting
Next Steps
Now that you have alerts configured:
- Custom Dashboards - Visualize your alert data
- Integration Guide - Connect alerts with your workflow
- Incident Response - Handle alerts effectively
- Advanced Monitoring - Sophisticated monitoring strategies
- Automation - Automate responses to common issues
Ready for advanced alerting strategies? Check out our Advanced Alerting Guide for machine learning-based detection and complex scenarios.