Monitoring Services
Overview
The monitoring stack consists of Prometheus for metrics collection and Grafana for visualization. Together they provide monitoring, alerting, and observability for the SSA backend application.
What These Services Do
- Prometheus: Collects, stores, and queries time-series metrics from all services
- Grafana: Provides rich dashboards and visualizations for monitoring data
- Alerting: Configurable alerts for service health, performance, and errors
- Metrics Storage: Long-term storage of historical metrics data
- Service Discovery: Automatic discovery of services and endpoints
How It Works
- Metrics Collection: Prometheus scrapes metrics from all services
- Data Storage: Time-series data stored in Prometheus
- Visualization: Grafana queries Prometheus for dashboard data
- Alerting: Prometheus rules trigger alerts based on thresholds
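- Service Discovery: Prometheus resolves scrape targets by service name from its scrape configuration (the configuration in this document lists them under static_configs)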
Core Monitoring Endpoints
Health Check
Endpoint: GET /health
Description: Comprehensive health check endpoint that provides:
- Application status
- Database connectivity status
- System information
- Service health indicators
Authentication: Required (JWT Bearer Token)
Response Example:
{
  "status": "healthy",
  "timestamp": 1705315200.0,
  "version": "1.0.0",
  "services": {
    "api": "healthy",
    "database": "healthy",
    "logging": "healthy"
  },
  "system": {
    "hostname": "server-hostname",
    "platform": "Linux",
    "python_version": "3.11.0"
  }
}
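A client-side check against this endpoint could look like the following. This is a minimal sketch, not the project's own client: the function names are hypothetical, the base URL is whatever address the API is reachable at, and a valid JWT is assumed to be available already.

```python
import json
import urllib.request


def unhealthy_services(health: dict) -> list[str]:
    """Return the names of services whose status is not "healthy"."""
    services = health.get("services", {})
    return sorted(name for name, status in services.items() if status != "healthy")


def fetch_health(base_url: str, token: str, timeout: float = 5.0) -> dict:
    """GET /health with a JWT bearer token and decode the JSON body.

    base_url and token are supplied by the caller (assumptions of this sketch).
    """
    request = urllib.request.Request(
        f"{base_url}/health",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A monitoring script would call fetch_health and then pass the result to unhealthy_services; an empty list means every entry under "services" reported "healthy".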
Metrics Endpoint
Endpoint: GET /metrics
Description: Prometheus metrics endpoint exposing:
- HTTP request metrics
- Performance metrics
- System resource metrics
- User activity metrics
- Database metrics
Authentication: Not required (public endpoint)
Use Cases:
- Prometheus scraping for metrics collection
- Grafana dashboard data sources
- Custom monitoring scripts
- Performance analysis
Debug Endpoints
Endpoint: GET /debug/active-users
Description: Debug information for the active-user tracking system.
Authentication: Required (JWT Bearer Token)
Response Example:
{
  "total_active_users": 25,
  "cleanup_interval_seconds": 300,
  "user_timeout_seconds": 1800,
  "active_users": {
    "123": {
      "last_activity": 1705314900.0,
      "time_since_activity_seconds": 300.0,
      "will_be_removed_in_seconds": 1500.0
    }
  }
}
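The per-user fields in this response appear to be related by simple arithmetic: the time until removal is the timeout minus the idle time (1800 - 300 = 1500 in the example). The sketch below reproduces that calculation; the function name is hypothetical and clamping at zero is an assumption, not confirmed behaviour of the service.

```python
def seconds_until_removal(
    last_activity: float, now: float, user_timeout_seconds: float
) -> float:
    """Seconds left before an idle user would be dropped from tracking.

    Assumption: the value is clamped at zero once the timeout has passed.
    """
    time_since_activity = now - last_activity
    return max(0.0, user_timeout_seconds - time_since_activity)
```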
Prometheus
Configuration
Prometheus Configuration (prometheus/prometheus.yml):
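global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/alerts.yml"
scrape_configs:
  - job_name: 'fastapi'
    static_configs:
      - targets: ['fastapi:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s
  - job_name: 'news-processor'
    static_configs:
      - targets: ['news-processor:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s
  # Note: Prometheus scrapes HTTP metrics endpoints, not native database
  # protocols. 5432 and 6379 are the native PostgreSQL and Redis ports, so
  # these two targets normally need to point at exporters instead
  # (postgres_exporter listens on 9187 and redis_exporter on 9121 by default).
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres:5432']
    scrape_interval: 60s
  - job_name: 'redis'
    static_configs:
      - targets: ['redis:6379']
    scrape_interval: 60s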
Metrics Endpoints
FastAPI Metrics (/metrics):
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/v1/satellites"} 1234
http_requests_total{method="POST",endpoint="/v1/norad_db"} 567
http_requests_total{method="GET",endpoint="/v1/news/latest"} 890
# HELP http_request_duration_seconds Duration of HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250
# HELP active_users Current number of active users
# TYPE active_users gauge
active_users 25
# HELP user_activity_total Total user activity events
# TYPE user_activity_total counter
user_activity_total{user_id="123"} 45
Available Metrics:
- http_requests_total: Total HTTP requests by method and endpoint
- http_request_duration_seconds: Request duration distribution
- active_users: Current number of active users
- user_activity_total: User activity tracking
- database_connections: Database connection pool status
- system_resources: CPU, memory, and disk usage
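For custom monitoring scripts, the text format shown above can be read with a few lines of Python. This is a minimal sketch with a hypothetical function name; it covers only the simple cases in this document (comment lines, optional labels, a numeric value) and does not unescape label values.

```python
import re

# metric_name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body returned by the /metrics endpoint yields one tuple per sample line, which is usually enough for ad hoc checks.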
News Processor Metrics (/metrics):
# HELP news_articles_total Total articles collected
# TYPE news_articles_total counter
news_articles_total{source="LaunchLibrary2"} 1500
news_articles_total{source="RocketLaunch.Live"} 800
news_articles_total{source="SpaceflightNewsAPI"} 2000
# HELP news_collection_duration_seconds Time taken for collection
# TYPE news_collection_duration_seconds histogram
news_collection_duration_seconds_bucket{le="10"} 50
news_collection_duration_seconds_bucket{le="30"} 100
news_collection_duration_seconds_bucket{le="60"} 120
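Histogram buckets are cumulative: each le bucket counts every observation at or below its bound. Quantiles are estimated from these counts by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates the idea; it is a simplification of what Prometheus's histogram_quantile does, and it assumes the largest listed bucket holds all observations because the samples above omit the le="+Inf" bucket that real histograms include.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumptions of this sketch: the first bucket starts at 0 and the
    highest bucket contains every observation.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # interpolate linearly between the bucket's bounds
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return ordered[-1][0]
```

With the news collection buckets above, the median rank is 60 of 120 observations, which falls in the 10 to 30 second bucket and interpolates to about 14 seconds.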
Alerting Rules
Alert Configuration (prometheus/rules/alerts.yml):
groups:
  - name: ssa_alerts
    rules:
      - alert: FastAPIDown
        expr: up{job="fastapi"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "FastAPI service is down"
          description: "FastAPI service has been down for more than 1 minute"
      - alert: HighErrorRate
        # Ratio of 5xx requests to all requests. A bare rate() would be
        # errors per second, not a percentage.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "More than 10% of requests over the last 5 minutes returned 5xx responses"
      - alert: DatabaseConnectionIssues
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection issues"
          description: "PostgreSQL database is not responding"
      - alert: NewsCollectionFailure
        expr: news_collection_errors > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "News collection failures"
          description: "News processor is experiencing collection errors"
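When choosing thresholds for rate-based rules, it helps to keep two quantities apart: a rate over a counter is a per-second value, while a percentage needs the error rate divided by the total request rate. The sketch below shows both calculations on raw counter samples. The function names are hypothetical, and Prometheus's own rate() is more involved (it uses all samples in the window and extrapolates to the window edges).

```python
def per_second_rate(first: tuple[float, float], last: tuple[float, float]) -> float:
    """Average per-second increase between two (timestamp, counter) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    increase = v1 - v0
    if increase < 0:
        # Counter reset: assume the counter restarted from zero.
        increase = v1
    return increase / (t1 - t0)


def error_ratio(error_rate: float, total_rate: float) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return error_rate / total_rate if total_rate > 0 else 0.0
```

For example, 30 errors in 5 minutes is a rate of 0.1 errors per second, but against 50 requests per second that is an error ratio of only 0.2%.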
Grafana
Dashboards
FastAPI Dashboard (grafana/provisioning/dashboards/fastapi-dashboard.json):
{
  "dashboard": {
    "title": "FastAPI Service Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
Satellite Operations Dashboard (grafana/provisioning/dashboards/satellite-operations-dashboard.json):
{
  "dashboard": {
    "title": "Satellite Operations Dashboard",
    "panels": [
      {
        "title": "Satellite Queries",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(satellite_queries_total[5m])",
            "legendFormat": "Queries per second"
          }
        ]
      },
      {
        "title": "Pass Analysis Tasks",
        "type": "stat",
        "targets": [
          {
            "expr": "pass_analysis_tasks_total",
            "legendFormat": "Total tasks"
          }
        ]
      },
      {
        "title": "Maneuver Detections",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(maneuver_detections_total[5m])",
            "legendFormat": "Detections per second"
          }
        ]
      }
    ]
  }
}
Data Sources
Prometheus Data Source (grafana/provisioning/datasources/prometheus.yml):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
Usage
Access Monitoring
# Access Prometheus
open http://localhost:9090
# Access Grafana
open http://localhost:3003
# Default credentials: admin/admin
Check Service Status
# Check Prometheus status
docker-compose ps prometheus
# Check Grafana status
docker-compose ps grafana
# View Prometheus targets
curl http://localhost:9090/api/v1/targets
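The targets response is JSON, so a short script can pick out the targets that are not being scraped successfully. This is a minimal sketch with hypothetical function names; it relies on the activeTargets, health, scrapeUrl, and lastError fields of the Prometheus targets API and assumes Prometheus is reachable at the localhost address used above.

```python
import json
import urllib.request


def unhealthy_targets(targets_response: dict) -> list[dict]:
    """Return a summary of every active target whose health is not "up"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job", ""),
            "scrape_url": target.get("scrapeUrl", ""),
            "health": target.get("health", "unknown"),
            "last_error": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(prometheus_url: str = "http://localhost:9090",
                  timeout: float = 5.0) -> dict:
    """GET /api/v1/targets and decode the JSON body."""
    url = f"{prometheus_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling unhealthy_targets(fetch_targets()) returns an empty list when every configured target is up.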
View Metrics
# View FastAPI metrics
curl http://localhost:8005/metrics
# View news processor metrics
curl http://localhost:8007/metrics
# Query Prometheus
curl "http://localhost:9090/api/v1/query?query=up"
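The same instant query can be issued from a script. The sketch below builds the query URL and reads the vector result returned by the Prometheus HTTP API. The function names are hypothetical, and keying the result by the job label assumes one instance per job, which matches the static configuration in this document but may not hold elsewhere.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an /api/v1/query URL with the PromQL expression URL-encoded."""
    return f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(response: dict) -> dict[str, float]:
    """Map each series' job label to its current value."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    values = {}
    for series in response["data"]["result"]:
        job = series["metric"].get("job", "")
        _timestamp, value = series["value"]  # value arrives as a string
        values[job] = float(value)
    return values


def query(prometheus_url: str, promql: str, timeout: float = 5.0) -> dict[str, float]:
    """Run an instant query and return the parsed result."""
    url = build_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

For the up query, a value of 1.0 means the last scrape of that job succeeded and 0.0 means it failed.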
Monitoring Best Practices
Key Metrics to Monitor
- Service Health:
  - Service uptime (up metric)
  - Response times
  - Error rates
- Performance:
  - Request rates
  - Database query performance
  - Memory and CPU usage
- Business Metrics:
  - Satellite queries per day
  - Pass analysis completion rate
  - News articles collected
- Infrastructure:
  - Database connections
  - Redis memory usage
  - Disk space
Alerting Strategy
- Critical Alerts:
  - Service down
  - Database unavailable
  - High error rates
- Warning Alerts:
  - Performance degradation
  - Resource usage high
  - Collection failures
- Info Alerts:
  - Service restarts
  - Configuration changes
  - Maintenance events
Troubleshooting
Common Issues
- Prometheus Not Scraping:
  # Check targets
  curl http://localhost:9090/api/v1/targets
  # List active targets only (URL quoted so the shell does not interpret "?")
  curl "http://localhost:9090/api/v1/targets?state=active"
- Grafana Not Loading:
  # Check Grafana logs
  docker-compose logs grafana
  # Check data sources (the Grafana API requires authentication; default credentials shown)
  curl -u admin:admin http://localhost:3003/api/datasources
- Metrics Not Available:
  # Check service metrics endpoint
  curl http://localhost:8005/metrics
  # Check Prometheus configuration
  docker exec -it prometheus cat /etc/prometheus/prometheus.yml
Performance Tuning
- Prometheus Storage:
  - Configure retention period
  - Optimize scrape intervals
  - Use remote storage for long-term data
- Grafana Optimization:
  - Limit dashboard refresh rates
  - Use query optimization
  - Configure caching
- Resource Management:
  - Monitor memory usage
  - Configure resource limits
  - Use persistent volumes
Integration
With FastAPI Service
- Metrics Endpoint: /metrics exposes Prometheus metrics
- Health Checks: Built-in health check endpoints
- Custom Metrics: Application-specific metrics
With Other Services
- News Processor: Metrics exposed on port 8007
- TLE Processor: Log-based monitoring
- Database: PostgreSQL metrics via exporter
External Integrations
- Slack: Alert notifications
- Email: Alert notifications
- PagerDuty: Incident management
- Webhooks: Custom integrations