Monitoring Services

Overview

The monitoring stack consists of Prometheus for metrics collection and Grafana for visualization. These services provide comprehensive monitoring, alerting, and observability for the entire SSA backend application.

What These Services Do

  • Prometheus: Collects, stores, and queries time-series metrics from all services
  • Grafana: Provides rich dashboards and visualizations for monitoring data
  • Alerting: Configurable alerts for service health, performance, and errors
  • Metrics Storage: Long-term storage of historical metrics data
  • Service Discovery: Automatic discovery of services and endpoints

How It Works

  1. Metrics Collection: Prometheus scrapes metrics from all services
  2. Data Storage: Time-series data stored in Prometheus
  3. Visualization: Grafana queries Prometheus for dashboard data
  4. Alerting: Prometheus rules trigger alerts based on thresholds
  5. Service Discovery: Automatic detection of new services and endpoints
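
The scrape step above can be sketched with the standard library alone: a service exposes a /metrics endpoint returning plain text in the Prometheus exposition format, and Prometheus periodically fetches it over HTTP. This is an illustrative stand-in (real services would use a Prometheus client library); the handler, port, and counter value are assumptions, not the application's actual code.

```python
# Minimal sketch: a service exposing /metrics for Prometheus to scrape.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

REQUEST_COUNT = 1234  # stand-in for a real, incrementing counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = (
                "# HELP http_requests_total Total number of HTTP requests\n"
                "# TYPE http_requests_total counter\n"
                f'http_requests_total{{method="GET",endpoint="/v1/satellites"}} {REQUEST_COUNT}\n'
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an OS-assigned port and perform one "scrape", as Prometheus would.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
text = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
server.shutdown()
```

The `# HELP` and `# TYPE` comment lines are part of the exposition format and are parsed by Prometheus, not ignored.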

Core Monitoring Endpoints

Health Check

Endpoint: GET /health

Description: Comprehensive health check endpoint that provides:

  • Application status
  • Database connectivity status
  • System information
  • Service health indicators

Authentication: Required (JWT Bearer Token)

Response Example:

{
  "status": "healthy",
  "timestamp": 1705315200.0,
  "version": "1.0.0",
  "services": {
    "api": "healthy",
    "database": "healthy",
    "logging": "healthy"
  },
  "system": {
    "hostname": "server-hostname",
    "platform": "Linux",
    "python_version": "3.11.0"
  }
}
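
A caller should treat the endpoint as healthy only when the top-level status and every sub-service report healthy. A minimal sketch of that check (the `is_healthy` helper is illustrative, not part of the API):

```python
# Deciding overall health from a /health payload like the example above.
health = {
    "status": "healthy",
    "services": {"api": "healthy", "database": "healthy", "logging": "healthy"},
}

def is_healthy(payload: dict) -> bool:
    """True only if the top-level status and all sub-services are healthy."""
    return payload.get("status") == "healthy" and all(
        v == "healthy" for v in payload.get("services", {}).values()
    )
```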

Metrics Endpoint

Endpoint: GET /metrics

Description: Prometheus metrics endpoint exposing:

  • HTTP request metrics
  • Performance metrics
  • System resource metrics
  • User activity metrics
  • Database metrics

Authentication: Not required (public endpoint)

Use Cases:

  • Prometheus scraping for metrics collection
  • Grafana dashboard data sources
  • Custom monitoring scripts
  • Performance analysis

Debug Endpoints

Endpoint: GET /debug/active-users

Description: Debug information for the active-users tracking system.

Authentication: Required (JWT Bearer Token)

Response Example:

{
  "total_active_users": 25,
  "cleanup_interval_seconds": 300,
  "user_timeout_seconds": 1800,
  "active_users": {
    "123": {
      "last_activity": 1705314900.0,
      "time_since_activity_seconds": 300.0,
      "will_be_removed_in_seconds": 1500.0
    }
  }
}
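
The bookkeeping this payload describes, users are dropped once their last activity is older than the timeout, can be sketched as follows. Names and structure are illustrative assumptions, not the application's implementation:

```python
# Sketch of active-user cleanup: drop entries idle longer than the timeout.
import time

USER_TIMEOUT_SECONDS = 1800  # matches user_timeout_seconds in the payload above

def cleanup(active_users, now=None):
    """Return only the users whose last activity is within the timeout window.

    active_users maps user id -> last_activity (Unix timestamp).
    """
    now = time.time() if now is None else now
    return {
        uid: last for uid, last in active_users.items()
        if now - last <= USER_TIMEOUT_SECONDS
    }

# At now=1705315200.0: user "123" is 300 s idle (kept),
# user "456" is 5200 s idle (dropped).
remaining = cleanup({"123": 1705314900.0, "456": 1705310000.0}, now=1705315200.0)
```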

Prometheus

Configuration

Prometheus Configuration (prometheus/prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/alerts.yml"

scrape_configs:
  - job_name: 'fastapi'
    static_configs:
      - targets: ['fastapi:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'news-processor'
    static_configs:
      - targets: ['news-processor:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres:5432']
    scrape_interval: 60s

  - job_name: 'redis'
    static_configs:
      - targets: ['redis:6379']
    scrape_interval: 60s

Metrics Endpoints

FastAPI Metrics (/metrics):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/v1/satellites"} 1234
http_requests_total{method="POST",endpoint="/v1/norad_db"} 567
http_requests_total{method="GET",endpoint="/v1/news/latest"} 890

# HELP http_request_duration_seconds Duration of HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250

# HELP active_users Current number of active users
# TYPE active_users gauge
active_users 25

# HELP user_activity_total Total user activity events
# TYPE user_activity_total counter
user_activity_total{user_id="123"} 45
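
The exposition text above follows a simple line format: optional `# HELP`/`# TYPE` comments, then `name{label="value",...} sample` lines. A stdlib-only parsing sketch (real tooling would use a proper Prometheus parser; the regex is a simplification that ignores metric names containing colons and label values containing commas):

```python
# Parse Prometheus exposition text into (name, labels, value) samples.
import re

SAMPLE_RE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if m:
            name, raw_labels, value = m.groups()
            labels = {}
            if raw_labels:
                for kv in raw_labels.split(","):
                    k, v = kv.split("=", 1)
                    labels[k] = v.strip('"')
            samples.append((name, labels, float(value)))
    return samples

SAMPLE = (
    "# HELP http_requests_total Total number of HTTP requests\n"
    "# TYPE http_requests_total counter\n"
    'http_requests_total{method="GET",endpoint="/v1/satellites"} 1234\n'
    "active_users 25\n"
)
samples = parse_metrics(SAMPLE)
```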

Available Metrics:

  • http_requests_total: Total HTTP requests by method and endpoint
  • http_request_duration_seconds: Request duration distribution
  • active_users: Current number of active users
  • user_activity_total: User activity tracking
  • database_connections: Database connection pool status
  • system_resources: CPU, memory, and disk usage

News Processor Metrics (/metrics):

# HELP news_articles_total Total articles collected
# TYPE news_articles_total counter
news_articles_total{source="LaunchLibrary2"} 1500
news_articles_total{source="RocketLaunch.Live"} 800
news_articles_total{source="SpaceflightNewsAPI"} 2000

# HELP news_collection_duration_seconds Time taken for collection
# TYPE news_collection_duration_seconds histogram
news_collection_duration_seconds_bucket{le="10"} 50
news_collection_duration_seconds_bucket{le="30"} 100
news_collection_duration_seconds_bucket{le="60"} 120

Alerting Rules

Alert Configuration (prometheus/rules/alerts.yml):

groups:
  - name: ssa_alerts
    rules:
      - alert: FastAPIDown
        expr: up{job="fastapi"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "FastAPI service is down"
          description: "FastAPI service has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "5xx error rate has exceeded 0.1 requests/second over the last 5 minutes"

      - alert: DatabaseConnectionIssues
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection issues"
          description: "PostgreSQL database is not responding"

      - alert: NewsCollectionFailure
        expr: news_collection_errors > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "News collection failures"
          description: "News processor is experiencing collection errors"
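
The `rate(...[5m])` expression used in these rules is the per-second increase of a counter over the window. A sketch of that arithmetic (the `counter_rate` helper is illustrative; PromQL also handles counter resets and extrapolation, which this ignores):

```python
# What rate(http_requests_total{status=~"5.."}[5m]) computes, from two samples.
def counter_rate(earlier, later, window_seconds):
    """Per-second rate of a counter, assuming no reset inside the window."""
    return max(later - earlier, 0.0) / window_seconds

# 30 new 5xx responses over 5 minutes -> 0.1 errors/second,
# exactly the HighErrorRate threshold above.
rate_5xx = counter_rate(earlier=1200, later=1230, window_seconds=300)
```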

Grafana

Dashboards

FastAPI Dashboard (grafana/provisioning/dashboards/fastapi-dashboard.json):

{
  "dashboard": {
    "title": "FastAPI Service Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
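
The Response Time panel's `histogram_quantile(0.95, ...)` estimates a percentile from cumulative bucket counts: find the bucket containing the target rank, then interpolate linearly inside it. A simplified sketch (Prometheus additionally handles the `+Inf` bucket and per-series aggregation, which this omits):

```python
# Estimate a quantile from cumulative histogram buckets, Prometheus-style.
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank.
            return lower_bound + (le - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = le, count
    return buckets[-1][0]

# Buckets from the FastAPI metrics example:
# le=0.1 -> 1000, le=0.5 -> 1200, le=1.0 -> 1250.
# rank = 0.95 * 1250 = 1187.5 falls in the 0.1..0.5 bucket -> 0.475 s.
p95 = histogram_quantile(0.95, [(0.1, 1000), (0.5, 1200), (1.0, 1250)])
```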

Satellite Operations Dashboard (grafana/provisioning/dashboards/satellite-operations-dashboard.json):

{
  "dashboard": {
    "title": "Satellite Operations Dashboard",
    "panels": [
      {
        "title": "Satellite Queries",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(satellite_queries_total[5m])",
            "legendFormat": "Queries per second"
          }
        ]
      },
      {
        "title": "Pass Analysis Tasks",
        "type": "stat",
        "targets": [
          {
            "expr": "pass_analysis_tasks_total",
            "legendFormat": "Total tasks"
          }
        ]
      },
      {
        "title": "Maneuver Detections",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(maneuver_detections_total[5m])",
            "legendFormat": "Detections per second"
          }
        ]
      }
    ]
  }
}

Data Sources

Prometheus Data Source (grafana/provisioning/datasources/prometheus.yml):

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Usage

Access Monitoring

# Access Prometheus
open http://localhost:9090

# Access Grafana
open http://localhost:3003
# Default credentials: admin/admin

Check Service Status

# Check Prometheus status
docker-compose ps prometheus

# Check Grafana status
docker-compose ps grafana

# View Prometheus targets
curl http://localhost:9090/api/v1/targets

View Metrics

# View FastAPI metrics
curl http://localhost:8005/metrics

# View news processor metrics
curl http://localhost:8007/metrics

# Query Prometheus
curl "http://localhost:9090/api/v1/query?query=up"

Monitoring Best Practices

Key Metrics to Monitor

  1. Service Health:

    • Service uptime (up metric)
    • Response times
    • Error rates
  2. Performance:

    • Request rates
    • Database query performance
    • Memory and CPU usage
  3. Business Metrics:

    • Satellite queries per day
    • Pass analysis completion rate
    • News articles collected
  4. Infrastructure:

    • Database connections
    • Redis memory usage
    • Disk space

Alerting Strategy

  1. Critical Alerts:

    • Service down
    • Database unavailable
    • High error rates
  2. Warning Alerts:

    • Performance degradation
    • Resource usage high
    • Collection failures
  3. Info Alerts:

    • Service restarts
    • Configuration changes
    • Maintenance events

Troubleshooting

Common Issues

  1. Prometheus Not Scraping:

    # Check targets
    curl http://localhost:9090/api/v1/targets

    # Check service discovery
    curl "http://localhost:9090/api/v1/targets?state=active"
  2. Grafana Not Loading:

    # Check Grafana logs
    docker-compose logs grafana

    # Check data source
    curl http://localhost:3003/api/datasources
  3. Metrics Not Available:

    # Check service metrics endpoint
    curl http://localhost:8005/metrics

    # Check Prometheus configuration
    docker exec -it prometheus cat /etc/prometheus/prometheus.yml

Performance Tuning

  1. Prometheus Storage:

    • Configure retention period
    • Optimize scrape intervals
    • Use remote storage for long-term data
  2. Grafana Optimization:

    • Limit dashboard refresh rates
    • Use query optimization
    • Configure caching
  3. Resource Management:

    • Monitor memory usage
    • Configure resource limits
    • Use persistent volumes

Integration

With FastAPI Service

  • Metrics Endpoint: /metrics exposes Prometheus metrics
  • Health Checks: Built-in health check endpoints
  • Custom Metrics: Application-specific metrics

With Other Services

  • News Processor: Metrics exposed on port 8007
  • TLE Processor: Log-based monitoring
  • Database: PostgreSQL metrics via exporter

External Integrations

  • Slack: Alert notifications
  • Email: Alert notifications
  • PagerDuty: Incident management
  • Webhooks: Custom integrations