Monitoring Services

Overview

The monitoring stack consists of Prometheus for metrics collection and Grafana for visualization. Together they provide monitoring, alerting, and observability for the SSA backend application.

What These Services Do

Prometheus: Collects, stores, and queries time-series metrics from all services
Grafana: Provides rich dashboards and visualizations for monitoring data
Alerting: Configurable alerts for service health, performance, and errors
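Metrics Storage: Storage of historical metrics data in Prometheus, with a configurable retention period
Service Discovery: Prometheus finds scrape targets from its configuration (the configuration shown below lists them under static_configs)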

How It Works

Metrics Collection: Prometheus scrapes metrics from all services
Data Storage: Time-series data stored in Prometheus
Visualization: Grafana queries Prometheus for dashboard data
Alerting: Prometheus rules trigger alerts based on thresholds
Service Discovery: Prometheus picks up services and endpoints from the scrape jobs defined in its configuration

Core Monitoring Endpoints

Health Check

Endpoint: GET /health

Description: Comprehensive health check endpoint that provides:

Application status
Database connectivity status
System information
Service health indicators

Authentication: Required (JWT Bearer Token)

Response Example:

{
  "status": "healthy",
  "timestamp": 1705315200.0,
  "version": "1.0.0",
  "services": {
    "api": "healthy",
    "database": "healthy",
    "logging": "healthy"
  },
  "system": {
    "hostname": "server-hostname",
    "platform": "Linux",
    "python_version": "3.11.0"
  }
}
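As a rough illustration of how a response with this shape could be assembled, the sketch below builds the payload from a set of per-service probes. The function name build_health_payload, the probe interface, and the "unhealthy" status string are assumptions made for the example; they are not taken from the SSA backend code.

```python
import platform
import socket
import time


def build_health_payload(service_checks, version="1.0.0"):
    """Assemble a health response shaped like the example above.

    service_checks maps a service name to a zero-argument callable that
    returns True when the service is reachable (hypothetical interface).
    """
    services = {}
    for name, check in service_checks.items():
        try:
            services[name] = "healthy" if check() else "unhealthy"
        except Exception:
            # A failing probe should degrade the report, not crash the endpoint.
            services[name] = "unhealthy"

    all_ok = all(state == "healthy" for state in services.values())
    return {
        "status": "healthy" if all_ok else "unhealthy",
        "timestamp": time.time(),
        "version": version,
        "services": services,
        "system": {
            "hostname": socket.gethostname(),
            "platform": platform.system(),
            "python_version": platform.python_version(),
        },
    }


if __name__ == "__main__":
    print(build_health_payload({"api": lambda: True, "database": lambda: True}))
```

Wrapping each probe in its own try/except keeps one broken dependency from turning the health check itself into a 500 response.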

Metrics Endpoint

Endpoint: GET /metrics

Description: Prometheus metrics endpoint exposing:

HTTP request metrics
Performance metrics
System resource metrics
User activity metrics
Database metrics

Authentication: Not required (public endpoint)

Use Cases:

Prometheus scraping for metrics collection
Grafana dashboard data sources
Custom monitoring scripts
Performance analysis
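For the custom monitoring script use case, the text returned by /metrics can be parsed with the standard library alone. The following is a minimal sketch, not a full parser: it assumes samples without trailing timestamps and label values without escaped quotes, and the name parse_metrics is illustrative.

```python
import re

# Sample line: metric name, optional {labels}, then the value.
# Trailing timestamps and escaped quotes in label values are not handled.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    example = 'http_requests_total{method="GET",endpoint="/v1/satellites"} 1234\nactive_users 25\n'
    for sample in parse_metrics(example):
        print(sample)
```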

Debug Endpoints

Endpoint: GET /debug/active-users

Description: Debug information for the active-user tracking system.

Authentication: Required (JWT Bearer Token)

Response Example:

{
  "total_active_users": 25,
  "cleanup_interval_seconds": 300,
  "user_timeout_seconds": 1800,
  "active_users": {
    "123": {
      "last_activity": 1705314900.0,
      "time_since_activity_seconds": 300.0,
      "will_be_removed_in_seconds": 1500.0
    }
  }
}
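The derived fields in this response follow from the last activity timestamp and the configured timeout. A small sketch of that arithmetic, assuming the tracker keeps a mapping of user ID to last-activity Unix timestamp (the function name and that storage layout are hypothetical):

```python
def describe_active_users(last_activity, now, user_timeout_seconds=1800,
                          cleanup_interval_seconds=300):
    """Build a debug report shaped like the response example above.

    last_activity maps a user id to the Unix timestamp of its latest
    activity (assumed storage layout, not the actual implementation).
    """
    active = {}
    for user_id, seen_at in last_activity.items():
        idle = now - seen_at
        if idle >= user_timeout_seconds:
            continue  # timed out; a cleanup pass would remove this entry
        active[str(user_id)] = {
            "last_activity": seen_at,
            "time_since_activity_seconds": idle,
            "will_be_removed_in_seconds": user_timeout_seconds - idle,
        }
    return {
        "total_active_users": len(active),
        "cleanup_interval_seconds": cleanup_interval_seconds,
        "user_timeout_seconds": user_timeout_seconds,
        "active_users": active,
    }


if __name__ == "__main__":
    print(describe_active_users({123: 1705314900.0}, now=1705315200.0))
```

With the values from the example, a user last seen 300 seconds ago under an 1800-second timeout has 1500 seconds left before removal.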

Prometheus

Configuration

Prometheus Configuration (prometheus/prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/alerts.yml"

scrape_configs:
  - job_name: 'fastapi'
    static_configs:
      - targets: ['fastapi:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'news-processor'
    static_configs:
      - targets: ['news-processor:8000']
    metrics_path: '/metrics'
    scrape_interval: 30s

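  # Note: Prometheus cannot scrape PostgreSQL or Redis on their native
  # protocol ports. These jobs only yield metrics (such as pg_up) when the
  # targets are exporter endpoints, e.g. postgres_exporter and redis_exporter.
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres:5432']
    scrape_interval: 60s

  - job_name: 'redis'
    static_configs:
      - targets: ['redis:6379']
    scrape_interval: 60s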

Metrics Endpoints

FastAPI Metrics (/metrics):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/v1/satellites"} 1234
http_requests_total{method="POST",endpoint="/v1/norad_db"} 567
http_requests_total{method="GET",endpoint="/v1/news/latest"} 890

# HELP http_request_duration_seconds Duration of HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250

# HELP active_users Current number of active users
# TYPE active_users gauge
active_users 25

# HELP user_activity_total Total user activity events
# TYPE user_activity_total counter
user_activity_total{user_id="123"} 45

Available Metrics:

http_requests_total: Total HTTP requests by method and endpoint
http_request_duration_seconds: Request duration distribution
active_users: Current number of active users
user_activity_total: User activity tracking
database_connections: Database connection pool status
system_resources: CPU, memory, and disk usage

News Processor Metrics (/metrics):

# HELP news_articles_total Total articles collected
# TYPE news_articles_total counter
news_articles_total{source="LaunchLibrary2"} 1500
news_articles_total{source="RocketLaunch.Live"} 800
news_articles_total{source="SpaceflightNewsAPI"} 2000

# HELP news_collection_duration_seconds Time taken for collection
# TYPE news_collection_duration_seconds histogram
news_collection_duration_seconds_bucket{le="10"} 50
news_collection_duration_seconds_bucket{le="30"} 100
news_collection_duration_seconds_bucket{le="60"} 120

Alerting Rules

Alert Configuration (prometheus/rules/alerts.yml):

groups:
  - name: ssa_alerts
    rules:
      - alert: FastAPIDown
        expr: up{job="fastapi"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "FastAPI service is down"
          description: "FastAPI service has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 10% for the last 5 minutes"

      - alert: DatabaseConnectionIssues
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection issues"
          description: "PostgreSQL database is not responding"

      - alert: NewsCollectionFailure
        expr: news_collection_errors > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "News collection failures"
          description: "News processor is experiencing collection errors"
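A per-second rate of 5xx responses and an error percentage are different quantities: the HighErrorRate description is phrased as a percentage, which corresponds to the 5xx rate divided by the total request rate. The sketch below illustrates the difference using two readings of each counter taken a window apart. The function names are illustrative, and counter resets, which PromQL's rate() compensates for, are ignored here.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter across a window.

    Simplified: a decrease (counter reset) is treated as zero increase.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return max(last_value - first_value, 0.0) / window_seconds


def error_ratio(errors_first, errors_last, total_first, total_last,
                window_seconds=300):
    """Fraction of requests that failed during the window (0.0 to 1.0)."""
    total_rate = per_second_rate(total_first, total_last, window_seconds)
    if total_rate == 0:
        return 0.0  # no traffic, so no meaningful error ratio
    error_rate = per_second_rate(errors_first, errors_last, window_seconds)
    return error_rate / total_rate


if __name__ == "__main__":
    # 30 errors out of 200 requests in 5 minutes.
    print(per_second_rate(10, 40, 300))
    print(error_ratio(10, 40, 1000, 1200))
```

In this made-up example the 5xx rate is 0.1 errors per second while the error ratio is 15%, so a threshold of 0.1 means something different depending on which of the two the expression computes.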

Grafana

Dashboards

FastAPI Dashboard (grafana/provisioning/dashboards/fastapi-dashboard.json):

{
  "dashboard": {
    "title": "FastAPI Service Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
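The Response Time panel relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The following sketch mirrors that idea in plain Python. It is a simplification of the PromQL function, and the +Inf bucket in the usage example is an assumption added because the quantile is undefined without it.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include a +Inf bucket holding the total count.
    Linear interpolation assumes observations are spread evenly inside
    each bucket, and the lowest bucket is assumed to start at 0.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank lies beyond the last finite bound, or the bucket is
                # empty: report the lower edge instead of interpolating.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 1000), (0.5, 1200), (1.0, 1250), (math.inf, 1250)]
    print(histogram_quantile(0.95, sample))
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries cover the latency range of interest.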

Satellite Operations Dashboard (grafana/provisioning/dashboards/satellite-operations-dashboard.json):

{
  "dashboard": {
    "title": "Satellite Operations Dashboard",
    "panels": [
      {
        "title": "Satellite Queries",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(satellite_queries_total[5m])",
            "legendFormat": "Queries per second"
          }
        ]
      },
      {
        "title": "Pass Analysis Tasks",
        "type": "stat",
        "targets": [
          {
            "expr": "pass_analysis_tasks_total",
            "legendFormat": "Total tasks"
          }
        ]
      },
      {
        "title": "Maneuver Detections",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(maneuver_detections_total[5m])",
            "legendFormat": "Detections per second"
          }
        ]
      }
    ]
  }
}

Data Sources

Prometheus Data Source (grafana/provisioning/datasources/prometheus.yml):

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Usage

Access Monitoring

# Access Prometheus (open is macOS-specific; on Linux use xdg-open or a browser)
open http://localhost:9090

# Access Grafana
open http://localhost:3003
# Default credentials: admin/admin

Check Service Status

# Check Prometheus status
docker-compose ps prometheus

# Check Grafana status
docker-compose ps grafana

# View Prometheus targets
curl http://localhost:9090/api/v1/targets

View Metrics

# View FastAPI metrics
curl http://localhost:8005/metrics

# View news processor metrics
curl http://localhost:8007/metrics

# Query Prometheus
curl "http://localhost:9090/api/v1/query?query=up"
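The same instant query can be issued from a script. This sketch uses only the standard library and assumes Prometheus is reachable at the address used above and that the query returns an instant vector; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """URL for the Prometheus instant-query endpoint with the query encoded."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    rows = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        rows.append((series["metric"], float(value)))
    return rows


def query(base_url, promql, timeout=5):
    """Run the query against a live Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


if __name__ == "__main__":
    for labels, value in query("http://localhost:9090", "up"):
        print(labels.get("job"), labels.get("instance"), value)
```

Encoding the query with urlencode matters once the expression contains braces or quotes, for example up{job="fastapi"}.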

Monitoring Best Practices

Key Metrics to Monitor

Service Health:
- Service uptime (up metric)
- Response times
- Error rates
Performance:
- Request rates
- Database query performance
- Memory and CPU usage
Business Metrics:
- Satellite queries per day
- Pass analysis completion rate
- News articles collected
Infrastructure:
- Database connections
- Redis memory usage
- Disk space

Alerting Strategy

Critical Alerts:
- Service down
- Database unavailable
- High error rates
Warning Alerts:
- Performance degradation
- Resource usage high
- Collection failures
Info Alerts:
- Service restarts
- Configuration changes
- Maintenance events

Troubleshooting

Common Issues

Prometheus Not Scraping:

# Check targets
curl http://localhost:9090/api/v1/targets

# List only active targets (quote the URL so the shell does not interpret "?")
curl "http://localhost:9090/api/v1/targets?state=active"
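The targets response is verbose, so when diagnosing scrape problems it can help to reduce it to the targets that are not healthy. A minimal sketch, assuming the JSON from /api/v1/targets has already been loaded into a dictionary; the function name is illustrative.

```python
def unhealthy_targets(payload):
    """List active targets whose health is anything other than "up"."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") == "up":
            continue
        problems.append({
            "job": target.get("labels", {}).get("job", ""),
            "scrape_url": target.get("scrapeUrl", ""),
            "health": target.get("health", "unknown"),
            "last_error": target.get("lastError", ""),
        })
    return problems


if __name__ == "__main__":
    # Hand-written example payload, not captured from a real server.
    example = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"job": "fastapi"}, "scrapeUrl": "http://fastapi:8000/metrics",
                 "health": "up", "lastError": ""},
                {"labels": {"job": "postgres"}, "scrapeUrl": "http://postgres:5432/metrics",
                 "health": "down", "lastError": "connection refused"},
            ]
        },
    }
    for problem in unhealthy_targets(example):
        print(problem)
```

The last_error field usually points directly at the cause, such as a refused connection or a target that does not speak HTTP.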

Grafana Not Loading:

# Check Grafana logs
docker-compose logs grafana

# Check data source (the Grafana API requires authentication; default credentials shown)
curl -u admin:admin http://localhost:3003/api/datasources

Metrics Not Available:

# Check service metrics endpoint
curl http://localhost:8005/metrics

# Check Prometheus configuration
docker exec -it prometheus cat /etc/prometheus/prometheus.yml

Performance Tuning

Prometheus Storage:
- Configure retention period
- Optimize scrape intervals
- Use remote storage for long-term data
Grafana Optimization:
- Limit dashboard refresh rates
- Use query optimization
- Configure caching
Resource Management:
- Monitor memory usage
- Configure resource limits
- Use persistent volumes

Integration

With FastAPI Service

Metrics Endpoint: /metrics exposes Prometheus metrics
Health Checks: Built-in health check endpoints
Custom Metrics: Application-specific metrics
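To make the idea of an application-specific metric concrete, the sketch below implements a tiny labelled counter and renders it in the text format that /metrics serves. It is a dependency-free illustration only: the class name is hypothetical, label values are not escaped, and a real service would more likely rely on a Prometheus client library. The metric name satellite_queries_total is borrowed from the dashboard section.

```python
import threading


class LabelledCounter:
    """Minimal counter with labels, rendered in Prometheus text format."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = list(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        with self._lock:  # request handlers may run concurrently
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            items = sorted(self._values.items())
        for key, value in items:
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            label_part = "{" + pairs + "}" if pairs else ""
            lines.append(f"{self.name}{label_part} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    queries = LabelledCounter("satellite_queries_total",
                              "Total satellite queries", ["endpoint"])
    queries.inc(endpoint="/v1/satellites")
    queries.inc(endpoint="/v1/satellites")
    print(queries.render(), end="")
```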

With Other Services

News Processor: Metrics exposed on host port 8007 (Prometheus scrapes news-processor:8000 inside the Docker network)
TLE Processor: Log-based monitoring
Database: PostgreSQL metrics via exporter

External Integrations

Slack: Alert notifications
Email: Alert notifications
PagerDuty: Incident management
Webhooks: Custom integrations
