News Processor Service

Overview

The News Processor Service is a containerized Python application that continuously fetches space-related news articles from multiple sources and stores them in a PostgreSQL database. It provides comprehensive monitoring, logging, and metrics collection.

What This Service Does

  • Multi-source data collection: Fetches news from LaunchLibrary2, RocketLaunch.Live, and Spaceflight News API
  • Automatic deduplication: Prevents duplicate articles using title, source, and date constraints
  • Structured logging: JSON-formatted logs with structured data
  • Prometheus metrics: Comprehensive metrics for monitoring and alerting
  • Graceful shutdown: Handles SIGTERM and SIGINT signals properly
  • Error handling: Robust error handling with retry mechanisms
  • Database integration: Uses existing PostgreSQL service with automatic table creation
  • Slack notifications: Real-time alerts for service status, errors, and data collection results
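
The graceful-shutdown behavior can be sketched in a few lines: register one handler for SIGTERM and SIGINT that flips an event, which the main loop checks between cycles. This illustrates the pattern, not the service's actual code:

```python
import signal
import threading

# Event flipped by the signal handler; the main loop polls it between cycles.
shutdown = threading.Event()

def handle_signal(signum, frame):
    # Request a clean stop; the in-progress cycle finishes before exit.
    shutdown.set()

# One handler covers both termination signals.
for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, handle_signal)
```

Because the handler only sets a flag, a database write in the current cycle is never interrupted mid-transaction.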

How It Works

  1. Startup: Service initializes database connection and creates required tables
  2. Data Collection: Fetches news from multiple APIs in parallel
  3. Deduplication: Checks for existing articles before insertion
  4. Storage: Stores articles in PostgreSQL with TimescaleDB features
  5. Metrics: Collects and exposes Prometheus metrics
  6. Notifications: Sends Slack notifications for status and errors
  7. Scheduling: Runs continuously with configurable intervals
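
Taken together, the steps above amount to one loop. The sketch below is a simplification: `collect` stands in for the whole fetch/deduplicate/store/notify cycle, and the intervals match the defaults documented under Configuration:

```python
import time

SLEEP_INTERVAL = 1800   # seconds between successful cycles
RETRY_INTERVAL = 300    # seconds to wait after a failed cycle

def run_forever(collect, should_stop, sleep=time.sleep):
    # collect() performs one full cycle: fetch, deduplicate, store,
    # update metrics, and send notifications.
    # should_stop() lets a signal handler end the loop between cycles.
    while not should_stop():
        try:
            collect()
            interval = SLEEP_INTERVAL
        except Exception:
            # A failed cycle delays the next attempt instead of
            # crashing the service.
            interval = RETRY_INTERVAL
        sleep(interval)
```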

Data Sources

  • LaunchLibrary2
  • RocketLaunch.Live
  • Spaceflight News API

Configuration

Environment Variables

Variable            Default      Description
DB_HOST             postgres     PostgreSQL host
DB_PORT             5432         PostgreSQL port
POSTGRES_DB         space_data   Database name
POSTGRES_USER       admin        Database username
POSTGRES_PASSWORD   psswd        Database password
METRICS_PORT        8000         Prometheus metrics port
SLEEP_INTERVAL      1800         Sleep interval between cycles (seconds)
RETRY_INTERVAL      300          Retry interval on errors (seconds)
SLACK_WEBHOOK_URL   -            Slack webhook URL for notifications
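
Reading these variables in Python might look like the sketch below; the defaults mirror the table, and the dictionary keys are illustrative rather than the service's actual internals:

```python
import os

def load_config(env=os.environ):
    # Defaults match the table above; SLACK_WEBHOOK_URL has no default.
    return {
        "db_host": env.get("DB_HOST", "postgres"),
        "db_port": int(env.get("DB_PORT", "5432")),
        "db_name": env.get("POSTGRES_DB", "space_data"),
        "db_user": env.get("POSTGRES_USER", "admin"),
        "db_password": env.get("POSTGRES_PASSWORD", "psswd"),
        "metrics_port": int(env.get("METRICS_PORT", "8000")),
        "sleep_interval": int(env.get("SLEEP_INTERVAL", "1800")),
        "retry_interval": int(env.get("RETRY_INTERVAL", "300")),
        "slack_webhook_url": env.get("SLACK_WEBHOOK_URL"),
    }
```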

Database Schema

The service creates and manages the space_news_prod table:

CREATE TABLE space_news_prod (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title VARCHAR(500) NOT NULL,
    short_description TEXT,
    source VARCHAR(100) NOT NULL,
    mission_name VARCHAR(200),
    date TIMESTAMP WITH TIME ZONE,
    url TEXT,
    slug VARCHAR(200),
    last_updated TIMESTAMP WITH TIME ZONE,
    country_code VARCHAR(10),
    mission_description TEXT,
    off_url TEXT,
    insertion_time TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(title, source, date)
);
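
The UNIQUE(title, source, date) constraint is what makes deduplication work: inserting with ON CONFLICT ... DO NOTHING silently skips articles already stored, so re-fetching the same article is a no-op. A self-contained illustration using an in-memory SQLite stand-in (the real service targets PostgreSQL, where the same ON CONFLICT syntax applies):

```python
import sqlite3

# In-memory stand-in for the Postgres table, reduced to the three
# columns that participate in the unique constraint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE space_news_prod (
        title  TEXT NOT NULL,
        source TEXT NOT NULL,
        date   TEXT,
        UNIQUE (title, source, date)
    )
""")

def insert_article(conn, title, source, date):
    # DO NOTHING skips rows that hit the unique constraint.
    # Returns True if the row was actually inserted.
    cur = conn.execute(
        "INSERT INTO space_news_prod (title, source, date) VALUES (?, ?, ?) "
        "ON CONFLICT (title, source, date) DO NOTHING",
        (title, source, date),
    )
    return cur.rowcount == 1
```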

Usage

Start the Service

# Build and start all services
docker-compose up -d

# Start only news processor
docker-compose up -d news-processor

Monitor Logs

# Follow live logs
docker-compose logs -f news-processor

# View specific log file
tail -f logs/news_processor_$(date +%Y-%m-%d).log

Check Service Status

# Check if service is running
docker-compose ps news-processor

# Check metrics endpoint
curl http://localhost:8007/metrics

Data Flow

  1. Initialization: Service connects to PostgreSQL and creates required tables
  2. API Collection: Fetches data from multiple news sources in parallel
  3. Data Processing: Parses and normalizes article data
  4. Deduplication: Checks for existing articles using title, source, and date
  5. Storage: Inserts new articles into PostgreSQL database
  6. Metrics: Updates Prometheus metrics with collection statistics
  7. Notifications: Sends Slack notifications for status and errors
  8. Scheduling: Waits for configured interval before next cycle
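
Step 2 (parallel API collection) can be sketched with a thread pool, so a failure in one source is recorded without blocking the others; the source names and fetch callables here are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetchers):
    # fetchers maps source name -> zero-argument fetch function.
    # Each source runs concurrently; an exception in one source is
    # captured per-source rather than aborting the whole collection.
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(fetchers) or 1) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, fut in futures.items():
            try:
                results[name] = fut.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors
```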

Monitoring and Metrics

Prometheus Metrics

The service exposes comprehensive metrics at /metrics:

# View metrics
curl http://localhost:8007/metrics

Key Metrics:

  • news_articles_total: Total articles collected
  • news_articles_by_source: Articles per source
  • news_collection_duration: Time taken for collection
  • news_collection_errors: Error count
  • news_collection_success: Success count

Health Checks

# Check service health
docker inspect sss_news_processor --format='{{.State.Health.Status}}'

# Check metrics endpoint
curl -f http://localhost:8007/metrics

Logging

The service provides structured JSON logging:

{
  "timestamp": "2024-01-15T12:00:00Z",
  "level": "INFO",
  "message": "News collection completed",
  "source": "LaunchLibrary2",
  "articles_collected": 15,
  "articles_new": 3,
  "duration": 2.5
}
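
Records in this shape can be produced with the standard `logging` module. The formatter below is a sketch (the service's actual logging setup may differ); per-record fields are passed through `extra`:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # Emits one JSON object per record; keys in record.fields
    # become top-level keys, matching the sample above.
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("news_processor")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("News collection completed",
            extra={"fields": {"source": "LaunchLibrary2", "articles_new": 3}})
```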

Error Handling

Common Issues

  1. Database Connection: Service retries connection with exponential backoff
  2. API Failures: Individual API failures don't stop the entire collection
  3. Data Parsing: Invalid data is logged and skipped
  4. Memory Issues: Large datasets are processed incrementally
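
Item 1 (retrying the database connection with exponential backoff) can be sketched as a small helper; the function and its defaults are illustrative, not the service's actual implementation:

```python
import time

def with_backoff(op, attempts=5, base_delay=1.0, cap=60.0, sleep=time.sleep):
    # Retry op() with exponentially growing delays (1s, 2s, 4s, ...),
    # capped at `cap`; the last failure is re-raised.
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(min(base_delay * 2 ** attempt, cap))
```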

Troubleshooting

# Check service logs
docker-compose logs news-processor

# Restart service
docker-compose restart news-processor

# Check database connectivity
docker exec -it sss_news_processor python -c "
import psycopg2
try:
    conn = psycopg2.connect(
        host='postgres',
        database='space_data',
        user='admin',
        password='psswd'
    )
    print('Database connection successful')
    conn.close()
except Exception as e:
    print(f'Database connection failed: {e}')
"

Integration

With FastAPI Service

The news processor feeds data to the FastAPI service:

  • News Operations: Articles available through /v1/news/* endpoints
  • Search: Full-text search capabilities
  • Analytics: Time-based analytics and statistics

With Other Services

  • TLE Processor: Independent service, no direct integration
  • Monitoring: Prometheus metrics available for monitoring
  • Logging: Centralized logging through Docker volumes

Performance Considerations

Optimization

  • Parallel Processing: Multiple APIs fetched concurrently
  • Connection Pooling: Database connections are pooled
  • Memory Management: Large datasets processed incrementally
  • Caching: Frequently accessed data cached in memory

Resource Usage

  • CPU: Low to moderate usage during collection
  • Memory: Minimal memory footprint (~100-200MB)
  • Storage: Log files and database storage
  • Network: Periodic API calls to news sources

Security

Data Protection

  • Encryption: Database connections use SSL/TLS
  • Authentication: Database credentials stored as environment variables
  • Access Control: Service runs with minimal required permissions

Network Security

  • Internal Communication: Service communicates only with PostgreSQL
  • External APIs: HTTPS connections to news APIs
  • Firewall: Service exposes only metrics port

Deployment

Docker Configuration

news-processor:
  build: ./news-processor
  container_name: sss_news_processor
  restart: unless-stopped
  ports:
    - "${NEWS_METRICS_PORT:-8007}:8000"
  volumes:
    - ./logs:/app/logs
  environment:
    - DB_HOST=${DB_HOST:-postgres}
    - DB_PORT=${DB_PORT:-5432}
    - POSTGRES_DB=${POSTGRES_DB}
    - POSTGRES_USER=${POSTGRES_USER}
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    - METRICS_PORT=8000
    - SLEEP_INTERVAL=${NEWS_SLEEP_INTERVAL:-1800}
    - RETRY_INTERVAL=${NEWS_RETRY_INTERVAL:-300}
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
  depends_on:
    postgres:
      condition: service_healthy

Production Considerations

  1. High Availability: Service automatically restarts on failure
  2. Data Backup: Regular database backups recommended
  3. Monitoring: Set up alerts for collection failures
  4. Scaling: Service can be scaled horizontally if needed
  5. Updates: Regular updates for security and features

API Integration

News Endpoints

The collected news data is accessible through FastAPI endpoints:

  • GET /v1/news/latest: Latest news articles
  • GET /v1/news/timerange: News within time range
  • GET /v1/news/search: Search news articles
  • GET /v1/news/analytics: News analytics and statistics
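
Querying these endpoints from a client is a plain HTTP GET; the helper below only builds the URLs (the base address and parameter names such as `q` and `limit` are assumptions about the FastAPI service):

```python
from urllib.parse import urlencode

API_BASE = "http://localhost:8080"  # FastAPI base URL is an assumption

def news_url(endpoint, **params):
    # Build a query URL for the /v1/news/* endpoints listed above,
    # dropping parameters left as None.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{API_BASE}/v1/news/{endpoint}"
    return f"{url}?{query}" if query else url
```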

Data Format

News articles are stored with the following structure:

{
  "id": "uuid",
  "title": "Article Title",
  "short_description": "Article description",
  "source": "LaunchLibrary2",
  "mission_name": "Mission Name",
  "date": "2024-01-15T12:00:00Z",
  "url": "https://example.com/article",
  "country_code": "US",
  "mission_description": "Mission description",
  "insertion_time": "2024-01-15T12:00:00Z"
}