# News Processor Service

## Overview

The News Processor Service is a containerized Python application that continuously fetches space-related news articles from multiple sources and stores them in a PostgreSQL database. It provides monitoring, structured logging, and metrics collection.
## What This Service Does
- Multi-source data collection: Fetches news from LaunchLibrary2, RocketLaunch.Live, and Spaceflight News API
- Automatic deduplication: Prevents duplicate articles using title, source, and date constraints
- Structured logging: JSON-formatted logs with structured data
- Prometheus metrics: Comprehensive metrics for monitoring and alerting
- Graceful shutdown: Handles SIGTERM and SIGINT signals properly
- Error handling: Robust error handling with retry mechanisms
- Database integration: Uses existing PostgreSQL service with automatic table creation
- Slack notifications: Real-time alerts for service status, errors, and data collection results
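The graceful-shutdown item above can be sketched in Python; this is a minimal illustration (the `shutdown_event` and `handle_shutdown` names are hypothetical, not the service's actual code):

```python
import signal
import threading

# Flag the main loop checks between cycles; set by the signal handler.
shutdown_event = threading.Event()

def handle_shutdown(signum, frame):
    """Request a graceful stop; the current cycle finishes before exit."""
    shutdown_event.set()

# Register the same handler for both termination signals.
signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
```

Using an `Event` rather than `sys.exit()` inside the handler lets an in-flight collection cycle complete before the process exits.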
## How It Works

1. Startup: Service initializes the database connection and creates required tables
2. Data Collection: Fetches news from multiple APIs in parallel
3. Deduplication: Checks for existing articles before insertion
4. Storage: Stores articles in PostgreSQL with TimescaleDB features
5. Metrics: Collects and exposes Prometheus metrics
6. Notifications: Sends Slack notifications for status and errors
7. Scheduling: Runs continuously with configurable intervals
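Taken together, these steps amount to a collect/sleep loop. A minimal sketch, assuming the documented `SLEEP_INTERVAL` and `RETRY_INTERVAL` semantics (`run_forever` and its arguments are illustrative, not the service's actual code):

```python
import os
import time

SLEEP_INTERVAL = int(os.getenv("SLEEP_INTERVAL", "1800"))
RETRY_INTERVAL = int(os.getenv("RETRY_INTERVAL", "300"))

def run_forever(collect, should_stop, sleep=time.sleep):
    """Run collection cycles until should_stop() returns True.

    Sleeps SLEEP_INTERVAL after a successful cycle and falls back to the
    shorter RETRY_INTERVAL when a cycle raises, as described above.
    """
    while not should_stop():
        try:
            collect()
            interval = SLEEP_INTERVAL
        except Exception:
            interval = RETRY_INTERVAL
        sleep(interval)
```

Injecting `should_stop` and `sleep` keeps the loop testable and lets a shutdown flag interrupt the schedule.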
## Data Sources

### LaunchLibrary2

- URL: https://ll.thespacedevs.com/2.2.0/launch/upcoming/
- Data: Upcoming rocket launches with mission details
- Update frequency: Every 30 minutes

### RocketLaunch.Live

- URL: https://fdo.rocketlaunch.live/json/launches/next/20
- Data: Next 20 rocket launches with detailed information
- Update frequency: Every 30 minutes

### Spaceflight News API

- URL: https://api.spaceflightnewsapi.net/v4/articles/
- Data: General spaceflight news articles
- Update frequency: Every 30 minutes
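As an example of per-source normalization, here is a hypothetical mapping of one Spaceflight News API v4 article onto the table's columns (field names follow the v4 response shape, where articles sit under `results` with `title`, `summary`, `published_at`, and `url`; verify against the live API):

```python
def normalize_sfn_article(item):
    """Map one Spaceflight News API v4 article onto the table's columns.

    The source label "SpaceflightNews" and the field names here are
    assumptions for illustration; adjust to the real response.
    """
    return {
        "title": item.get("title"),
        "short_description": item.get("summary"),
        "source": "SpaceflightNews",
        "date": item.get("published_at"),
        "url": item.get("url"),
    }
```

Each source would get its own small mapper like this, so the storage layer only ever sees one row shape.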
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| DB_HOST | postgres | PostgreSQL host |
| DB_PORT | 5432 | PostgreSQL port |
| POSTGRES_DB | space_data | Database name |
| POSTGRES_USER | admin | Database username |
| POSTGRES_PASSWORD | psswd | Database password |
| METRICS_PORT | 8000 | Prometheus metrics port |
| SLEEP_INTERVAL | 1800 | Sleep interval between cycles (seconds) |
| RETRY_INTERVAL | 300 | Retry interval on errors (seconds) |
| SLACK_WEBHOOK_URL | - | Slack webhook URL for notifications |
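Reading these variables with the documented defaults might look like this (the `load_config` helper is illustrative, not the service's actual code):

```python
import os

def load_config(env=os.environ):
    """Read service settings, falling back to the documented defaults."""
    return {
        "db_host": env.get("DB_HOST", "postgres"),
        "db_port": int(env.get("DB_PORT", "5432")),
        "db_name": env.get("POSTGRES_DB", "space_data"),
        "db_user": env.get("POSTGRES_USER", "admin"),
        "metrics_port": int(env.get("METRICS_PORT", "8000")),
        "sleep_interval": int(env.get("SLEEP_INTERVAL", "1800")),
        "retry_interval": int(env.get("RETRY_INTERVAL", "300")),
        # No default: notifications are silently disabled when unset.
        "slack_webhook_url": env.get("SLACK_WEBHOOK_URL"),
    }
```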
## Database Schema

The service creates and manages the `space_news_prod` table:

```sql
CREATE TABLE space_news_prod (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title VARCHAR(500) NOT NULL,
    short_description TEXT,
    source VARCHAR(100) NOT NULL,
    mission_name VARCHAR(200),
    date TIMESTAMP WITH TIME ZONE,
    url TEXT,
    slug VARCHAR(200),
    last_updated TIMESTAMP WITH TIME ZONE,
    country_code VARCHAR(10),
    mission_description TEXT,
    off_url TEXT,
    insertion_time TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(title, source, date)
);
```
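The `UNIQUE(title, source, date)` constraint is what makes deduplication cheap: an insert can use `ON CONFLICT ... DO NOTHING` and count affected rows. A sketch against any DB-API connection such as psycopg2 (helper names are hypothetical, not the service's actual code):

```python
# Deduplicating insert: PostgreSQL skips rows that would violate the
# UNIQUE(title, source, date) constraint instead of raising an error.
INSERT_SQL = """
INSERT INTO space_news_prod (title, short_description, source, date, url)
VALUES (%(title)s, %(short_description)s, %(source)s, %(date)s, %(url)s)
ON CONFLICT (title, source, date) DO NOTHING
"""

def insert_articles(conn, articles):
    """Insert articles via a DB-API connection; return how many were new."""
    new = 0
    with conn.cursor() as cur:
        for article in articles:
            cur.execute(INSERT_SQL, article)
            new += cur.rowcount  # rowcount is 0 when the row was a duplicate
    conn.commit()
    return new
```

The per-row `rowcount` also yields the "new vs. already seen" numbers that show up in the logs and metrics.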
## Usage

### Start the Service

```bash
# Build and start all services
docker-compose up -d

# Start only the news processor
docker-compose up -d news-processor
```

### Monitor Logs

```bash
# Follow live logs
docker-compose logs -f news-processor

# View a specific log file
tail -f logs/news_processor_$(date +%Y-%m-%d).log
```

### Check Service Status

```bash
# Check if the service is running
docker-compose ps news-processor

# Check the metrics endpoint
curl http://localhost:8007/metrics
```
## Data Flow

1. Initialization: Service connects to PostgreSQL and creates required tables
2. API Collection: Fetches data from multiple news sources in parallel
3. Data Processing: Parses and normalizes article data
4. Deduplication: Checks for existing articles using title, source, and date
5. Storage: Inserts new articles into the PostgreSQL database
6. Metrics: Updates Prometheus metrics with collection statistics
7. Notifications: Sends Slack notifications for status and errors
8. Scheduling: Waits for the configured interval before the next cycle
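The notification step can be sketched with Slack's standard incoming-webhook payload, a minimal `{"text": ...}` JSON body posted to the URL from `SLACK_WEBHOOK_URL` (function names are illustrative, not the service's actual code):

```python
import json
import urllib.request

def build_slack_payload(message):
    """Slack incoming webhooks accept a minimal {"text": ...} JSON body."""
    return json.dumps({"text": message}).encode("utf-8")

def send_slack_notification(webhook_url, message):
    """POST the message to the webhook; returns True on HTTP 200."""
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status == 200
```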
## Monitoring and Metrics

### Prometheus Metrics

The service exposes metrics at `/metrics`:

```bash
# View metrics
curl http://localhost:8007/metrics
```

Key metrics:

- `news_articles_total`: Total articles collected
- `news_articles_by_source`: Articles per source
- `news_collection_duration`: Time taken for collection
- `news_collection_errors`: Error count
- `news_collection_success`: Success count
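With `prometheus_client`, metrics like those above could be declared as follows. This is a sketch: the label set is an assumption, and note that the client appends a `_total` suffix to counter sample names on the `/metrics` page:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric names mirror the list above; the "source" label is an assumption.
ARTICLES_TOTAL = Counter("news_articles_total", "Total articles collected")
ARTICLES_BY_SOURCE = Counter(
    "news_articles_by_source", "Articles per source", ["source"]
)
COLLECTION_DURATION = Histogram(
    "news_collection_duration", "Time taken for collection (seconds)"
)
COLLECTION_ERRORS = Counter("news_collection_errors", "Error count")

def record_collection(source, count, duration):
    """Update the metrics after one source finishes a collection pass."""
    ARTICLES_TOTAL.inc(count)
    ARTICLES_BY_SOURCE.labels(source=source).inc(count)
    COLLECTION_DURATION.observe(duration)
```

Serving them is one call, e.g. `start_http_server(8000)` at startup, which matches the container's internal `METRICS_PORT`.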
### Health Checks

```bash
# Check service health
docker inspect sss_news_processor --format='{{.State.Health.Status}}'

# Check the metrics endpoint
curl -f http://localhost:8007/metrics
```
### Logging

The service writes structured JSON logs:

```json
{
  "timestamp": "2024-01-15T12:00:00Z",
  "level": "INFO",
  "message": "News collection completed",
  "source": "LaunchLibrary2",
  "articles_collected": 15,
  "articles_new": 3,
  "duration": 2.5
}
```
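One way to produce records like this is a custom `logging.Formatter` that serializes to JSON; a minimal sketch (the fixed field list is an assumption about which structured fields the service attaches via `extra=`):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Structured fields the service passes via `extra=`; an assumption.
    EXTRA_FIELDS = ("source", "articles_collected", "articles_new", "duration")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attributes set through `extra=` land directly on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)
```

Attaching it to a handler (`handler.setFormatter(JsonFormatter())`) converts every log line without touching call sites.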
## Error Handling

### Common Issues

- Database Connection: Service retries the connection with exponential backoff
- API Failures: Individual API failures don't stop the entire collection
- Data Parsing: Invalid data is logged and skipped
- Memory Issues: Large datasets are processed incrementally
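The exponential-backoff retry for the database connection can be sketched as follows (`connect_with_backoff` is a hypothetical helper, not the service's actual code):

```python
import time

def connect_with_backoff(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `connect` with exponentially growing delays: 1s, 2s, 4s, ...

    Re-raises the last error once the attempt budget is exhausted.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```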
### Troubleshooting

```bash
# Check service logs
docker-compose logs news-processor

# Restart the service
docker-compose restart news-processor

# Check database connectivity
docker exec -it sss_news_processor python -c "
import psycopg2
try:
    conn = psycopg2.connect(
        host='postgres',
        database='space_data',
        user='admin',
        password='psswd'
    )
    print('Database connection successful')
    conn.close()
except Exception as e:
    print(f'Database connection failed: {e}')
"
```
## Integration

### With FastAPI Service

The news processor feeds data to the FastAPI service:

- News Operations: Articles available through `/v1/news/*` endpoints
- Search: Full-text search capabilities
- Analytics: Time-based analytics and statistics

### With Other Services

- TLE Processor: Independent service, no direct integration
- Monitoring: Prometheus metrics available for monitoring
- Logging: Centralized logging through Docker volumes
## Performance Considerations

### Optimization

- Parallel Processing: Multiple APIs fetched concurrently
- Connection Pooling: Database connections are pooled
- Memory Management: Large datasets processed incrementally
- Caching: Frequently accessed data cached in memory
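The concurrent fetch can be sketched with a thread pool, capturing failures per source so one bad API does not stop the rest (`fetch_all` and its 30-second timeout are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetchers):
    """Run each source's zero-argument fetch function concurrently.

    `fetchers` maps a source name to a callable; exceptions are captured
    per source instead of aborting the whole collection pass.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=30)
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

Threads fit here because the work is I/O-bound HTTP waiting, not CPU-bound parsing.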
### Resource Usage

- CPU: Low to moderate usage during collection
- Memory: Minimal footprint (~100-200 MB)
- Storage: Log files and database storage
- Network: Periodic API calls to news sources
## Security

### Data Protection

- Encryption: Database connections use SSL/TLS
- Authentication: Database credentials stored as environment variables
- Access Control: Service runs with minimal required permissions

### Network Security

- Internal Communication: Service communicates only with PostgreSQL
- External APIs: HTTPS connections to news APIs
- Firewall: Service exposes only the metrics port
## Deployment

### Docker Configuration

```yaml
news-processor:
  build: ./news-processor
  container_name: sss_news_processor
  restart: unless-stopped
  ports:
    - "${NEWS_METRICS_PORT:-8007}:8000"
  volumes:
    - ./logs:/app/logs
  environment:
    - DB_HOST=${DB_HOST:-postgres}
    - DB_PORT=${DB_PORT:-5432}
    - POSTGRES_DB=${POSTGRES_DB}
    - POSTGRES_USER=${POSTGRES_USER}
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    - METRICS_PORT=8000
    - SLEEP_INTERVAL=${NEWS_SLEEP_INTERVAL:-1800}
    - RETRY_INTERVAL=${NEWS_RETRY_INTERVAL:-300}
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
  depends_on:
    postgres:
      condition: service_healthy
```
### Production Considerations

- High Availability: Service automatically restarts on failure
- Data Backup: Regular database backups recommended
- Monitoring: Set up alerts for collection failures
- Scaling: Service can be scaled horizontally if needed
- Updates: Regular updates for security and features
## API Integration

### News Endpoints

The collected news data is accessible through FastAPI endpoints:

- `GET /v1/news/latest`: Latest news articles
- `GET /v1/news/timerange`: News within a time range
- `GET /v1/news/search`: Search news articles
- `GET /v1/news/analytics`: News analytics and statistics
### Data Format

News articles are stored with the following structure:

```json
{
  "id": "uuid",
  "title": "Article Title",
  "short_description": "Article description",
  "source": "LaunchLibrary2",
  "mission_name": "Mission Name",
  "date": "2024-01-15T12:00:00Z",
  "url": "https://example.com/article",
  "country_code": "US",
  "mission_description": "Mission description",
  "insertion_time": "2024-01-15T12:00:00Z"
}
```