Infrastructure Documentation¶

Overview¶

The EEMT infrastructure documentation provides comprehensive technical details about the system architecture, deployment configurations, and operational considerations for running EEMT at scale.

Documentation Sections¶

Container Architecture ¶

Detailed documentation of the Docker container ecosystem including: - Multi-layered container design - Image composition and dependencies - Volume management strategies - Network configuration - Security considerations - Performance optimization

Docker Deployment Guide ¶

Step-by-step instructions for deploying EEMT using Docker: - Prerequisites and installation - Quick start procedures - Deployment modes (local, distributed, documentation) - Configuration options - Monitoring and maintenance

Web Interface Architecture ¶

Technical architecture of the FastAPI web application: - System design and components - Request handling flow - Database architecture - Frontend implementation - Container orchestration - Performance considerations

Distributed Deployment ¶

Guide for scaling EEMT across multiple nodes: - Master-worker architecture - HPC integration examples - Container orchestration platforms - Network and storage configuration

Infrastructure Components¶

Container Stack¶

graph TD
    subgraph "EEMT Container Ecosystem"
        A[Ubuntu 24.04 Base]
        B[Scientific Stack Layer]
        C[EEMT Core Layer]
        D1[Web Interface Container]
        D2[Worker Container]
        D3[Documentation Container]

        A --> B
        B --> C
        C --> D1
        C --> D2
        A --> D3
    end

Key Technologies¶

Component	Technology	Version	Purpose
Base OS	Ubuntu	24.04 LTS	Container operating system
Container Runtime	Docker	20.10+	Container execution
Orchestration	Docker Compose	v2.0+	Multi-container management
Web Framework	FastAPI	0.100+	REST API and web interface
Workflow Engine	CCTools	7.8.2	Distributed task execution
GIS Engine	GRASS GIS	8.4+	Geospatial processing
Database	SQLite	3.x	Job tracking and persistence

Deployment Architecture¶

graph LR
    subgraph "User Access"
        U1[Web Browser]
        U2[REST API Client]
        U3[CLI Tools]
    end

    subgraph "Application Layer"
        W[Web Interface]
        A[API Gateway]
    end

    subgraph "Processing Layer"
        M[Master Node]
        W1[Worker 1]
        W2[Worker 2]
        WN[Worker N]
    end

    subgraph "Storage Layer"
        V1[Data Volumes]
        V2[Results Storage]
        DB[Database]
    end

    U1 --> W
    U2 --> A
    U3 --> A
    W --> M
    A --> M
    M --> W1
    M --> W2
    M --> WN
    W1 --> V1
    W2 --> V1
    WN --> V2
    M --> DB

Resource Requirements¶

Minimum Infrastructure¶

Resource	Minimum	Recommended	Notes
CPU	4 cores	8+ cores	More cores enable parallel processing
RAM	8 GB	16+ GB	2GB per worker thread
Storage	50 GB	200+ GB	Depends on dataset size
Network	10 Mbps	100+ Mbps	For climate data downloads
Docker	20.10	Latest stable	Required for container execution

Scaling Considerations¶

Vertical Scaling (Single Node)¶

Increase CPU cores for more parallel workers
Add RAM for larger datasets
Use SSD storage for improved I/O
GPU acceleration (future enhancement)

Horizontal Scaling (Multi-Node)¶

Deploy master node for coordination
Add worker nodes for processing
Use shared storage (NFS, S3)
Implement load balancing

Network Architecture¶

Port Allocations¶

Service	Port	Protocol	Purpose
Web Interface	5000	HTTP	Browser access
Work Queue	9123	TCP	Master-worker communication
Documentation	8000	HTTP	MkDocs server
Monitoring	9090	HTTP	Prometheus (future)
Database	5432	TCP	PostgreSQL (future)

Security Zones¶

graph TB
    subgraph "Public Zone"
        I[Internet]
        LB[Load Balancer]
    end

    subgraph "DMZ"
        WEB[Web Interface]
        API[API Gateway]
    end

    subgraph "Private Zone"
        MASTER[Master Node]
        WORKERS[Worker Pool]
        STORAGE[Storage]
        DB[Database]
    end

    I --> LB
    LB --> WEB
    LB --> API
    WEB --> MASTER
    API --> MASTER
    MASTER --> WORKERS
    WORKERS --> STORAGE
    MASTER --> DB

Storage Architecture¶

Volume Types¶

Volume	Type	Persistence	Purpose
uploads	Bind mount	Persistent	DEM file uploads
results	Bind mount	Persistent	Workflow outputs
temp	tmpfs	Ephemeral	Processing scratch
cache	Bind mount	Semi-persistent	Workflow caching
shared	NFS/S3	Persistent	Distributed storage

Data Flow¶

Input Stage: DEM files uploaded to uploads/ volume
Processing Stage: Temporary data in temp/ volume
Output Stage: Results written to results/ volume
Archive Stage: Results compressed and stored

Monitoring and Observability¶

Health Checks¶

# Docker health check configuration
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Metrics Collection¶

System Metrics: CPU, memory, disk, network
Application Metrics: Job count, processing time, success rate
Container Metrics: Resource usage, restart count
Custom Metrics: EEMT-specific calculations

Logging Strategy¶

Component	Log Location	Retention	Level
Web Interface	`/app/logs/`	7 days	INFO
Workers	Container stdout	24 hours	INFO
System	`/var/log/`	30 days	WARNING
Audit	Database	90 days	ALL

Disaster Recovery¶

Backup Strategy¶

Database Backups: Daily SQLite dumps
Volume Snapshots: Weekly filesystem snapshots
Configuration Backup: Version controlled in Git
Container Images: Registry backups

Recovery Procedures¶

Service Failure: Auto-restart via Docker
Node Failure: Failover to standby node
Data Loss: Restore from backups
Complete Disaster: Rebuild from infrastructure-as-code

Performance Tuning¶

Container Optimization¶

# Resource limits and reservations
deploy:
  resources:
    limits:
      cpus: '4.0'
      memory: 8G
    reservations:
      cpus: '2.0'
      memory: 4G

Network Optimization¶

Use bridge networks for local communication
Enable host networking for performance-critical workers
Implement connection pooling
Configure DNS caching

Storage Optimization¶

Use SSD for temporary processing
Enable compression for results
Implement data deduplication
Regular cleanup of temporary files

Best Practices¶

Deployment¶

Use infrastructure-as-code (Docker Compose, Kubernetes manifests)
Implement blue-green deployments
Maintain staging environments
Automate deployment pipelines

Security¶

Run containers as non-root users
Implement network segmentation
Enable TLS for all communications
Regular security updates

Operations¶

Monitor all critical metrics
Implement automated alerts
Maintain runbooks for common issues
Regular disaster recovery testing

Troubleshooting Guide¶

Common Issues¶

Issue	Cause	Solution
Container won't start	Missing image	Run `docker-compose build`
Out of memory	Resource limits	Increase memory allocation
Slow performance	I/O bottleneck	Use SSD storage
Network timeouts	Firewall rules	Check port accessibility
Job failures	Invalid parameters	Review parameter validation

Diagnostic Commands¶

# Check container status
docker ps -a

# View container logs
docker logs <container_name>

# Inspect container
docker inspect <container_name>

# Monitor resources
docker stats

# Network diagnostics
docker network ls
docker network inspect <network_name>

# Volume inspection
docker volume ls
docker volume inspect <volume_name>

Future Enhancements¶

Planned Infrastructure Improvements¶

Kubernetes Migration: Helm charts and operators
Service Mesh: Istio/Linkerd integration
Observability Stack: Prometheus + Grafana + Loki
CI/CD Pipeline: GitHub Actions + ArgoCD
Multi-Cloud Support: AWS, GCP, Azure deployments

Roadmap¶

Q1 2025: Kubernetes deployment support
Q2 2025: Enhanced monitoring and alerting
Q3 2025: Multi-region deployment
Q4 2025: Serverless execution options