Skip to content

Distributed Deployment

The EEMT workflow manager supports distributed execution across multiple virtual machines, bare metal servers, HPC systems, and HTC environments using a master/worker architecture.

Overview

The distributed deployment consists of:

  • Master/Foreman Node: Coordinates workflow execution, manages job queue, and distributes tasks
  • Worker Nodes: Execute individual workflow tasks using containerized environments
  • Shared Storage: Optional shared filesystem for large datasets
graph TB
    subgraph "Master Node"
        WM[Workflow Manager]
        WQ[Work Queue Master]
        WI[Web Interface]
        WM --> WQ
        WM --> WI
    end

    subgraph "Worker Nodes"
        W1[Worker 1<br/>VM/Server]
        W2[Worker 2<br/>HPC Node]
        W3[Worker 3<br/>Container]
        W4[Worker N<br/>Bare Metal]
    end

    subgraph "Storage"
        FS[Shared Filesystem<br/>NFS/CIFS/S3]
        LF[Local Storage]
    end

    WQ -.-> W1
    WQ -.-> W2
    WQ -.-> W3
    WQ -.-> W4

    W1 --> FS
    W2 --> LF
    W3 --> FS
    W4 --> LF

Deployment Types

1. Multi-VM Deployment

Use Case: Research clusters, cloud environments, multi-institutional collaboration

Setup: - Master VM with web interface and workflow coordination - Multiple worker VMs with container runtime - Shared storage via NFS, CIFS, or cloud storage

Benefits: - Easy scaling by adding VM instances - Isolated execution environments - Cloud-native deployments

2. HPC Integration

Use Case: University HPC clusters, national computing facilities

Setup: - Master node on login/head node - Workers on compute nodes via SLURM/PBS job submission - Shared filesystem (Lustre, GPFS, NFS)

Benefits: - Leverage existing HPC infrastructure - High-performance interconnects - Large-scale parallel execution

3. HTC Grid Computing

Use Case: Open Science Grid, campus grids, volunteer computing

Setup: - Master on submit node - Workers on grid compute nodes - Container-based execution for portability

Benefits: - Massive scale-out capability - Opportunistic resource usage - Heterogeneous resource pools

Prerequisites

Master Node Requirements

  • OS: Linux (Ubuntu 20.04+, CentOS 8+, RHEL 8+)
  • CPU: 4+ cores
  • Memory: 8GB+ RAM
  • Disk: 100GB+ storage
  • Network: Public IP or accessible from worker nodes
  • Software:
  • Docker or Podman
  • CCTools (Makeflow + Work Queue)
  • Python 3.11+

Worker Node Requirements

  • OS: Linux with container runtime OR CCTools installed
  • CPU: 4+ cores (configurable)
  • Memory: 8GB+ RAM (configurable)
  • Disk: 50GB+ temporary storage (configurable)
  • Network: Access to master node
  • Software:
  • Docker/Podman
  • CCTools work_queue_worker
  • Optional: EEMT container image (for offline environments)

Installation

Quick Start with Docker Compose

The easiest way to deploy a distributed EEMT cluster:

# Clone repository
git clone https://github.com/tyson-swetnam/eemt.git
cd eemt

# Start distributed cluster (master + workers)
docker-compose --profile distributed up

# Scale workers as needed
docker-compose --profile distributed up --scale eemt-worker-2=3 --scale eemt-worker-3=2

# Access web interface at http://localhost:5000
# Monitor cluster at http://localhost:5000/monitor

Distributed Docker Deployment

1. Master Node Container
# Start master node with web interface
docker run -d \
  --name eemt-master \
  -p 5000:5000 \
  -p 9123:9123 \
  -v $(pwd)/data:/app/data \
  -e EEMT_MODE=master \
  -e WORK_QUEUE_PORT=9123 \
  -e WORK_QUEUE_PROJECT=EEMT-Cluster \
  eemt-web
2. Worker Node Containers
# Start worker containers on same host
docker run -d \
  --name eemt-worker-1 \
  -e MASTER_HOST=eemt-master \
  -e MASTER_PORT=9123 \
  -e WORKER_CORES=4 \
  -e WORKER_MEMORY=8G \
  --network container:eemt-master \
  eemt:ubuntu24.04

# Start workers on remote hosts
docker run -d \
  --name eemt-worker-remote \
  -e MASTER_HOST=MASTER_IP_ADDRESS \
  -e MASTER_PORT=9123 \
  -e WORKER_CORES=8 \
  -e WORKER_MEMORY=16G \
  eemt:ubuntu24.04
3. Custom Docker Compose

Create docker-compose.production.yml:

version: '3.8'
services:
  eemt-master:
    image: eemt-web:latest
    ports:
      - "5000:5000"
      - "9123:9123"
    volumes:
      - /shared/eemt/data:/app/data
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - EEMT_MODE=master
      - WORK_QUEUE_PROJECT=EEMT-Production
      - MAX_WORKERS=100
    restart: unless-stopped

  eemt-worker:
    image: eemt:ubuntu24.04
    environment:
      - MASTER_HOST=eemt-master
      - MASTER_PORT=9123
      - WORKER_CORES=8
      - WORKER_MEMORY=16G
      - WORKER_DISK=100G
    depends_on:
      - eemt-master
    restart: unless-stopped
    deploy:
      replicas: 5

Traditional Installation

For environments requiring manual setup:

1. Install Dependencies

# Ubuntu/Debian
sudo apt update
sudo apt install -y docker.io docker-compose python3-pip git

# CentOS/RHEL
sudo dnf install -y docker docker-compose python3-pip git
sudo systemctl enable --now docker

# Add user to docker group
sudo usermod -aG docker $USER
# Log out and back in for group changes to take effect

2. Clone and Build

# Clone repository
git clone https://github.com/tyson-swetnam/eemt.git
cd eemt

# Build containers
docker build -t eemt:ubuntu24.04 -f docker/ubuntu/24.04/Dockerfile .
docker build -t eemt-web -f docker/web-interface/Dockerfile .

3. Start Master Node

# Using Docker
docker run -d \
  --name eemt-master \
  -p 5000:5000 \
  -p 9123:9123 \
  -v $(pwd)/data:/app/data \
  eemt-web

# Using Python scripts (requires manual CCTools installation)
python scripts/start-master.py \
    --work-dir /data/eemt-master \
    --port 9123 \
    --project EEMT-Production \
    --web-interface \
    --web-port 5000

Worker Node Setup

1. Single Worker Container
# Connect to remote master
docker run -d \
  --name eemt-worker \
  -e MASTER_HOST=MASTER_IP_ADDRESS \
  -e MASTER_PORT=9123 \
  -e WORKER_CORES=8 \
  -e WORKER_MEMORY=16G \
  -e WORKER_DISK=100G \
  eemt:ubuntu24.04 \
  python /scripts/start-worker.py \
    --master-host $MASTER_HOST \
    --master-port $MASTER_PORT \
    --cores $WORKER_CORES \
    --memory $WORKER_MEMORY \
    --disk $WORKER_DISK
2. Multiple Workers (Docker Swarm)
# Initialize Docker Swarm (on manager node)
docker swarm init --advertise-addr MANAGER_IP

# Create worker service
docker service create \
  --name eemt-workers \
  --replicas 10 \
  --env MASTER_HOST=MASTER_IP \
  --env MASTER_PORT=9123 \
  --env WORKER_CORES=4 \
  --env WORKER_MEMORY=8G \
  eemt:ubuntu24.04

# Join worker nodes to swarm
docker swarm join --token WORKER_TOKEN MANAGER_IP:2377
3. Kubernetes Deployment

Create k8s-eemt-workers.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: eemt-workers
spec:
  replicas: 20
  selector:
    matchLabels:
      app: eemt-worker
  template:
    metadata:
      labels:
        app: eemt-worker
    spec:
      containers:
      - name: eemt-worker
        image: eemt:ubuntu24.04
        env:
        - name: MASTER_HOST
          value: "eemt-master-service"
        - name: MASTER_PORT
          value: "9123"
        - name: WORKER_CORES
          value: "4"
        - name: WORKER_MEMORY
          value: "8G"
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
          limits:
            memory: "16Gi"
            cpu: "8"
        command: ["python", "/scripts/start-worker.py"]
        args: ["--master-host", "$(MASTER_HOST)", "--master-port", "$(MASTER_PORT)"]

Traditional Worker Setup

For environments without container orchestration:

1. Install Dependencies
# Ubuntu/Debian
sudo apt update
sudo apt install -y docker.io python3-pip

# CentOS/RHEL  
sudo dnf install -y docker python3-pip
sudo systemctl enable --now docker

# For HPC environments without root access
wget https://github.com/cooperative-computing-lab/cctools/releases/download/release%2F7.8.2/cctools-7.8.2-x86_64-linux.tar.gz
tar xzf cctools-7.8.2-x86_64-linux.tar.gz
export PATH=$PWD/cctools-7.8.2-x86_64-linux/bin:$PATH
2. Pull Container Image
# Option A: Pull from registry
docker pull eemt:ubuntu24.04

# Option B: Transfer from master node
docker save eemt:ubuntu24.04 | gzip > eemt-container.tar.gz
scp eemt-container.tar.gz worker-node:
# On worker node:
gunzip -c eemt-container.tar.gz | docker load
3. Start Worker Node
# Containerized worker
docker run -d \
  --name eemt-worker \
  -v /tmp:/tmp \
  -e MASTER_HOST=MASTER_IP \
  -e MASTER_PORT=9123 \
  eemt:ubuntu24.04 \
  python /scripts/start-worker.py \
    --master-host $MASTER_HOST \
    --master-port $MASTER_PORT \
    --cores 8 --memory 16G

# Native worker (requires CCTools installation)
python scripts/start-worker.py \
    --master-host MASTER_IP \
    --master-port 9123 \
    --cores 8 --memory 16G

HPC Integration Examples

SLURM Docker Integration

Option 1: Containerized Workers

Create eemt-docker-worker.sbatch:

#!/bin/bash
#SBATCH --job-name=eemt-docker-worker
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --time=4:00:00
#SBATCH --partition=compute

# Load Docker module
module load docker/20.10

# Set master information
export MASTER_HOST=login-node.hpc.example.edu
export MASTER_PORT=9123

# Start containerized worker
docker run --rm \
  --name eemt-worker-$SLURM_JOB_ID \
  -e MASTER_HOST=$MASTER_HOST \
  -e MASTER_PORT=$MASTER_PORT \
  -e WORKER_CORES=$SLURM_CPUS_PER_TASK \
  -e WORKER_MEMORY=${SLURM_MEM_PER_NODE}M \
  -v $TMPDIR:/tmp \
  eemt:ubuntu24.04 \
  python /scripts/start-worker.py \
    --master-host $MASTER_HOST \
    --master-port $MASTER_PORT \
    --cores $SLURM_CPUS_PER_TASK \
    --memory ${SLURM_MEM_PER_NODE}M \
    --work-dir /tmp/eemt-worker-$SLURM_JOB_ID

Submit Multiple Workers

# Submit 20 Docker workers
for i in {1..20}; do
    sbatch eemt-docker-worker.sbatch
done

# Submit with array jobs
sbatch --array=1-20 eemt-docker-worker.sbatch

PBS/Torque Docker Example

Create eemt-docker-worker.pbs:

#!/bin/bash
#PBS -N eemt-docker-worker
#PBS -l nodes=1:ppn=8
#PBS -l mem=32gb
#PBS -l walltime=04:00:00
#PBS -q compute

cd $PBS_O_WORKDIR

# Load Docker module
module load docker/20.10

# Start containerized worker
docker run --rm \
  --name eemt-worker-$PBS_JOBID \
  -e MASTER_HOST=$MASTER_HOST \
  -e MASTER_PORT=9123 \
  -e WORKER_CORES=8 \
  -e WORKER_MEMORY=32G \
  -v $TMPDIR:/tmp \
  eemt:ubuntu24.04 \
  python /scripts/start-worker.py \
    --master-host $MASTER_HOST \
    --master-port 9123 \
    --cores 8 \
    --memory 32G \
    --work-dir /tmp/eemt-worker-$PBS_JOBID

LSF Docker Example

Create eemt-docker-worker.lsf:

#!/bin/bash
#BSUB -J eemt-docker-worker
#BSUB -n 8
#BSUB -R "rusage[mem=4096]"
#BSUB -W 4:00
#BSUB -o eemt-worker.%J.out
#BSUB -e eemt-worker.%J.err

# Load Docker
module load docker

# Start containerized worker
docker run --rm \
  --name eemt-worker-$LSB_JOBID \
  -e MASTER_HOST=$MASTER_HOST \
  -e MASTER_PORT=9123 \
  -e WORKER_CORES=$LSB_MAX_NUM_PROCESSORS \
  -e WORKER_MEMORY=${LSB_MAX_MEM_POLICY}M \
  -v $TMPDIR:/tmp \
  eemt:ubuntu24.04 \
  python /scripts/start-worker.py \
    --master-host $MASTER_HOST \
    --master-port 9123 \
    --cores $LSB_MAX_NUM_PROCESSORS \
    --memory ${LSB_MAX_MEM_POLICY}M

Monitoring and Management

Docker-Based Monitoring

Master Node Monitoring

# Monitor master container
docker logs -f eemt-master

# Check Work Queue status
docker exec eemt-master work_queue_status -M EEMT

# Web interface monitoring
firefox http://master-node:5000/monitor

# Container resource usage
docker stats eemt-master

Worker Monitoring

# List all worker containers
docker ps --filter name=eemt-worker

# Monitor worker logs
docker logs -f eemt-worker-1

# Check worker connectivity (container)
docker run --rm eemt:ubuntu24.04 \
  python /scripts/start-worker.py \
  --master-host MASTER_IP --dry-run

# Resource monitoring across containers
docker stats $(docker ps --filter name=eemt-worker --format "{{.Names}}")

Cluster Status Commands

# Show detailed cluster status
docker exec eemt-master work_queue_status -M EEMT --verbose

# Show worker locations and resources
docker exec eemt-master work_queue_status -M EEMT --workers --resources

# Historical statistics
docker exec eemt-master work_queue_status -M EEMT --stats

# Container orchestration status
docker service ls  # Docker Swarm
kubectl get pods   # Kubernetes

Docker Compose Monitoring

# View all services
docker-compose ps

# Follow logs for specific service
docker-compose logs -f eemt-master
docker-compose logs -f eemt-worker

# Scale workers dynamically
docker-compose up --scale eemt-worker=10

# View resource usage
docker-compose top

Configuration Files

Master Configuration

Create /etc/eemt/master.conf:

[master]
port = 9123
project = EEMT-Production
max_workers = 100
worker_timeout = 300
heartbeat_interval = 30

[web_interface]
enabled = true
port = 5000
host = 0.0.0.0

[security]
password_file = /etc/eemt/cluster.password
ssl_cert = /etc/eemt/ssl/cert.pem
ssl_key = /etc/eemt/ssl/key.pem

[storage]
shared_dir = /shared/eemt
results_dir = /data/eemt/results
uploads_dir = /data/eemt/uploads

Worker Configuration

Create /etc/eemt/worker.conf:

[worker]
master_host = cluster-master.example.com
master_port = 9123
cores = auto
memory = auto
disk = auto

[runtime]
work_dir = /tmp/eemt-worker
container_runtime = docker
reconnect_attempts = 5
reconnect_delay = 10

[logging]
level = INFO
file = /var/log/eemt/worker.log

Security Considerations

Network Security

  • Firewall Rules: Open Work Queue port (9123) between master and workers
  • VPN/Private Networks: Use private networks for sensitive data
  • SSH Tunneling: Secure connections through SSH tunnels if needed
# SSH tunnel example for secure connection
ssh -L 9123:master-internal:9123 gateway.example.com
python scripts/start-worker.py --master-host localhost --master-port 9123

Authentication

  • Password Protection: Use strong Work Queue passwords
  • Certificate-based: Implement SSL/TLS for web interface
  • Network Isolation: Restrict access to trusted networks
# Generate secure password
openssl rand -base64 32 > ~/.eemt-makeflow-password
chmod 600 ~/.eemt-makeflow-password

Troubleshooting

Common Issues

Master Not Starting

# Check port availability
netstat -an | grep 9123

# Check CCTools installation
work_queue_worker --version
makeflow --version

# Check Docker availability
docker info

# View detailed logs
python scripts/start-master.py --log-level DEBUG

Workers Not Connecting

# Test connectivity
telnet master-host 9123
ping master-host

# Check firewall
sudo ufw status
sudo iptables -L

# Verify CCTools
which work_queue_worker
work_queue_worker --help

# Test with verbose logging
python scripts/start-worker.py --master-host MASTER --log-level DEBUG

Performance Issues

# Monitor system resources
htop
iostat -x 1
df -h

# Check network bandwidth
iperf3 -c master-host

# Monitor container resource usage
docker stats

# Work Queue statistics
work_queue_status -M EEMT --resources --verbose

Container Issues

# Check Docker daemon
sudo systemctl status docker

# Pull latest container
docker pull eemt:ubuntu24.04

# Test container manually
docker run -it --rm eemt:ubuntu24.04 /bin/bash

# Check container logs
docker logs CONTAINER_ID

Best Practices

Resource Management

  • CPU Allocation: Leave 1-2 cores for system processes
  • Memory Limits: Reserve 2GB for OS and Docker
  • Disk Space: Ensure adequate temporary space for workflows
  • Network: Use high-bandwidth connections for large datasets

Deployment Strategy

  • Start Small: Begin with 2-3 workers, scale gradually
  • Monitor Performance: Track task completion rates and resource usage
  • Load Balancing: Distribute workers across different hardware types
  • Fault Tolerance: Plan for worker failures and network interruptions

Maintenance

  • Regular Updates: Keep container images and CCTools updated
  • Log Rotation: Implement log rotation to manage disk space
  • Monitoring: Set up alerts for master/worker failures
  • Backup: Regular backup of configuration and critical data

Advanced Configurations

Multi-Master Setup

For high availability, deploy multiple master nodes:

# Master 1
python scripts/start-master.py --port 9123 --project EEMT-Primary

# Master 2 (backup)
python scripts/start-master.py --port 9124 --project EEMT-Backup

# Workers connect to both
python scripts/start-worker.py --master-host master1 --master-port 9123 &
python scripts/start-worker.py --master-host master2 --master-port 9124 &

Heterogeneous Workers

Deploy different worker types for different tasks:

# CPU-intensive workers
python scripts/start-worker.py --master-host MASTER --cores 16 --memory 64G

# Memory-intensive workers  
python scripts/start-worker.py --master-host MASTER --cores 8 --memory 128G

# GPU workers (future)
python scripts/start-worker.py --master-host MASTER --cores 8 --memory 32G --gpu 1

Container Orchestration

Integration with Kubernetes, Docker Swarm, or other orchestrators:

# Kubernetes example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eemt-workers
spec:
  replicas: 10
  selector:
    matchLabels:
      app: eemt-worker
  template:
    metadata:
      labels:
        app: eemt-worker
    spec:
      containers:
      - name: eemt-worker
        image: eemt:ubuntu24.04
        command: ["python", "/scripts/start-worker.py"]
        args: ["--master-host", "eemt-master-service", "--master-port", "9123"]
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
          limits:
            memory: "16Gi"
            cpu: "8"

This distributed deployment guide provides comprehensive coverage for deploying EEMT across various computing environments while maintaining security, performance, and reliability.