# Hadoop + Spark Cluster with Docker Compose

This project provides a local Hadoop HDFS + Apache Spark cluster using Docker Compose.
It is meant for development, testing, and learning; it is not production-ready.
## Services

### Hadoop
- `hadoop-namenode` (port `9870`) – HDFS master node & UI
- `hadoop-datanode-1`, `hadoop-datanode-2` – HDFS workers storing blocks
### Spark
- `spark-master` (ports `8080`, `7077`) – Spark master node & UI
- `spark-worker` – Spark worker(s) attached to the cluster
- `spark-history` (port `18080`) – Spark History Server for job monitoring
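
Once the stack is up, a quick way to confirm these Compose services are running and see their published ports is plain Docker Compose tooling (the service names are the ones listed above):

```bash
# List the running services and their published ports
docker compose ps

# Tail a specific service's logs if something looks unhealthy
docker compose logs -f hadoop-namenode
```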
## Setup & Usage

### 1. Start Hadoop (NameNode + DataNodes)

```bash
make build-docker-image
```

- Spins up the Hadoop containers
- Waits until HDFS is healthy and available (a manual health check is sketched below)
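
If the wait step seems stuck, a minimal manual check is to ask the NameNode for a cluster report or poll its UI; this assumes the `hadoop-namenode` service name from above and a standard `hdfs` client inside the container:

```bash
# Both DataNodes should be reported as live
docker compose exec hadoop-namenode hdfs dfsadmin -report

# Or poll the NameNode web UI from the host
curl -sf http://localhost:9870 >/dev/null && echo "NameNode UI is up"
```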
### 2. Create Spark filesystem directories in HDFS

```bash
make create-spark-fs
```

Creates:

- `/sparkfs/eventlog` (for Spark event logs, mode 1777)
- `/sparkfs/warehouse` (for Spark SQL, mode 775)
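
`make create-spark-fs` is the supported way to do this; as a rough sketch of the equivalent manual HDFS commands (run via the NameNode container, with the paths and modes listed above), it amounts to something like:

```bash
# Create the Spark directories in HDFS and apply the modes described above
docker compose exec hadoop-namenode hdfs dfs -mkdir -p /sparkfs/eventlog /sparkfs/warehouse
docker compose exec hadoop-namenode hdfs dfs -chmod 1777 /sparkfs/eventlog
docker compose exec hadoop-namenode hdfs dfs -chmod 775 /sparkfs/warehouse

# Verify the layout
docker compose exec hadoop-namenode hdfs dfs -ls /sparkfs
```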
### 3. Start Spark only (master, workers, History Server)

Hadoop has to be initialised already (steps 1 and 2) before running this target.

```bash
make run-spark
```

- Starts the Spark master, 2 workers (`--scale spark-worker=2`), and the History Server
- Mounts configuration from `./configs/spark/spark-defaults.conf`
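
To confirm that both workers registered with the master, you can query the standalone master's status; the master UI (port 8080 as above) also serves a JSON view at `/json`, which pairs well with `jq` if you have it installed:

```bash
# JSON view of the standalone master's state (workers, cores, running apps)
curl -s http://localhost:8080/json

# Count registered workers (should print 2 with --scale spark-worker=2)
curl -s http://localhost:8080/json | jq '.workers | length'
```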
### 4. Run the full stack (Hadoop + Spark)

```bash
make run
```

After startup, the UIs are available at:

- Hadoop NameNode UI: http://localhost:9870
- Spark Master UI: http://localhost:8080
- Spark History Server: http://localhost:18080
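
As a quick end-to-end smoke test you could submit the bundled SparkPi example from inside the `spark-master` container. The Spark install path and examples jar location below are assumptions about the image and may need adjusting; the master URL uses the service name and port from this README:

```bash
# Submit the SparkPi example to the standalone master.
# /opt/spark is an assumed install path; adjust it for your Spark image.
docker compose exec spark-master bash -c '
  /opt/spark/bin/spark-submit \
    --master spark://spark-master:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_*.jar 100'
```

If event logging is enabled in `./configs/spark/spark-defaults.conf` and points at `/sparkfs/eventlog`, the finished job should also show up in the History Server at http://localhost:18080.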
## Development Commands

1. Stop the cluster: `make down`
2. Stop and clean all volumes: `make down-clean`
3. Open a shell in the NameNode: `make dev-hdfs-bash`
4. Run Hadoop only (debug mode): `make dev-run-hadoop`
5. Run Spark only (debug mode): `make dev-run-spark`
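
For example, after `make dev-hdfs-bash` drops you into the NameNode container, a few plain HDFS commands are enough to inspect the Spark directories and DataNode status:

```bash
# Run inside the shell opened by `make dev-hdfs-bash`
hdfs dfs -ls /sparkfs               # the directories created by make create-spark-fs
hdfs dfs -ls /sparkfs/eventlog      # event logs written by completed Spark jobs
hdfs dfsadmin -report | head -n 20  # quick view of DataNode status
```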
## Notes

- Hadoop data is persisted in Docker named volumes (`hadoop_namenode`, `hadoop_datanode_1`, `hadoop_datanode_2`).
- Spark logs and history are stored inside HDFS under `/sparkfs/`.
- Configurations are mounted from `./configs/`.
- You can scale workers easily:

  ```bash
  docker compose up -d --scale spark-worker=3
  ```
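
To see where the persisted HDFS data actually lives, you can inspect the named volumes. Note that Docker Compose normally prefixes volume names with the project (directory) name, so the exact names may differ slightly from the short names above:

```bash
# List and inspect the Hadoop volumes (names may carry a Compose project prefix)
docker volume ls --filter name=hadoop
docker volume inspect $(docker volume ls -q --filter name=hadoop)
```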