Apache Spark + Hadoop HDFS DE Cluster

This project spins up a local HDFS + Spark cluster for development, testing, and learning. Using Docker Compose, it runs a Hadoop NameNode and DataNodes together with a Spark Master, Workers, and History Server, giving you a complete big data playground on your machine. The setup lets you experiment with distributed storage and computation, submit Spark jobs against HDFS, and monitor everything through built-in web UIs. It’s lightweight, easy to start with `make run`, and provides a practical way to explore Hadoop–Spark workflows without complex infrastructure.


Created: 2025-09-25 · Updated: 2025-09-26

Installation & Usage

Hadoop + Spark Cluster with Docker Compose

This project provides a local Hadoop HDFS + Apache Spark cluster using Docker Compose.
It is meant for development, testing, and learning; it is not production-ready.


Services

Hadoop

  • hadoop-namenode (port: 9870) – HDFS master node & UI
  • hadoop-datanode-1, hadoop-datanode-2 – HDFS workers storing blocks

Spark

  • spark-master (ports: 8080, 7077) – Spark master node & UI
  • spark-worker – Spark worker(s) attached to the cluster
  • spark-history (port: 18080) – Spark History Server for job monitoring
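
Once the stack is up, a quick way to confirm that all of these services are actually running (a sketch; the service names below assume the Compose file in this repo):

```bash
# List the cluster containers with their status and published ports
docker compose ps

# Tail a specific service's logs if something looks unhealthy
docker compose logs -f hadoop-namenode
```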

Setup & Usage

1. Start Hadoop (NameNode + DataNodes)

```bash
make build-docker-image
```
  • Spins up Hadoop containers
  • Waits until HDFS is healthy and available
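
To verify HDFS health yourself, you can query the NameNode directly (a sketch; the `hadoop-namenode` service name assumes the Compose file in this repo):

```bash
# Cluster report: live DataNodes, capacity, remaining space
docker compose exec hadoop-namenode hdfs dfsadmin -report

# The NameNode should have left safe mode once HDFS is usable
docker compose exec hadoop-namenode hdfs dfsadmin -safemode get
```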

2. Create Spark filesystem directories in HDFS

```bash
make create-spark-fs
```

Creates:

  • /sparkfs/eventlog (for Spark event logs, mode 1777)
  • /sparkfs/warehouse (for Spark SQL, mode 775)
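
This is roughly equivalent to the following HDFS commands (a sketch of what the target does, not the Makefile itself):

```bash
# Create the Spark event-log and warehouse directories in HDFS
docker compose exec hadoop-namenode hdfs dfs -mkdir -p /sparkfs/eventlog /sparkfs/warehouse

# Sticky, world-writable event-log dir; group-writable warehouse
docker compose exec hadoop-namenode hdfs dfs -chmod 1777 /sparkfs/eventlog
docker compose exec hadoop-namenode hdfs dfs -chmod 775 /sparkfs/warehouse
```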

3. Start Spark only (Master, Workers, History Server) [Hadoop must already be initialised]

```bash
make run-spark
```
  • Starts Spark master, 2 workers (--scale spark-worker=2), and history server
  • Mounts configuration from ./configs/spark/spark-defaults.conf
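
To confirm that both workers registered with the master, you can grep the master's logs (a sketch; the exact log wording is an assumption about Spark's standalone master):

```bash
# One "Registering worker ..." line should appear per worker
docker compose logs spark-master | grep -i "registering worker"
```

The Spark Master UI at http://localhost:8080 also lists the attached workers.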

4. Run full stack (Hadoop + Spark)

```bash
make run
```

After startup, the UIs are available at:

  • Hadoop NameNode UI: http://localhost:9870
  • Spark Master UI: http://localhost:8080
  • Spark History Server: http://localhost:18080
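
As a smoke test, you can submit the bundled SparkPi example against the cluster (a sketch; the examples jar path depends on the Spark distribution inside the image):

```bash
# Run SparkPi on the standalone master; the glob expands inside the container
docker compose exec spark-master sh -c \
  'spark-submit --master spark://spark-master:7077 \
     --class org.apache.spark.examples.SparkPi \
     /opt/spark/examples/jars/spark-examples_*.jar 100'
```

The finished application should then show up in the History Server at http://localhost:18080.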

Development Commands

1. Stop cluster

```bash
make down
```

2. Stop and clean all volumes

```bash
make down-clean
```

3. Open shell in NameNode

```bash
make dev-hdfs-bash
```
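
Inside that shell you can work with HDFS directly, for example (a sketch; the /playground path is purely illustrative):

```bash
# Browse the Spark filesystem layout created earlier
hdfs dfs -ls /sparkfs

# Round-trip a small file through HDFS
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -mkdir -p /playground
hdfs dfs -put /tmp/hello.txt /playground/
hdfs dfs -cat /playground/hello.txt
```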

4. Run Hadoop only (debug mode)

```bash
make dev-run-hadoop
```

5. Run Spark only (debug mode)

```bash
make dev-run-spark
```

Notes

  • Hadoop data is persisted in Docker named volumes (hadoop_namenode, hadoop_datanode_1, hadoop_datanode_2).
  • Spark logs and history are stored inside HDFS under /sparkfs/.
  • Configurations are mounted from ./configs/.
  • You can scale workers easily:

```bash
docker compose up -d --scale spark-worker=3
```
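
After scaling, you can check that the extra replica came up (the Master UI at http://localhost:8080 should also show it as a registered worker):

```bash
# All spark-worker replicas and their status
docker compose ps spark-worker
```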