Jakub’s Projects


Polish Car Market Analysis Data Scraper [Python + Airflow]

This project delivers a complete data engineering pipeline for the CEPIK (Centralna Ewidencja Pojazdów i Kierowców) vehicle registration database in Poland. It automates scraping raw data directly from the CEPIK API, covering both dictionaries (such as vehicle brands, fuel types, and voivodeships) and vehicle registration records, historical and current. The system is built on Apache Airflow for orchestration and scheduling, Apache Spark with Delta Lake for scalable data transformations, and HDFS for distributed raw storage. Transformed and curated data lands in PostgreSQL, ready to be consumed by analytical and BI tools.
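
To give a feel for the orchestration layer, here is a minimal sketch of what such a DAG can look like; the task callables, schedule, and endpoint hints are illustrative assumptions, not the project’s actual code:

```python
# Illustrative Airflow DAG shape for the CEPIK pipeline; task callables and
# the API endpoints mentioned in comments are assumptions, not the real code.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_dictionaries():
    # e.g. pull dictionary endpoints (brands, fuel types, voivodeships) and land raw JSON in HDFS
    ...


def scrape_vehicles():
    # e.g. pull registration records per voivodeship/date window into HDFS
    ...


def transform_with_spark():
    # e.g. run a Spark + Delta Lake job that curates the raw files
    ...


def load_to_postgres():
    # e.g. write the curated tables to PostgreSQL for BI consumption
    ...


with DAG(
    dag_id="cepik_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    dicts = PythonOperator(task_id="scrape_dictionaries", python_callable=scrape_dictionaries)
    vehicles = PythonOperator(task_id="scrape_vehicles", python_callable=scrape_vehicles)
    transform = PythonOperator(task_id="transform_with_spark", python_callable=transform_with_spark)
    load = PythonOperator(task_id="load_to_postgres", python_callable=load_to_postgres)

    dicts >> vehicles >> transform >> load
```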

Polish Car Market Analysis [PowerBI]

This project presents an interactive Power BI report built on Poland’s CEPIK database, offering a clear view of vehicle registration trends. It combines national and regional insights with intuitive filters, allowing users to explore the automotive landscape across years, categories, and fuel types. Key features include:

- Overview statistics such as median production year, average engine power and displacement, and counts of historic vehicles.
- Geographical analysis of registrations by voivodeship with map visualizations.
- Vehicle categories and fuel types, including breakdowns for passenger, cargo, and electric vehicles.
- Top brands ranking with comparative volumes.
- Advanced analytics like the relationship between engine size and power, fuel consumption trends, and power-to-weight ratios over time.

Designed for exploration and decision-making, the dashboard transforms raw CEPIK data into an accessible tool for monitoring the evolution of Poland’s vehicle fleet.
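
For a sense of the underlying measures, here is a small pandas sketch of a few of them; the column names and sample values are illustrative, not the real CEPIK field names or the report’s DAX:

```python
# Toy computation of a few of the report's measures; columns and values are
# illustrative assumptions, not the actual CEPIK schema.
import pandas as pd

vehicles = pd.DataFrame({
    "production_year": [2008, 2015, 2021, 1988],
    "engine_power_kw": [77, 110, 150, 66],
    "curb_weight_kg": [1180, 1420, 1550, 900],
    "fuel_type": ["PETROL", "DIESEL", "ELECTRIC", "PETROL"],
})

# Overview statistics shown on the report's landing page.
median_production_year = vehicles["production_year"].median()
avg_power_kw = vehicles["engine_power_kw"].mean()
historic_count = int((vehicles["production_year"] < 1990).sum())

# One of the "advanced analytics" measures: power-to-weight ratio per vehicle.
vehicles["power_to_weight"] = vehicles["engine_power_kw"] / vehicles["curb_weight_kg"]

print(median_production_year, round(avg_power_kw, 1), historic_count)
print(vehicles[["fuel_type", "power_to_weight"]])
```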

Polish Business Data Delivery [API]

This project delivers a FastAPI-based service for scraping, parsing, and serving company information from Poland’s National Court Register (KRS). It automates the collection of official extracts and financial documents, stores them in PostgreSQL, and exposes structured results through simple API endpoints.

The system is built for scalability and automation: scraping jobs run on Redis-backed workers, job metadata is tracked in Redis queues, and Spark streaming jobs handle ETL for raw KRS API data. Users can query company details by KRS number, check the status of scraping tasks, or schedule automatic updates to keep the database in sync with the latest registry changes.

With Docker Compose and Poetry, the stack is easy to set up locally or in a containerized environment. It’s a flexible foundation for anyone who needs reliable access to official Polish business data — whether for analytics, monitoring, or integration with other systems.
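
A rough sketch of the request/queue side of such a service; the endpoint paths, queue name, and payload shape are illustrative assumptions rather than the project’s actual API contract:

```python
# Minimal FastAPI + Redis queue sketch; paths, queue name, and key layout are
# illustrative assumptions, not the project's real API.
import json

import redis
from fastapi import FastAPI, HTTPException

app = FastAPI()
queue = redis.Redis(host="localhost", port=6379, decode_responses=True)


@app.post("/companies/{krs_number}/scrape")
def enqueue_scrape(krs_number: str) -> dict:
    """Push a scraping job onto a Redis list for a background worker to pick up."""
    queue.rpush("krs_scrape_jobs", json.dumps({"krs": krs_number}))
    return {"status": "queued", "krs": krs_number}


@app.get("/companies/{krs_number}")
def get_company(krs_number: str) -> dict:
    """Return the parsed extract if a worker has already stored it."""
    raw = queue.get(f"krs:result:{krs_number}")  # stand-in for a PostgreSQL lookup
    if raw is None:
        raise HTTPException(status_code=404, detail="Not scraped yet")
    return json.loads(raw)
```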

Grafana + Prometheus Docker Compose

This project provides a ready-to-use monitoring stack with Prometheus and Grafana running in Docker Compose. It comes preconfigured with cAdvisor (for Docker container metrics) and Node Exporter (for system health), so you can start collecting and visualizing metrics right away. Prometheus scrapes exporters based on a local configuration file, and you can extend it easily by adding new jobs. Grafana connects to Prometheus out of the box, letting you import pre-built dashboards from the community and explore metrics in real time. The external Prometheus port is mapped to 9091 to avoid conflicts with existing services.
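
Once the stack is up, the Prometheus HTTP API on the remapped port makes for a quick smoke test; the snippet below assumes a local run with Prometheus on port 9091, and the query shown is just an example:

```python
# Quick check against the Prometheus HTTP API exposed by the compose stack.
# Assumes Prometheus is reachable locally on the remapped port 9091.
import requests

BASE = "http://localhost:9091/api/v1"

# List scrape targets (cAdvisor, Node Exporter, Prometheus itself) and their health.
targets = requests.get(f"{BASE}/targets", timeout=5).json()
for target in targets["data"]["activeTargets"]:
    print(target["labels"].get("job"), target["health"])

# Run an instant query, e.g. per-container CPU usage rate from cAdvisor.
resp = requests.get(
    f"{BASE}/query",
    params={"query": "rate(container_cpu_usage_seconds_total[5m])"},
    timeout=5,
).json()
print(len(resp["data"]["result"]), "series returned")
```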

Apache Spark + Hadoop HDFS DE Cluster

This project spins up a local HDFS + Spark cluster for development, testing, and learning. Using Docker Compose, it runs a Hadoop NameNode and DataNodes together with a Spark Master, Workers, and History Server, giving you a complete big data playground on your machine. The setup lets you experiment with distributed storage and computation, submit Spark jobs against HDFS, and monitor everything through the built-in web UIs. It’s lightweight, starts with a single make run, and provides a practical way to explore Hadoop–Spark workflows without complex infrastructure.
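
A small PySpark job is enough to exercise both the Spark master and HDFS; the spark:// and hdfs:// addresses below are assumptions about the compose setup’s hostnames and ports:

```python
# Minimal PySpark smoke test against the local cluster; the spark:// and
# hdfs:// addresses are assumptions about the compose hostnames and ports.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-smoke-test")
    .master("spark://localhost:7077")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("alice", 1), ("bob", 2), ("carol", 3)],
    ["name", "value"],
)

# Write to and read back from HDFS to confirm the NameNode/DataNodes are reachable.
path = "hdfs://localhost:9000/tmp/smoke_test_parquet"
df.write.mode("overwrite").parquet(path)
print(spark.read.parquet(path).count())

spark.stop()
```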

PostgreSQL with Change Data Capture Support [Kafka + Debezium]

This project sets up a lightweight environment for real-time Change Data Capture from PostgreSQL into Kafka. Using Docker Compose, it runs PostgreSQL with Debezium, Kafka, ZooKeeper, and Kafka Connect, so any row-level change in the database is streamed as an event into Kafka topics. The stack is useful for testing data pipelines and stream processing tools like Spark: you can register Debezium connectors, watch topics update in real time, and experiment with CDC patterns locally without complex infrastructure.
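
The typical workflow against such a stack, sketched below, is to register a Debezium connector through the Kafka Connect REST API and then tail the change topic; hostnames, ports, credentials, table names, and the topic prefix are assumptions about the local compose file:

```python
# Register a Debezium PostgreSQL connector and tail its change events.
# Hostnames, ports, credentials, table names, and the topic prefix are
# assumptions about the local compose setup, not values from the project.
import json

import requests
from kafka import KafkaConsumer

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "appdb",
        "topic.prefix": "cdc",
        "table.include.list": "public.orders",
    },
}

# Kafka Connect REST API (default port 8083).
resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()

# Every committed row-level change on public.orders now lands on cdc.public.orders.
consumer = KafkaConsumer(
    "cdc.public.orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v) if v else None,
)
for message in consumer:
    print(message.value)
```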

Airflow Local Docker Setup

This project provides a minimal Apache Airflow environment for local development and testing. It runs a single-node Airflow cluster with PostgreSQL as the metadata database, packaged via Docker Compose. The setup is lightweight — using the LocalExecutor without Celery or Redis — but can be extended by building a custom Airflow image that installs additional Python libraries. It’s designed as a quick, resource-friendly way to experiment with Airflow workflows on your own machine.
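
Dropping a minimal DAG into the mounted dags/ folder is enough to confirm everything runs; the example below is just a throwaway smoke test, not part of the project:

```python
# Tiny TaskFlow DAG to drop into the dags/ folder as a smoke test for the
# local setup; the dag_id and task are illustrative, not part of the project.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def local_smoke_test():
    @task
    def say_hello() -> str:
        print("Airflow LocalExecutor is up")
        return "ok"

    say_hello()


local_smoke_test()
```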

Domain Availability Checker Tool [WHOIS]

This toolkit checks domains directly against registry WHOIS servers (no third-party whois libraries), handles internationalized domain names, and batch-processes domain dictionaries with results persisted in PostgreSQL. Use the CLI for quick lookups, or the Python API/Airflow DAG for scheduled, reproducible runs.
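
At its core, such a lookup is a raw socket query against a registry WHOIS server; a minimal sketch, keeping in mind that the server address and the "no match" phrasing vary per TLD and are simplified assumptions here:

```python
# Bare-bones WHOIS availability check over a raw socket, with IDN handling.
# The registry server and the "no match" phrasing differ per TLD, so both
# are simplified assumptions in this sketch.
import socket


def whois_query(domain: str, server: str = "whois.dns.pl", port: int = 43) -> str:
    """Send a WHOIS query for `domain` (IDNA-encoded) and return the raw reply."""
    ascii_domain = domain.encode("idna").decode("ascii")  # handles e.g. żółty.pl
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((ascii_domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    reply = whois_query("example.pl")
    # Real code should match per-TLD "not found" markers; this phrase is only
    # an approximation of the .pl registry's reply for unregistered names.
    print("available" if "no information available" in reply.lower() else "registered")
```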
