Polish Car Market Analysis Data Scraper [Python + Airflow]

This project delivers a complete data engineering pipeline for CEPIK (Centralna Ewidencja Pojazdów i Kierowców), Poland's central vehicle and driver registry. It automates scraping raw data directly from the CEPIK API, covering both dictionaries (such as vehicle brands, fuel types, and voivodeships) and vehicle registration records (historical and current). The system is built on Apache Airflow for orchestration and scheduling, Apache Spark with Delta Lake for scalable data transformations, and HDFS for distributed raw storage. Transformed, curated data lands in PostgreSQL, ready for consumption by analytics and BI tools.



Installation & Usage

CEPIK Scraping Project

Airflow · Apache Spark · Delta Lake · PostgreSQL · Python · HDFS

Overview

This project provides DAGs and ETL pipelines for scraping, transforming, and loading data from the CEPIK (Centralna Ewidencja Pojazdów i Kierowców) vehicle registration database in Poland.

The pipelines automate:

  • Downloading dictionaries (e.g., vehicle brands, fuel types, voivodeships).
  • Scraping historical and current vehicle registrations.
  • Persisting raw JSON into HDFS (bronze layer).
  • Transforming JSON into Delta Lake (silver layer) with a clean schema.
  • Loading data into PostgreSQL (gold layer) for analytics and reporting.
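A minimal sketch of how one of these DAGs could be wired, assuming Airflow 2.4+; the DAG id, task names, and callables below are illustrative placeholders, not the project's actual modules:

```python
# A minimal bronze -> silver -> gold DAG sketch, assuming Airflow 2.4+.
# The DAG id, task names, and callables are illustrative, not the project's code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_to_bronze(**context):
    """Fetch JSON from the CEPIK API and persist it to HDFS (bronze layer)."""
    ...


def transform_to_silver(**context):
    """Flatten raw JSON into Delta Lake tables (silver layer) via Spark."""
    ...


def load_to_gold(**context):
    """Load curated data from Delta into PostgreSQL (gold layer)."""
    ...


with DAG(
    dag_id="cepik_vehicles_example",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="scrape_to_bronze", python_callable=scrape_to_bronze)
    silver = PythonOperator(task_id="transform_to_silver", python_callable=transform_to_silver)
    gold = PythonOperator(task_id="load_to_gold", python_callable=load_to_gold)

    bronze >> silver >> gold
```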

Features

Technology Stack

  • Apache Airflow – workflow orchestration.
  • Apache Spark + Delta Lake – distributed data processing and storage.
  • PostgreSQL – relational database for curated data.
  • HDFS (WebHDFS) – raw data storage.
  • Python – scraping and ETL logic.
  • Requests + SSL adapter – robust API integration with CEPIK (see the sketch after this list).
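The adapter itself is not shown in this README; a common pattern for endpoints with legacy TLS configurations looks like the following (the lowered OpenSSL security level is an assumption about what the CEPIK endpoint requires):

```python
# Sketch of a requests Session with a custom SSL adapter for the CEPIK API.
# The lowered OpenSSL security level is an assumption; adjust to what the
# endpoint actually requires.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context


class LegacySSLAdapter(HTTPAdapter):
    """Mounts an SSL context that tolerates older server-side TLS setups."""

    def init_poolmanager(self, *args, **kwargs):
        ctx = create_urllib3_context()
        # Accept legacy ciphers (assumption about the CEPIK TLS configuration).
        ctx.set_ciphers("DEFAULT:@SECLEVEL=1")
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)


session = requests.Session()
session.mount("https://", LegacySSLAdapter())

# Fetch the dictionary index from the public CEPIK API.
resp = session.get("https://api.cepik.gov.pl/slowniki", timeout=30)
resp.raise_for_status()
print(resp.json())
```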

Database Schema

  • dict_voivodeships – dictionary of voivodeships (from CEPIK API).
  • dict_voivodeships_coords – voivodeship names, capitals, and geocoordinates.
  • vehicles – main upserted table of vehicle registrations.
  • staging_vehicles – temporary table for incoming daily data.

Supporting stored functions and procedures:

  • upsert_vehicles_from_stage() – merges staging data into the main vehicles table.
  • updateVoivodeshioCoordsTableKeys() – enriches dict_voivodeships_coords with CEPIK codes.
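As one hedged sketch, the upsert can be invoked from Python once staging_vehicles has been loaded; connection details are placeholders, and only the function and table names come from this README:

```python
# Sketch: invoke the upsert after staging_vehicles has been loaded.
# Connection details are placeholders; function/table names are from this README.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="cepik", user="airflow", password="***")
try:
    with conn, conn.cursor() as cur:
        # Merge freshly loaded staging rows into the main vehicles table.
        # If this is a stored procedure rather than a function, use CALL instead.
        cur.execute("SELECT upsert_vehicles_from_stage();")
        # Assumption: staging is cleared between daily batches.
        cur.execute("TRUNCATE staging_vehicles;")
finally:
    conn.close()
```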

Data Flow

  1. Scraping: Airflow DAG fetches JSON from CEPIK API.
  2. Bronze: Raw JSON stored in HDFS.
  3. Silver: Spark jobs flatten JSON into Delta Lake tables (sketched after this list).
  4. Gold: Spark jobs load curated data into PostgreSQL.
  5. Analytics: PostgreSQL data can be used in BI tools like Power BI.
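A hedged sketch of the silver step, assuming the Delta Lake jars are available on the Spark classpath; the HDFS paths, payload shape, and column names are placeholders rather than the project's actual schema:

```python
# Sketch of the silver step: flatten bronze JSON into a Delta table.
# Assumes Delta Lake jars are on the Spark classpath; paths and column
# names are placeholders (the real payload may nest differently).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("cepik-silver")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read raw JSON from the bronze layer in HDFS.
raw = spark.read.json("hdfs:///cepik/bronze/vehicles/*.json")

# Flatten the nested API payload into a tabular schema (illustrative columns).
vehicles = raw.select(
    F.col("data.id").alias("vehicle_id"),
    F.col("data.attributes.brand").alias("brand"),
    F.col("data.attributes.fuel_type").alias("fuel_type"),
)

# Append the cleaned rows to the silver Delta table.
vehicles.write.format("delta").mode("append").save("hdfs:///cepik/silver/vehicles")
```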

Setup

Requirements

  • Docker / Kubernetes for running Airflow & Spark.
  • PostgreSQL 14+.
  • HDFS cluster with WebHDFS enabled.
  • Python ≥ 3.9 with:
      • requests
      • beautifulsoup4
      • lxml
      • pyspark
      • delta-spark

Environment Variables / Airflow Variables

  • hdfs-data-path – base path in HDFS.
  • local-data-path – local temp path for scraped files.
  • task_pool – Airflow pool for API requests.
  • Connections:
      • conn-webhdfs – WebHDFS connection.
      • conn-pg-cepik-db – PostgreSQL connection for the CEPIK DB.
      • spark-conn – Spark connection.
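Inside DAG code, these names would typically be resolved through Airflow's Variable and connection APIs; a minimal sketch using the names listed above:

```python
# Sketch: resolving the Airflow Variables and Connections listed above in DAG code.
from airflow.hooks.base import BaseHook
from airflow.models import Variable

# Variables (names from this README).
hdfs_data_path = Variable.get("hdfs-data-path")
local_data_path = Variable.get("local-data-path")

# Connections (names from this README); BaseHook returns the stored metadata.
pg_conn = BaseHook.get_connection("conn-pg-cepik-db")
print(pg_conn.host, pg_conn.port, pg_conn.schema)
```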

Running

  1. Deploy Airflow DAGs in your Airflow environment.
  2. Ensure Spark jobs are mounted under /opt/airflow/libs/cepik/transformations/.
  3. Start the DAGs from the Airflow UI:
      • cepik_dictionary_voivodeships_*
      • cepik_vehicles_current_v*
      • cepik_vehicles_history_*
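Runs can also be triggered programmatically through Airflow's stable REST API; a hedged sketch assuming basic-auth is enabled on a local webserver (the URL, credentials, and concrete DAG id are placeholders):

```python
# Sketch: trigger a DAG run via Airflow's stable REST API (Airflow 2.x).
# The webserver URL, credentials, and concrete DAG id are placeholders.
import requests

AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "cepik_vehicles_current_v1"  # placeholder; use the deployed DAG id

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),  # assumes the basic-auth API backend is enabled
    json={"conf": {}},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```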

Example Usage
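One hedged example: once a run completes, the curated vehicles table in the gold layer can be queried directly from Python. The table name comes from the schema section above; the brand column and connection details are assumptions:

```python
# Sketch: query the curated gold layer from Python.
# The vehicles table name comes from the schema section; the brand column
# and connection details are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="cepik", user="analyst", password="***")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT brand, COUNT(*) AS registrations
        FROM vehicles
        GROUP BY brand
        ORDER BY registrations DESC
        LIMIT 10;
        """
    )
    for brand, registrations in cur.fetchall():
        print(f"{brand}: {registrations}")
conn.close()
```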

License

MIT License – free to use and adapt.