BUSINESS DATA API

About Project
API for scraping, parsing and storing information about companies registered in Poland.
Features
- Scrape financial documents from official government site in bulk
- Fetch detailed information about businesses from the official government API
- Store collected data in local repository
- Return requested business information through API endpoints
- Store metadata about scraping job status in Redis
- Query scraping job status through API endpoint
- Scraping jobs are handled by workers that can scale horizontally
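Under the hood, the queueing follows the usual rq pattern: the API pushes a job onto a Redis-backed queue, any free worker picks it up, and the job's status metadata stays in Redis. A minimal sketch of that flow is shown below; the queue name, Redis URL, and task function are illustrative, not taken from this codebase.

```python
from redis import Redis
from rq import Queue
from rq.job import Job


def scrape_financial_documents(krs_number: str) -> str:
    """Placeholder task; in a real deployment the task function must live in a
    module that the worker process can import."""
    return f"scraped documents for {krs_number}"


# Illustrative connection and queue name.
redis_conn = Redis.from_url("redis://localhost:6379/0")
queue = Queue("krs_df", connection=redis_conn)

# The API enqueues the job; an rq worker listening on the same queue executes it.
job = queue.enqueue(scrape_financial_documents, "0000123456")

# Job status metadata is kept in Redis, so it can be re-read later by id,
# e.g. from a job-status endpoint.
print(Job.fetch(job.id, connection=redis_conn).get_status())
```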
Technologies used
- requests – HTTP library for making API calls (>=2.32.3,<3.0.0)
- bs4 (BeautifulSoup4) – Parsing HTML and XML documents (>=0.0.2,<0.0.3)
- lxml – High-performance XML and HTML parser (>=5.4.0,<6.0.0)
- redis – Redis client for Python (>=6.2.0,<7.0.0)
- rq – Task queue library using Redis (>=2.3.3,<3.0.0)
- dotenv – Loads environment variables from .env files (>=0.9.9,<0.10.0)
- colorlog – Colored log output for better readability (>=6.9.0,<7.0.0)
- sqlalchemy – SQL toolkit and Object-Relational Mapping (ORM) library (>=2.0.41,<3.0.0)
- fastapi – High-performance web framework for building APIs (>=0.115.13,<0.116.0)
- uvicorn – ASGI server for running FastAPI applications (>=0.34.3,<0.35.0)
- asyncpg – Asynchronous PostgreSQL client (>=0.30.0,<0.31.0)
- psycopg – PostgreSQL database adapter for Python (>=3.2.9,<4.0.0)
- greenlet – Lightweight concurrency primitives for Python (>=3.2.3,<4.0.0)
- pydantic[email] – Data parsing and validation with email field support (>=2.11.7,<3.0.0)
Installation
The project requires Poetry in order to install all the dependencies listed in 'pyproject.toml'.
- Clone the git repository:
git clone https://github.com/JakLjk/BUSINESS-DATA-API
- Install dependencies:
poetry install
- Populate the .env file (you can use the template .env.example by renaming it to .env) with URLs pointing to the PostgreSQL server and the Redis server, which are needed to store scraped data and to manage workers (an illustrative example of a filled-in file follows this list).
- To run the server, run:
poetry run uvicorn wsgi:app --host <host ip> --port <port>
- To run workers, go into the workers directory:
cd ./business_data_api/workers/
- To run the worker responsible for scraping business documents:
poetry run krsdf_worker.py
- To run the worker responsible for scraping and transforming data from the official KRS API:
poetry run krsapi_worker.py
- To run the Spark streaming job responsible for the ETL process on raw KRS API data, run:
poetry run python run_spark.py
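A filled-in .env could look roughly like the sketch below. REDIS_HOST matches the example used later in this README; the remaining variable names and values are placeholders, so the bundled .env.example remains the authoritative list of settings.

```
# Illustrative values only -- copy .env.example to .env and adjust to your setup.
REDIS_HOST=redis://localhost:6379/0
DATABASE_URL=postgresql+asyncpg://user:password@localhost:5432/business_data
```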
How to use the tool
To get data for a specific company, you need to know its KRS number, which is a unique number assigned to business entities registered in Poland's National Court Register (KRS).
Documentation for the KRS API and KRS DF endpoints and their corresponding functions can be accessed by opening the webpage: <server ip>:<server port>/docs
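As a quick illustration, fetching data for a single company could look like the snippet below; the exact route must be taken from the /docs page, so the path used here is only a placeholder.

```python
import requests

BASE_URL = "http://localhost:8000"  # <server ip>:<server port>
KRS_NUMBER = "0000123456"           # KRS number of the company of interest

# Placeholder path -- consult <server ip>:<server port>/docs for the real routes.
response = requests.get(f"{BASE_URL}/krs-api/extract/{KRS_NUMBER}", timeout=30)
response.raise_for_status()
print(response.json())
```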
Additional tools
In the automation_scripts folder you can find additional tools that can help with populating the database.
You can use the command:
poetry run python run_automation.py
or
cd /automation_scripts
poetry run check_for_krs_updates --api-url <ip to the business data api> --days <how many days to check>
This automation script can be used to scrape changes for KRS numbers that were registered in the official KRS API registry.
Those changes are then sent as queries to the business data API in order to scrape information about the current extract and financial documents.
The script can be used, for example, to automatically fetch daily changes in the KRS registry and refresh the data for all updated entities.
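As a concrete illustration of that daily-refresh use case, the check could be scheduled with cron; the repository path, API address, and schedule below are assumptions, not values taken from the project.

```
# Every night at 02:00, fetch the last day of KRS registry changes and push
# them to a locally running business data API (path and URL are placeholders).
0 2 * * * cd /opt/BUSINESS-DATA-API/automation_scripts && poetry run check_for_krs_updates --api-url http://localhost:8000 --days 1
```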
Config file
In order for the tool to work, the attached .env.example file has to be filled with values that tell the script where to connect, e.g. to the Redis queue and to the PostgreSQL databases responsible for storing raw data, transformed data, and log data. The file should then be renamed to .env.
If the project is used in a Docker stack, some addresses can be left the way they are in the .env.example file. For example, REDIS_HOST=redis://redis will point to the address of the Redis server container named 'redis', which is in the same Docker network as the rest of the stack.
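For illustration, this is roughly how a service could consume those values once the file is in place; REDIS_HOST is the variable named above, while the database variable name and the exact usage are assumptions rather than the project's actual code.

```python
import os

from dotenv import load_dotenv
from redis import Redis
from sqlalchemy import create_engine

load_dotenv()  # read key=value pairs from .env into the process environment

# REDIS_HOST comes from the README example (e.g. redis://redis inside Docker);
# DB_URL_RAW_DATA is a hypothetical variable name used only for this sketch.
redis_conn = Redis.from_url(os.environ["REDIS_HOST"])
engine = create_engine(os.environ["DB_URL_RAW_DATA"])
```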
Docker configuration
The attached Makefile has pre-configured commands that allow running the stack in different configurations, such as:
- sudo make run-base
- will run the Docker containers necessary for the backend API to work: the Redis server, the FastAPI backend, and single worker nodes for scraping KRS financial documents and the KRS API JSON registry
- sudo make run-spark-d
- will run only the Spark ETL job in detached mode
- sudo make down-spark
- will stop and remove only the Spark container
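A typical bring-up sequence using only the targets listed above (the ordering is illustrative; the Makefile itself is the source of truth):

```
sudo make run-base      # Redis, FastAPI backend and one worker of each type
sudo make run-spark-d   # start the Spark ETL job in detached mode
sudo make down-spark    # later: stop and remove only the Spark container
```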
Future Updates
- Add task-level logging functionality to catch unexpected errors during the task scraping process
- Add an analytics endpoint responsible for returning statistical data and analysis for a specific business and comparisons between businesses
- ~~Add docker file for composing images for fastapi server and scraping workers~~