Weather and Air Quality ETL Pipeline

Project Overview

This project demonstrates an ETL (Extract, Transform, Load) pipeline that integrates weather and air quality data for a specified city over a defined time period. The data is extracted from external APIs, transformed (cleaned and processed), and loaded into a PostgreSQL database. The goal of this project is to provide a clean, well-structured dataset for further analysis or reporting.

Project Structure

├── src
│   ├── extract.py        # Code for extracting data from APIs
│   ├── transform.py      # Code for cleaning and transforming the data
│   ├── load.py           # Code for loading data into PostgreSQL
│   ├── config.py         # Configuration for API keys, database credentials
│   └── utils.py          # Utility functions for the ETL process
├── sql
│   └── schema.sql        # SQL script for creating database schema
├── notebooks
│   └── analysis.ipynb    # Jupyter notebook for data exploration and analysis
├── requirements.txt      # Python dependencies
├── README.md             # Project documentation
├── LICENSE               # License details
└── run_etl.sh            # Shell script that runs the full ETL pipeline

Technologies Used

  • Python (extraction, transformation, and loading scripts)
  • PostgreSQL (target database)
  • Visual Crossing API (weather data)
  • AirVisual API (air quality data)
  • Jupyter Notebook (data exploration and analysis)

ETL Pipeline Breakdown

1. Data Extraction

Weather data for the specified city and time period is pulled from the Visual Crossing API, and air quality data from the AirVisual API (src/extract.py).
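As an illustration of this step, the helpers below build request URLs for the two APIs. They are a sketch of what src/extract.py might do, not the repo's actual code; the endpoint paths and query parameters are assumptions based on the public Visual Crossing and AirVisual (IQAir) API documentation.

```python
from urllib.parse import urlencode, quote

def visualcrossing_url(city: str, start: str, end: str, api_key: str) -> str:
    """Build a Visual Crossing Timeline API request URL for a date range."""
    base = "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline"
    query = urlencode({"unitGroup": "metric", "key": api_key, "contentType": "json"})
    return f"{base}/{quote(city)}/{start}/{end}?{query}"

def airvisual_url(city: str, state: str, country: str, api_key: str) -> str:
    """Build an AirVisual /v2/city request URL."""
    base = "http://api.airvisual.com/v2/city"
    query = urlencode({"city": city, "state": state, "country": country, "key": api_key})
    return f"{base}?{query}"

# The extraction itself would then be a simple GET, e.g. with requests:
#   weather = requests.get(visualcrossing_url("Athens", "2024-01-01", "2024-01-31", key)).json()
```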

2. Data Transformation

The raw API responses are cleaned and processed into a consistent, analyzable format (src/transform.py).
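A minimal sketch of this kind of cleaning is shown below: normalize fields, drop incomplete rows, and join the two sources on the observation date. The field names (temp, humidity, aqi) are illustrative assumptions, not necessarily those used in src/transform.py.

```python
def transform(weather_rows, air_rows):
    """Join weather and air quality records on date, keeping only complete rows."""
    air_by_date = {r["date"]: r for r in air_rows if r.get("aqi") is not None}
    merged = []
    for w in weather_rows:
        if w.get("temp") is None:
            continue                      # drop rows missing the temperature reading
        air = air_by_date.get(w["date"])
        if air is None:
            continue                      # keep only dates present in both sources
        merged.append({
            "date": w["date"],
            "temp_c": round(float(w["temp"]), 1),
            "humidity": w.get("humidity"),
            "aqi": int(air["aqi"]),
        })
    return sorted(merged, key=lambda r: r["date"])
```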

3. Data Loading

The transformed data is loaded into a PostgreSQL database, using the schema defined in sql/schema.sql (src/load.py).
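The load step can be sketched as below. The real src/load.py would connect to PostgreSQL (e.g. via psycopg2 or SQLAlchemy) using DATABASE_URL; sqlite3 is used here only so the example is self-contained, and the table and column names are assumptions rather than the actual contents of sql/schema.sql. In PostgreSQL the upsert would use INSERT ... ON CONFLICT instead of SQLite's INSERT OR REPLACE.

```python
import sqlite3

def load(conn, rows):
    """Insert transformed rows, replacing any existing row for the same date."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_weather_air_quality (
               date TEXT PRIMARY KEY, temp_c REAL, humidity REAL, aqi INTEGER)"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_weather_air_quality VALUES (?, ?, ?, ?)",
        [(r["date"], r["temp_c"], r["humidity"], r["aqi"]) for r in rows],
    )
    conn.commit()
```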

4. Data Analysis

The loaded dataset can then be explored and analyzed in the Jupyter notebook (notebooks/analysis.ipynb).
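As one example of the kind of summary the notebook might compute over the joined table, the function below averages AQI per calendar month; the row shape is an illustrative assumption.

```python
def monthly_mean_aqi(rows):
    """Average AQI per month from rows shaped like {'date': 'YYYY-MM-DD', 'aqi': int}."""
    totals = {}
    for r in rows:
        month = r["date"][:7]                      # 'YYYY-MM'
        total, count = totals.get(month, (0, 0))
        totals[month] = (total + r["aqi"], count + 1)
    return {m: total / count for m, (total, count) in sorted(totals.items())}
```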

How to Run the Project

1. Prerequisites

  • Python 3 and pip
  • A running PostgreSQL server
  • API keys for Visual Crossing and AirVisual

2. Setup Instructions

  1. Clone the repository:
    git clone https://github.com/themispap/weather_air_quality_etl_pipline.git
    cd weather_air_quality_etl_pipline
    
  2. Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the required dependencies:
    pip install -r requirements.txt
    
  4. Set up the environment variables:
    • Create a .env file in the root directory with the following keys:
       VISUALCROSSING_API_KEY=<your_visualcrossing_api_key>
       AIRVISUAL_API_KEY=<your_airvisual_api_key>
       DATABASE_URL=postgresql://<username>:<password>@<host>:<port>/<dbname>
      
  5. Set up the PostgreSQL database:
    • Run the SQL script to create the schema:
      psql -U <username> -d <dbname> -f sql/schema.sql
      
  6. Run the ETL pipeline:
    python src/extract.py
    python src/transform.py
    python src/load.py
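The environment variables from step 4 would typically be read by src/config.py, most likely with python-dotenv's load_dotenv(). The stdlib-only sketch below shows the idea; the function names here are hypothetical, not the repo's actual API.

```python
import os

def load_env(text: str) -> dict:
    """Parse KEY=VALUE lines from .env-style text, ignoring blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def get_setting(name: str, env: dict) -> str:
    """Prefer a real environment variable, fall back to the parsed .env file."""
    value = os.environ.get(name, env.get(name))
    if value is None:
        raise KeyError(f"Missing required setting: {name}")
    return value
```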
    

Alternatively, run the entire pipeline with the provided shell script.

First, make sure run_etl.sh is executable:

  chmod +x run_etl.sh

Then execute the script:

  ./run_etl.sh
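A minimal run_etl.sh would simply chain the three stages from step 6 and stop on the first failure. The contents below are a sketch under that assumption, not necessarily the repo's actual script:

```shell
#!/usr/bin/env bash
# Sketch of run_etl.sh: run the three ETL stages in order, aborting on error.
set -euo pipefail

source venv/bin/activate   # assumes the virtual environment from step 2

python src/extract.py
python src/transform.py
python src/load.py

echo "ETL pipeline finished."
```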

3. Running Analysis

Once the pipeline has run, open the Jupyter notebook to explore the loaded data:

  jupyter notebook notebooks/analysis.ipynb

License

This project is licensed under the MIT License. See the LICENSE file for details.