
Overview

Accurate AQI prediction depends on comprehensive, high-quality data from multiple sources. This page covers the types of data required, common sources, and how to structure your datasets for training and inference.
The quality of your predictions is directly related to the quality and completeness of your input data. Missing data, sensor errors, and temporal gaps can significantly impact model performance.

Primary Data Categories

1. Air Quality Measurements

The core of any AQI prediction system is historical air quality data from monitoring stations.

PM2.5

Particulate Matter < 2.5 μm
Fine particles from combustion, vehicle emissions, and industrial processes. Most health-relevant pollutant.
  • Unit: μg/m³
  • Typical range: 0-500+
  • Key driver of AQI

PM10

Particulate Matter < 10 μm
Coarse particles from dust, pollen, and mold. Includes the PM2.5 fraction.
  • Unit: μg/m³
  • Typical range: 0-600+
  • Often correlated with PM2.5

NO2

Nitrogen Dioxide
Produced by vehicle engines and power plants. Respiratory irritant.
  • Unit: ppb or μg/m³
  • Typical range: 0-200 ppb
  • Traffic-related pollutant

O3

Ozone
Secondary pollutant formed by photochemical reactions. Worse in summer.
  • Unit: ppb or μg/m³
  • Typical range: 0-150 ppb
  • Peaks during afternoon

SO2

Sulfur Dioxide
From fossil fuel combustion at power plants and refineries.
  • Unit: ppb or μg/m³
  • Typical range: 0-100 ppb
  • Industrial pollutant

CO

Carbon Monoxide
Colorless, odorless gas from incomplete combustion.
  • Unit: ppm or mg/m³
  • Typical range: 0-10 ppm
  • Vehicle emissions
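
Because the gaseous pollutants above may arrive in either ppb or μg/m³ (ppm or mg/m³ for CO), it can help to convert everything to one unit before training. The following is a minimal sketch, assuming reference conditions of 25 °C and 1 atm (molar volume 24.45 L/mol); some agencies report at 20 °C instead, so check what your data source uses. The function names and the lookup table are illustrative and not part of AQI Predictor.

```python
# Molar volume of an ideal gas at 25 °C and 1 atm, in litres per mole.
# Assumption: your data source uses these reference conditions.
MOLAR_VOLUME_L_PER_MOL = 24.45

# Molecular weights in g/mol.
MOLECULAR_WEIGHT = {"no2": 46.01, "o3": 48.00, "so2": 64.07, "co": 28.01}


def ppb_to_ugm3(ppb, pollutant):
    """Convert a mixing ratio in ppb to a mass concentration in μg/m³."""
    return ppb * MOLECULAR_WEIGHT[pollutant] / MOLAR_VOLUME_L_PER_MOL


def ppm_to_mgm3(ppm, pollutant):
    """Convert ppm to mg/m³ (the factor is the same as for ppb to μg/m³)."""
    return ppm * MOLECULAR_WEIGHT[pollutant] / MOLAR_VOLUME_L_PER_MOL
```

For example, `ppb_to_ugm3(100, "no2")` gives roughly 188 μg/m³ under these assumptions.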

2. Meteorological Data

Weather conditions critically influence pollutant dispersion, transformation, and accumulation.

Wind
Impact: Disperses pollutants; high winds reduce concentrations
  • Wind Speed: m/s or mph
  • Wind Direction: Degrees (0-360) or cardinal directions
  • Importance: Primary dispersion mechanism
  • Typical patterns: Calm winds → higher pollution; strong winds → lower pollution
Models should capture both speed magnitude and directional components (u and v vectors).
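
One way to obtain the u and v components mentioned above is sketched below. It assumes the meteorological convention in which the direction is where the wind blows from, measured clockwise from north; if your source reports the direction the wind blows toward, drop the minus signs. The function name `wind_to_uv` is hypothetical.

```python
import math


def wind_to_uv(speed, direction_deg):
    """Return (u, v), the eastward and northward wind components.

    Assumes direction_deg is the direction the wind blows FROM,
    measured clockwise from north (meteorological convention).
    """
    theta = math.radians(direction_deg)
    u = -speed * math.sin(theta)  # positive u = air moving toward the east
    v = -speed * math.cos(theta)  # positive v = air moving toward the north
    return u, v
```

Unlike the raw angle, the two components have no discontinuity between 359° and 0°, which tends to make them easier for a model to use.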

Temperature
Impact: Affects pollutant chemistry and boundary layer height
  • Unit: °C or °F
  • Importance: Influences ozone formation, vertical mixing
  • Typical patterns:
    • Higher temps → more ozone formation
    • Temperature inversions → trapped pollutants
    • Daily cycles affect mixing height

Relative Humidity
Impact: Influences particle growth and secondary pollutant formation
  • Unit: Percentage (0-100%)
  • Importance: Affects PM2.5 mass, visibility
  • Typical patterns:
    • High humidity → particle hygroscopic growth
    • Affects chemical reactions in atmosphere

Atmospheric Pressure
Impact: Indicates weather systems and stability
  • Unit: hPa or mmHg
  • Importance: High pressure systems → stagnant conditions
  • Typical patterns:
    • High pressure → poor ventilation, pollution buildup
    • Low pressure → better mixing, precipitation

Precipitation
Impact: Wet deposition removes pollutants from air
  • Unit: mm or inches
  • Importance: Rain cleanses air, reduces most pollutants
  • Typical patterns: Precipitation events cause sharp drops in PM concentrations

Solar Radiation
Impact: Drives photochemical reactions
  • Unit: W/m²
  • Importance: Critical for ozone formation
  • Typical patterns: Peak radiation → peak ozone production (with time lag)

3. Temporal Features

Time-based patterns are essential for capturing recurring pollution cycles.
Rush Hour Effects: Traffic-related pollutants (NO2, CO, PM2.5) show clear peaks during morning (7-9 AM) and evening (5-7 PM) rush hours in urban areas.
Ozone Diurnal Cycle: Ozone concentrations typically peak in the afternoon (2-4 PM) due to photochemical production.
Recommended Features:
  • Hour of day (0-23)
  • Cyclical encoding: sin(2π × hour/24), cos(2π × hour/24)
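
The cyclical encoding above can be sketched in a few lines. The function name `encode_hour` is illustrative; the formula is the one given in the list.

```python
import math


def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)
```

With this encoding, hour 23 and hour 0 end up as close to each other as hour 0 and hour 1, which a raw 0-23 integer does not express.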

4. Spatial Features (Optional)

For multi-station or spatial models, location-based features can improve predictions.
  • Station coordinates: Latitude, longitude, elevation
  • Land use: Urban density, industrial zones, green space
  • Traffic density: Proximity to major roads, vehicle counts
  • Topography: Terrain features affecting air flow
  • Population density: Emission source indicator

Common Data Sources

Data availability and quality vary by region. Check local environmental agencies for official monitoring data.

Air Quality Data Sources

Source | Coverage | Access | Notes
EPA AirNow (US) | United States | API, bulk download | Official US air quality data
OpenAQ | Global | API, open data | Aggregated global air quality data
CPCB (India) | India | Web portal, API | Central Pollution Control Board
EEA | Europe | API, downloads | European Environment Agency
PurpleAir | Global | API | Low-cost sensor network (requires calibration)
AQICN | Global | API | World Air Quality Index Project

Meteorological Data Sources

Source | Coverage | Access | Resolution
NOAA | US, Global | API, FTP | Hourly observations
OpenWeatherMap | Global | API | Current & forecast
Weather Underground | Global | API | Station-level data
ERA5 Reanalysis | Global | Download | Hourly, gridded
ECMWF | Global | API | Forecast data
Local Weather Services | Regional | Varies | Often most accurate

Data Requirements

Minimum Requirements

For Training

  • At least 1 year of historical data (2 or more preferred, so that seasonal cycles are covered)
  • Hourly or sub-hourly resolution
  • At least PM2.5 or the dominant local pollutant
  • Basic meteorology (temp, wind, humidity)
  • < 20% missing data
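
To check a dataset against the missing-data threshold above, one option is to count how many expected hourly timestamps lack a usable value. This is a sketch under the assumption that readings are held in a dict keyed by timestamp; `missing_fraction` is a hypothetical helper, not an AQI Predictor function.

```python
from datetime import datetime, timedelta


def missing_fraction(readings, start, end, step=timedelta(hours=1)):
    """Share of expected timestamps in [start, end] with no usable value.

    `readings` maps datetime -> value. Absent keys and None values
    both count as missing.
    """
    expected = 0
    missing = 0
    current = start
    while current <= end:
        expected += 1
        if readings.get(current) is None:
            missing += 1
        current += step
    return missing / expected if expected else 0.0
```

A result above 0.2 would suggest the series does not meet the training requirement as stated.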

For Inference

  • Recent pollutant measurements (lookback window)
  • Current meteorological conditions
  • Weather forecast (for longer horizons)
  • Same features used in training

Pre-processing validation:
  • Check for physically impossible values (negative concentrations)
  • Identify and flag sensor malfunctions (constant values, sudden spikes)
  • Validate against neighboring stations
  • Check for temporal consistency
  • Document data quality flags
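
Some of the checks above (impossible values, constant readings, sudden spikes) can be sketched as a single pass over one pollutant series. The thresholds `max_jump` and `max_constant_run` are hypothetical defaults and need tuning per pollutant and sensor. The flags are meant for review, not automatic deletion.

```python
def flag_readings(values, max_jump=100.0, max_constant_run=6):
    """Return one flag per reading: ok, missing, negative, spike, or stuck."""
    flags = []
    previous = None   # previous raw reading, used for the stuck-sensor check
    last_ok = None    # last reading that passed all checks, used for the spike check
    run_length = 0    # how many consecutive identical readings we have seen
    for value in values:
        if value is None:
            flags.append("missing")
            previous, run_length = None, 0
            continue
        run_length = run_length + 1 if value == previous else 1
        previous = value
        if value < 0:
            flag = "negative"  # physically impossible concentration
        elif last_ok is not None and abs(value - last_ok) > max_jump:
            flag = "spike"     # jump relative to the last trusted reading
        elif run_length >= max_constant_run:
            flag = "stuck"     # sensor repeating the same value
        else:
            flag = "ok"
            last_ok = value
        flags.append(flag)
    return flags
```

Note that a genuine, lasting step change larger than `max_jump` would keep being flagged as a spike in this sketch, which is one reason to cross-check against neighboring stations.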
Missing-data strategies by gap length:
  • Small gaps (< 3 hours): Linear interpolation
  • Moderate gaps (3-12 hours): Use neighboring stations or advanced imputation
  • Large gaps (> 12 hours): Exclude from training or use with caution
  • Systematic missing data: May indicate sensor issues; investigate
Always document imputation methods and mark imputed values.
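
The small-gap strategy can be sketched as follows, assuming an evenly spaced hourly series in which missing readings are stored as `None`. With hourly data, "less than 3 hours" corresponds to at most 2 consecutive missing values, hence the default `max_gap=2`. The function name is illustrative.

```python
def fill_small_gaps(values, max_gap=2):
    """Linearly interpolate runs of None that are at most max_gap long.

    Returns (filled, imputed), where imputed[i] is True for every value
    that was filled in. Longer gaps and gaps at either end of the series
    are left as None.
    """
    filled = list(values)
    imputed = [False] * len(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < n and values[i] is None:
            i += 1
        end = i  # index of the first valid value after the gap, or n
        gap = end - start
        if start > 0 and end < n and gap <= max_gap:
            left, right = values[start - 1], values[end]
            for k in range(start, end):
                fraction = (k - start + 1) / (gap + 1)
                filled[k] = left + (right - left) * fraction
                imputed[k] = True
    return filled, imputed
```

Keeping the `imputed` mask alongside the data is one way to follow the advice to mark imputed values.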
Temporal synchronization:
  • Align all data sources to common timestamps
  • Handle timezone conversions carefully
  • Account for averaging periods (1-hour mean, 8-hour mean, etc.)
  • Match meteorological data to station locations
  • Consider temporal lags between cause and effect
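
A minimal sketch of the first two points: convert naive local timestamps to UTC, floor them to the hour, and join pollutant and weather records on the shared timestamps. It assumes a fixed UTC offset per source, which ignores daylight saving time; sources in regions with DST need a proper timezone database instead. Both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_hour(local_time, utc_offset_hours):
    """Convert a naive local timestamp to UTC and floor it to the hour.

    Assumes a fixed offset from UTC (no daylight saving time).
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    utc_time = local_time.replace(tzinfo=tz).astimezone(timezone.utc)
    return utc_time.replace(minute=0, second=0, microsecond=0)


def join_on_hour(pollutants, weather):
    """Inner-join two {utc_hour: record} dicts on their common timestamps."""
    common = sorted(set(pollutants) & set(weather))
    return [(t, pollutants[t], weather[t]) for t in common]
```

For example, a reading stamped 05:45 local time at UTC+5:30 maps to the 00:00 UTC hour of the same day.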
Normalization approaches:
  • Min-Max Scaling: Scale to [0,1] or [-1,1] range
  • Standardization: Zero mean, unit variance (z-score)
  • Robust Scaling: Use median and IQR (handles outliers)
Important: Save scaling parameters from training data to apply consistently during inference.
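
As an illustration of fitting on training data and reusing the same parameters at inference, here is a sketch of z-score standardization with the parameters saved to JSON. The function names and the JSON layout are assumptions for this example, not an AQI Predictor format.

```python
import json
import statistics


def fit_standardizer(rows, features):
    """Compute mean and standard deviation per feature from TRAINING rows only."""
    params = {}
    for name in features:
        column = [row[name] for row in rows]
        std = statistics.pstdev(column)
        # Guard against constant columns, which would otherwise divide by zero.
        params[name] = {"mean": statistics.fmean(column), "std": std if std > 0 else 1.0}
    return params


def apply_standardizer(row, params):
    """Scale one row with previously fitted parameters (z-score)."""
    return {name: (row[name] - p["mean"]) / p["std"] for name, p in params.items()}


def save_params(params, path):
    """Write the fitted parameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle)


def load_params(path):
    """Read parameters written by save_params."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

At inference time, the idea is to call `load_params` and `apply_standardizer` only, never `fit_standardizer`, so new data is scaled exactly as the training data was.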

Data Format

AQI Predictor expects data in structured formats:
# Example CSV structure
timestamp,pm25,pm10,no2,o3,so2,co,temp,humidity,wind_speed,wind_dir,pressure
2024-01-01 00:00:00,35.2,58.1,28.3,22.1,5.2,0.6,8.5,72,2.3,180,1013.2
2024-01-01 01:00:00,38.7,62.3,31.2,19.8,5.8,0.7,8.1,75,1.8,165,1013.5
...
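
A file in the layout above could be read with the standard library as sketched below. This assumes the file starts with the header row (without the "# Example CSV structure" comment line) and that empty cells mean missing values; `load_rows` is a hypothetical helper, not the loader AQI Predictor itself uses.

```python
import csv
import io
from datetime import datetime


def load_rows(csv_file):
    """Parse rows into dicts with a datetime timestamp and float (or None) fields."""
    rows = []
    for record in csv.DictReader(csv_file):
        row = {"timestamp": datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")}
        for name, text in record.items():
            if name != "timestamp":
                # Empty or absent cells are treated as missing values.
                row[name] = float(text) if text and text.strip() else None
        rows.append(row)
    return rows


# Usage with the first data row from the example, held in memory.
sample = io.StringIO(
    "timestamp,pm25,pm10,no2,o3,so2,co,temp,humidity,wind_speed,wind_dir,pressure\n"
    "2024-01-01 00:00:00,35.2,58.1,28.3,22.1,5.2,0.6,8.5,72,2.3,180,1013.2\n"
)
rows = load_rows(sample)
```

For a file on disk, pass a handle opened with `open(path, newline="")` instead of the in-memory sample.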
Next Steps: Learn how this data feeds into machine learning models in Model Architecture.