
Overview

AQI prediction is fundamentally a time series forecasting problem with multivariate inputs. This page explores the machine learning architectures supported by AQI Predictor and the rationale behind each approach.
No single architecture is universally best. The optimal choice depends on your data characteristics, prediction horizon, computational resources, and accuracy requirements.

Model Types

Recurrent Neural Networks (RNNs)

Recurrent networks are designed to process sequential data by maintaining internal state (memory) across time steps.

Long Short-Term Memory (LSTM) Networks

LSTMs are the most popular architecture for time series prediction, including AQI forecasting.
Architecture:
Input Sequence → LSTM Layers → Dense Layers → Output
[t-n...t-1, t] → [Hidden States] → [Prediction] → [t+k]
Key Components:
  • Forget Gate: Decides what information to discard
  • Input Gate: Decides what new information to store
  • Output Gate: Decides what to output based on cell state
  • Cell State: Long-term memory carrier
Advantages:
  • Captures long-term dependencies (days, weeks)
  • Handles variable-length sequences
  • Well-suited for multiple time horizons
  • Mitigates the vanishing gradient problem
Best For:
  • Medium to long-term predictions (6-48 hours)
  • When long-term patterns matter
  • Multiple pollutants with complex interactions
Typical Configuration:
lookback_window: 24 hours
lstm_units: [128, 64]
dropout: 0.2
prediction_horizon: 24 hours
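
A minimal Keras sketch of this configuration (the layer sizes, dropout, and 24-hour window mirror the values above; n_features is a placeholder for the number of input variables):

import tensorflow as tf
from tensorflow.keras import layers

n_features = 8                     # placeholder: pollutants + weather variables
lookback, horizon = 24, 24

model = tf.keras.Sequential([
    layers.Input(shape=(lookback, n_features)),
    layers.LSTM(128, return_sequences=True, dropout=0.2),  # first stacked LSTM layer
    layers.LSTM(64, dropout=0.2),                          # second LSTM layer
    layers.Dense(horizon),                                  # one output per forecast hour
])
model.compile(optimizer="adam", loss="mse")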

Transformer-Based Models

Transformers use self-attention mechanisms to capture relationships across time steps without recurrence.

Temporal Fusion Transformer (TFT)

State-of-the-art for time series forecasting. TFT combines several advanced components:
  • Variable Selection: Learns which features are most relevant
  • Static Covariates: Incorporates time-invariant features
  • Multi-Horizon: Predicts multiple time steps simultaneously
  • Attention: Interprets which past time steps influence predictions
Architecture Highlights:
Static Features ────┐
Historical Inputs ──┼─→ Variable Selection → LSTM Encoder → 
Known Future Inputs─┘                                         
                                          → Multi-Head Attention →
                                          → Gated Residual Network →
                                          → Output Layer
Advantages:
  • Superior accuracy on complex datasets
  • Interpretable attention weights
  • Handles multiple types of inputs naturally
  • Built-in uncertainty quantification
Requirements:
  • Large datasets (2+ years recommended)
  • More computational resources
  • Longer training time
  • Hyperparameter tuning is critical
Best For:
  • Production systems with high accuracy needs
  • When interpretability matters
  • Multi-step ahead predictions
  • Rich feature sets with multiple data types

Transformer

Attention-based sequence modeling: a pure transformer architecture adapted for time series (a minimal attention sketch follows at the end of this subsection).
Key Mechanisms:
  • Self-Attention: Captures relationships between all time steps
  • Positional Encoding: Injects temporal order information
  • Multi-Head Attention: Multiple attention patterns
Advantages:
  • Parallelizable (faster training than RNNs)
  • No vanishing gradient problems
  • Can capture long-range dependencies
Challenges:
  • Requires more data than RNNs
  • Can overfit on smaller datasets
  • Less inductive bias for temporal structure
Best For:
  • Very large datasets
  • When training time matters
  • Long sequences (> 48 hours lookback)
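
To make these mechanisms concrete, here is a minimal sketch of sinusoidal positional encoding plus multi-head self-attention in TensorFlow/Keras; the sequence length, model width, and head count are illustrative, not tuned values:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding injects temporal order into the otherwise order-agnostic attention
    pos = np.arange(seq_len)[:, None].astype("float32")
    i = np.arange(d_model)[None, :].astype("float32")
    angles = pos / np.power(10000.0, (2.0 * np.floor(i / 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles[None, ...].astype("float32")        # shape (1, seq_len, d_model)

seq_len, d_model = 48, 64
x = tf.random.normal((32, seq_len, d_model))          # a batch of 32 embedded sequences (illustrative)
x = x + positional_encoding(seq_len, d_model)         # add temporal order information
attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)
out = attn(x, x)                                      # self-attention across all 48 time steps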

Hybrid Architectures

Combining multiple architectural components often yields the best results.

CNN-LSTM

Convolutional layers extract local patterns, while the LSTM captures temporal dependencies (a minimal sketch follows the use cases below).
Input → 1D CNN → LSTM → Dense → Output
Use Case:
  • Extract local temporal patterns (hourly cycles)
  • Good for multi-sensor or spatial data
  • Reduces sequence length for LSTM
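
A minimal CNN-LSTM sketch in Keras, assuming a 24-hour lookback, 8 input features, and a 24-hour horizon (all illustrative):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(24, 8)),
    layers.Conv1D(64, kernel_size=3, padding="causal", activation="relu"),  # local pattern extraction
    layers.MaxPooling1D(pool_size=2),          # halves the sequence length fed to the LSTM
    layers.LSTM(64),                           # temporal dependencies over the pooled sequence
    layers.Dense(24),                          # 24-hour forecast
])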

Encoder-Decoder

Separate encoding and decoding phases
Encoder: Input Sequence → Context Vector
Decoder: Context → Multi-step Output
Use Case:
  • Multi-step ahead prediction
  • Sequence-to-sequence mapping
  • When output length differs from input
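
A minimal encoder-decoder sketch in Keras: the encoder compresses the input window into a context vector, which the decoder unrolls over the forecast horizon (shapes are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

lookback, horizon, n_features = 24, 24, 8              # illustrative values

inputs = layers.Input(shape=(lookback, n_features))
context = layers.LSTM(64)(inputs)                      # encoder: sequence -> context vector
repeated = layers.RepeatVector(horizon)(context)       # feed the context to every decoder step
decoded = layers.LSTM(64, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(layers.Dense(1))(decoded)  # one prediction per horizon step
model = keras.Model(inputs, outputs)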

Attention + LSTM

LSTM with attention mechanism: an attention layer helps the model focus on the most relevant past time steps.
Use Case:
  • Improved accuracy over plain LSTM
  • Interpretable predictions
  • Long sequences
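
One simple way to combine the two in Keras is dot-product attention over the LSTM's per-step hidden states (a sketch, not the only possible formulation):

from tensorflow import keras
from tensorflow.keras import layers

inputs = layers.Input(shape=(24, 8))                    # illustrative lookback and feature count
h = layers.LSTM(64, return_sequences=True)(inputs)      # hidden state at every time step
attended = layers.Attention()([h, h])                   # dot-product attention over past steps
context = layers.GlobalAveragePooling1D()(attended)     # pool the attended sequence
outputs = layers.Dense(24)(context)                     # 24-hour forecast
model = keras.Model(inputs, outputs)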

Ensemble Models

Combine predictions from multiple models: an average or weighted combination of LSTM, GRU, and Transformer outputs.
Use Case:
  • Maximum accuracy
  • Reduce model variance
  • Production systems
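
A simple weighted-averaging sketch, assuming each base model is already trained and exposes a predict method:

import numpy as np

def ensemble_predict(models, X, weights=None):
    # Stack base-model forecasts and take their (weighted) average; equal weights by default
    preds = np.stack([m.predict(X) for m in models])
    if weights is None:
        weights = np.ones(len(models)) / len(models)
    return np.tensordot(np.asarray(weights), preds, axes=1)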

Training Considerations

Input/Output Configuration

Univariate Prediction
Input: PM2.5[t-24:t]
Output: PM2.5[t+1]
  • Predict one pollutant based on its history
  • Simpler, faster training
  • Limited by single variable view
Multivariate Prediction
Input: [PM2.5, PM10, NO2, O3, temp, wind, ...][t-24:t]
Output: PM2.5[t+1]
  • Use multiple variables to predict target
  • Captures cross-pollutant relationships
  • Better accuracy but more complex
  • Recommended approach
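
A sketch of how the multivariate sliding windows can be built from an hourly feature matrix; the column order and the assumption of a gap-free hourly index are hypothetical:

import numpy as np

def make_windows(data, lookback=24, horizon=1, target_col=0):
    # data: array of shape (n_hours, n_features), hourly and gap-free (assumption)
    X, y = [], []
    for t in range(lookback, len(data) - horizon + 1):
        X.append(data[t - lookback:t])                # all features over the lookback window
        y.append(data[t + horizon - 1, target_col])   # target pollutant at t + horizon - 1
    return np.array(X), np.array(y)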

Loss Functions

Mean Squared Error (MSE): standard regression loss
MSE = (1/n) Σ(y_true - y_pred)²
Characteristics:
  • Penalizes large errors heavily (quadratic)
  • Sensitive to outliers
  • Most common choice
Use when: Outliers are truly errors and should be heavily penalized
Mean Absolute Error (MAE): robust to outliers
MAE = (1/n) Σ|y_true - y_pred|
Characteristics:
  • Linear penalty
  • More robust to outliers
  • All errors weighted equally
Use when: Dataset has many outliers or measurement noise
Huber Loss: hybrid of MSE and MAE
Huber = MSE for small errors, MAE for large errors
Characteristics:
  • Best of both worlds
  • Robust but still sensitive to large errors
  • Requires delta parameter tuning
Use when: Want balance between MSE and MAE
Quantile Loss: for probabilistic predictions
QuantileLoss(τ) = max(τ(y - ŷ), (τ-1)(y - ŷ))
Characteristics:
  • Predicts specific quantiles (e.g., P10, P50, P90)
  • Asymmetric penalty
  • Produces prediction intervals
Use when: Need uncertainty quantification or risk-based decisions
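
The quantile (pinball) loss above translates directly into code; a NumPy sketch for a single quantile τ:

import numpy as np

def quantile_loss(y_true, y_pred, tau=0.9):
    # Pinball loss: under-prediction is penalized by tau, over-prediction by (1 - tau)
    e = y_true - y_pred
    return np.mean(np.maximum(tau * e, (tau - 1) * e))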

Regularization Techniques

Dropout

Randomly deactivate neurons during training to prevent co-adaptation.
Typical rates: 0.2-0.5
Apply to: Dense layers, recurrent connections

Recurrent Dropout

Dropout applied to the recurrent connections in LSTM/GRU layers.
Typical rates: 0.1-0.3
Careful: too high a rate degrades temporal learning

L1/L2 Regularization

Penalize large weights in the loss function.
L2 (Ridge): smooth weight decay
L1 (Lasso): sparse weights
Typical values: 1e-5 to 1e-3

Early Stopping

Stop training when the validation loss stops improving.
Patience: 10-20 epochs
Often the most effective regularization technique
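
In Keras these techniques map onto layer arguments and callbacks; a hedged sketch with illustrative shapes and rates:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(24, 8)),
    layers.LSTM(64, dropout=0.2, recurrent_dropout=0.1,
                kernel_regularizer=regularizers.l2(1e-4)),   # dropout + L2 weight decay
    layers.Dense(24),
])
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=15, restore_best_weights=True)  # early stopping
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])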

Model Evaluation

Metrics

Use multiple metrics to get a complete picture of model performance. Different metrics emphasize different aspects.
Metric | Formula | Interpretation | Use Case
RMSE | √(MSE) | Same units as target | Overall accuracy, penalizes large errors
MAE | Mean(|y - ŷ|) | Same units, robust to outliers | Typical prediction error
MAPE | Mean(|y - ŷ| / y) × 100 | Percentage error | Relative accuracy across scales
R² | 1 - (SS_res / SS_tot) | Variance explained (0-1) | Model quality vs baseline
IA | Index of Agreement | 0-1, how well model matches observations | Overall performance
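
NumPy sketches of these metrics (IA follows Willmott's index of agreement; MAPE assumes no zero observations):

import numpy as np

def evaluate(y, yhat):
    resid = y - yhat
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    mape = np.mean(np.abs(resid / y)) * 100                     # undefined if y contains zeros
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    ia = 1 - np.sum(resid ** 2) / np.sum(
        (np.abs(yhat - y.mean()) + np.abs(y - y.mean())) ** 2)  # Willmott's index of agreement
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2, "IA": ia}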

Validation Strategy

Forward-chaining validation
Train: [────────────────] (2020-2022)
Valid: [────] (2023 Q1)
Test:  [────] (2023 Q2)
  • Respect temporal order (no future data in training)
  • Split chronologically
  • Always use this for time series
Never use random K-fold cross-validation for time series; shuffling leaks future information into the training folds.
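
A chronological split reduces to plain index slicing on time-ordered window arrays; a minimal sketch with illustrative fractions:

def chronological_split(X, y, train_frac=0.70, valid_frac=0.15):
    # Split time-ordered windows without shuffling: train, then validation, then test
    n = len(X)
    i, j = int(n * train_frac), int(n * (train_frac + valid_frac))
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])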

Architecture Selection Guide

Start simple and add complexity only if needed. An LSTM or GRU is sufficient for most AQI prediction tasks.

Quick Recommendations

Scenario | Recommended Architecture | Lookback | Horizon
Quick prototype | GRU (64 units) | 12h | 6h
Standard production | LSTM (128, 64 units) | 24h | 24h
High-accuracy system | TFT or Ensemble | 48h | 48h
Limited compute | GRU (single layer) | 6h | 3h
Research / state-of-the-art | Transformer + Attention | 72h | 72h
Next Steps: With this understanding of architectures, you’re ready to explore the Quick Start Guide to begin training your first model.