Architecture Overview

The system is designed as a modular monorepo using uv workspaces, with Dagster orchestrating the data pipeline and MLflow tracking experiments.

Core Components

  • Environment & Modularity: uv workspace (monorepo) targeting Python 3.14. Each component must be pip-installable on its own and carry expressive type hints.
  • Data Processing: Polars. Chosen for its speed and for its native join_asof, which with the default backward strategy matches each row only to records at or before its timestamp, helping to prevent future-data leakage during feature engineering.
    • Centralized Data Preparation: All data entering ML models passes through a centralized preparation step to enforce strict data contracts, handle missing entities, and ensure consistency between training and inference.
  • Storage: Delta Lake on cloud object storage for the power data and forecasts. This replaces raw Parquet and provides ACID transactions, time travel, and efficient partitioning (by year/month) for K-fold cross-validation. The weather (NWP) data stays in raw Parquet with well-defined naming conventions, because Delta Lake does not support uint8.
  • Orchestration: Dagster. Manages the pipeline via Software-Defined Assets (SDAs). Assets are partitioned by NWP init time rather than by substation, so a model can optionally train globally across all substations. Dagster is also responsible for detecting bad data.
  • Configuration Management: Hydra combined with pydantic-settings. Dagster passes string overrides (e.g., model=xgboost_global) that select Hydra Config Groups, swapping in the entire parameter tree for a given model architecture. The resolved YAML config is logged to MLflow.
  • Experiment Tracking: MLflow.
  • Visualisation: Altair for plotting, Marimo for interactive data exploration and web apps.

The Universal Model Interface

To decouple the Dagster data pipeline from the ML code, all models are saved using native MLflow flavors (e.g., mlflow.xgboost.log_model), which serialize the raw model object directly (see ADR-005).

The Adapter Pattern: The model wrapper encapsulates the model weights and all translation logic.

  • Input translation: Transforms the canonical Polars DataFrame into the required model shape.
  • Output translation: Converts native model outputs into the strict target quantile schema.