NGED Data API
nged_data.ckan
Generic CKAN client for interacting with NGED's Connected Data portal.
Functions
get_primary_substation_locations(api_key)
Note that 'Park Lane' appears twice (with different substation numbers).
Source code in packages/nged_data/src/nged_data/ckan.py
nged_data.cleaning
Data cleaning logic for substation flows.
This module provides functions to identify and handle problematic telemetry data:

- "Stuck" values: values where the rolling standard deviation is below a threshold, indicating a sensor that hasn't changed reading.
- "Insane" values: values that fall outside physically plausible min/max bounds.
All operations use backward-looking rolling windows and group by substation to prevent data leakage. Bad values are replaced with null to preserve the temporal grid rather than being removed or imputed.
Functions
clean_substation_flows(df, settings, group_by_cols=None)
Clean substation flows by replacing stuck and insane values with null.
This function identifies problematic telemetry data and replaces these values with null to preserve the temporal grid. It performs the following checks:
- Stuck sensor detection: uses a backward-looking rolling standard deviation (24-hour window) to detect sensors that haven't changed reading. Values are marked as stuck if the rolling standard deviation falls below `settings.data_quality.stuck_std_threshold`. We use time-based rolling to correctly handle different temporal resolutions and missing data.
- Insane value detection: flags values outside the range `[min_mw_threshold, max_mw_threshold]` as insane. This captures physically implausible readings (e.g., MW > 100 for primary substations, or extreme negative values).
Both checks are performed per-substation to prevent cross-substation data leakage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Raw substation flows DataFrame with 'MW' and 'MVA' columns. | *required* |
| `settings` | `Settings` | Configuration containing data quality thresholds. | *required* |
| `group_by_cols` | `Sequence[str] \| None` | Columns to group by for per-substation operations. Defaults to `["substation_number"]`. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with stuck and insane values replaced with null. The temporal grid is preserved (no rows removed, no imputation). |
Example

```python
from contracts.settings import Settings

df = clean_substation_flows(raw_df, Settings(...))
```
Notes
- All rolling window operations are backward-looking to prevent data leakage.
- Null values from the original data are preserved as-is.
- Both 'MW' and 'MVA' columns are checked independently.
- After cleaning, downstream logic should use `pl.coalesce(['MW', 'MVA'])` to create the `MW_or_MVA` column.
Source code in packages/nged_data/src/nged_data/cleaning.py
nged_data.process_flows
Classes
MilfordHavenParser
Parser for Milford Haven format (unit=MW).
Source code in packages/nged_data/src/nged_data/process_flows.py
ParserStrategy
Bases: Protocol
Protocol for CSV parsing strategies.
Source code in packages/nged_data/src/nged_data/process_flows.py
Functions
can_parse(df)
Return True if this strategy can parse the given DataFrame.
Source code in packages/nged_data/src/nged_data/process_flows.py
parse(df)
Parse the DataFrame and return it with standardized column names.
Source code in packages/nged_data/src/nged_data/process_flows.py
RegentStreetParser
Parser for Regent Street format (unit=MVA).
Source code in packages/nged_data/src/nged_data/process_flows.py
StandardFormatParser
Parser for standard NGED formats with various column names.
Source code in packages/nged_data/src/nged_data/process_flows.py
Functions
process_live_primary_substation_power_flows(csv_data)
Read a primary substation CSV and validate it against the schema.
Source code in packages/nged_data/src/nged_data/process_flows.py
nged_data.schemas
nged_data.substation_names.align
Functions
join_location_table_to_live_primaries(locations, live_primaries)
Joins the substation locations dataset to the list of live primary flow resources.
This function performs a "dance" of normalization because NGED uses slightly different naming conventions in different datasets. For example:

- The locations dataset might use "Abington 33/11kv".
- The live flows dataset might use "Abington Primary Transformer Flows".
We normalize these by stripping out common suffixes and voltage information to create a "simple_name" that can be used for joining. We also use a manual mapping file for cases where the automated normalization isn't enough.
Returns a DataFrame that can be validated against SubstationMetadata.
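The normalization might look roughly like this sketch. The `simple_name` helper and the suffix list are assumptions based on the examples above, not the module's actual rules:

```python
import re

def simple_name(name: str) -> str:
    """Lower-case a substation name and strip voltage info and common suffixes."""
    s = name.lower().strip()
    s = re.sub(r"\b\d+/\d+\s*kv\b", "", s)  # drop voltage fragments like '33/11kv'
    for suffix in (" primary transformer flows", " transformer flows", " primary"):
        if s.endswith(suffix):
            s = s.removesuffix(suffix)
    return s.strip()

a = simple_name("Abington 33/11kv")
b = simple_name("Abington Primary Transformer Flows")
# Both normalize to the same join key: "abington"
```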
Source code in packages/nged_data/src/nged_data/substation_names/align.py
nged_data.utils
Functions
ensure_utc_timestamp_lazy(lf)
Ensures the timestamp column is UTC timezone-aware lazily.
This handles cases where Delta tables lose timezone metadata or Polars scans them as naive. It also handles non-UTC timezones by converting them.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `lf` | `LazyFrame` | Polars LazyFrame with a 'timestamp' column. | *required* |
Returns:
| Type | Description |
|---|---|
| `LazyFrame` | LazyFrame with UTC-aware 'timestamp' column. |
Source code in packages/nged_data/src/nged_data/utils.py
find_one_match(predicate, haystack)
Find exactly one match in haystack. Raise a ValueError if haystack is empty, or if there is more than 1 match.
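A plausible implementation of this helper (a sketch; the real one may differ in its exact error handling):

```python
from collections.abc import Callable, Iterable
from typing import TypeVar

T = TypeVar("T")

def find_one_match(predicate: Callable[[T], bool], haystack: Iterable[T]) -> T:
    """Return the single item satisfying predicate; raise ValueError otherwise."""
    matches = [item for item in haystack if predicate(item)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 match, found {len(matches)}")
    return matches[0]

result = find_one_match(lambda s: s.startswith("ab"), ["abington", "cardiff"])
# result == "abington"
```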
Source code in packages/nged_data/src/nged_data/utils.py
get_partition_window(partition_key, lookback_days=0)
Calculates the partition start, end, and lookback start times.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `partition_key` | `str` | The Dagster partition key (ISO format date string). | *required* |
| `lookback_days` | `int` | Number of days to look back from the partition start. | `0` |
Returns:
| Type | Description |
|---|---|
| `tuple[datetime, datetime, datetime]` | A tuple of (partition_start, partition_end, lookback_start). |
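Assuming daily Dagster partitions, the window calculation might be sketched as follows (a hypothetical reimplementation, not the module's exact code):

```python
from datetime import datetime, timedelta, timezone

def get_partition_window(partition_key: str, lookback_days: int = 0):
    """Return (partition_start, partition_end, lookback_start) for a daily partition."""
    start = datetime.fromisoformat(partition_key).replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1)  # daily partitioning assumed
    lookback_start = start - timedelta(days=lookback_days)
    return start, end, lookback_start

start, end, lookback_start = get_partition_window("2024-06-01", lookback_days=1)
# Window: 2024-06-01 to 2024-06-02, with lookback reaching back to 2024-05-31.
```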
Source code in packages/nged_data/src/nged_data/utils.py
scan_delta_table(path)
Scans a Delta table and ensures the timestamp column is UTC-aware.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to the Delta table. | *required* |
Returns:
| Type | Description |
|---|---|
| `LazyFrame` | Polars LazyFrame with UTC-aware 'timestamp' column (if present). |
Source code in packages/nged_data/src/nged_data/utils.py