Wranglers Module

Wranglers for transforming various data sources to the farmnet data format.

Configuration file

Configuration file that maps data source files to the farmnet data format.

farmnet.data.wranglers.get_column_mapping(config_path: str | Path | None = None) → dict

Retrieve the column mapping configuration.

This function reads the dataset’s column mapping from the configuration file and returns a dictionary where keys are column names from the source dataset, and values are their corresponding standardized names.

Parameters:

config_path (str | Path | None) – Path to the configuration file. If None, the default configuration is used.

Returns:

A dictionary mapping source column names to standardized column names.

Return type:

dict

Examples:

When the path to a configuration file is given, the column mapping of that file is returned:

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> ds = get_column_mapping(default_cfg_path)
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "Nacelle position (°)": "nacelle_direction",
    "Power (kW)": "power",
    "Wind direction (°)": "wind_direction",
    "Wind speed (m/s)": "wind_speed",
    "Wind turbine ID": "wt_id"
}
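
The mapping printed above can be related to the configuration file itself. The following is a minimal sketch, assuming the configuration holds a "columns" list whose entries carry the keys "name" and "name-from-source" (the layout shown in the configuration dump of the set_default_cfg() example). build_column_mapping is a hypothetical helper, not part of farmnet, and the library's actual implementation may differ:

```python
from __future__ import annotations

# Parsed configuration excerpt, copied from the documented configuration dump.
cfg = {
    "columns": [
        {"name": "wind_direction", "name-from-source": "Wind direction (°)"},
        {"name": "nacelle_direction", "name-from-source": "Nacelle position (°)"},
        {"name": "wind_speed", "name-from-source": "Wind speed (m/s)"},
        {"name": "power", "name-from-source": "Power (kW)"},
        {"name": "wt_id", "name-from-source": "Wind turbine ID"},
    ]
}


def build_column_mapping(cfg: dict) -> dict[str, str]:
    """Hypothetical helper: map source column names to standardized names."""
    return {col["name-from-source"]: col["name"] for col in cfg["columns"]}


mapping = build_column_mapping(cfg)
print(mapping["Power (kW)"])
```

The keys of the result are the column names as they appear in the source file, and the values are the standardized names, matching the shape of the dictionary documented for get_column_mapping().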

If config_path is None, get_column_mapping() returns the column mapping of the default configuration set with set_default_cfg():

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> set_default_cfg(default_cfg_path)
>>> ds = get_column_mapping()
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "Nacelle position (°)": "nacelle_direction",
    "Power (kW)": "power",
    "Wind direction (°)": "wind_direction",
    "Wind speed (m/s)": "wind_speed",
    "Wind turbine ID": "wt_id"
}

farmnet.data.wranglers.get_csv_fmt(config_path: str | Path | None = None) → dict

Get the CSV format configuration.

Parameters:

config_path (str | Path | None) – Path to the configuration file. If None, the default configuration is used.

Returns:

A dictionary containing csv formatting details.

Return type:

dict

The returned dictionary contains:

  • encoding (str): Encoding to use when reading the file (e.g. 'utf-8').

  • sep (str): Character or regex pattern to treat as the delimiter.

  • header (int): Zero-indexed row number that contains the column labels and marks the start of the data.
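
To make the three options concrete, here is a minimal sketch using only the standard library rather than pandas. The sample bytes are invented for illustration, and the standard csv module accepts only a single-character delimiter, so the regex form of sep is not covered:

```python
import csv
import io

# Format options shaped like the documented return value of get_csv_fmt().
csv_fmt = {"encoding": "utf8", "sep": ",", "header": 0}

# Illustrative file content (invented values), as bytes on disk.
raw = "Wind speed (m/s),Power (kW)\n6.5,630.0\n7.0,809.5\n".encode(csv_fmt["encoding"])

# encoding: how the bytes are decoded into text.
text = raw.decode(csv_fmt["encoding"])

# sep: the delimiter that separates fields within a row.
rows = list(csv.reader(io.StringIO(text), delimiter=csv_fmt["sep"]))

# header: zero-indexed row holding the column labels; data starts after it.
columns = rows[csv_fmt["header"]]
records = rows[csv_fmt["header"] + 1:]

print(columns)
print(records)
```

With pandas, the same dictionary can be unpacked directly into the reader, since encoding, sep, and header are all parameters of pandas.read_csv().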

Examples:

When the path to a configuration file is given, the CSV configuration of that file is returned:

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> ds = get_csv_fmt(default_cfg_path)
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "encoding": "utf8",
    "header": 0,
    "sep": ","
}

If config_path is None, get_csv_fmt() returns the CSV format of the default configuration set with set_default_cfg():

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> set_default_cfg(default_cfg_path)
>>> ds = get_csv_fmt()
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "encoding": "utf8",
    "header": 0,
    "sep": ","
}

farmnet.data.wranglers.get_dataset(config_path: str | Path | None = None) → dict

Return the source location of a dataset.

Parameters:

config_path (str | Path | None) – Path to the configuration file. If None, the default configuration is used.

Returns:

A dictionary describing where the dataset is stored.

Return type:

dict

The returned dictionary contains:

  • root_dir (str): Root directory of the database.

  • data (str): Name of the data file.

  • static (str): Name of the static file.
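
One way these three entries might be combined into file paths is sketched below. This assumes that both files live directly under root_dir, which the documentation does not state; dataset_paths and the base directory "data" are hypothetical and not part of farmnet:

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Dictionary shaped like the documented return value of get_dataset().
dataset = {
    "root_dir": "kelmarsh_data_imputation",
    "data": "featured_windeurope_data.parquet",
    "static": "Kelmarsh_WT_static.csv",
}


def dataset_paths(ds: dict, base: str = ".") -> tuple[PurePosixPath, PurePosixPath]:
    """Hypothetical helper: build the data and static file paths.

    Assumption: both files are stored directly inside ds["root_dir"].
    """
    root = PurePosixPath(base) / ds["root_dir"]
    return root / ds["data"], root / ds["static"]


data_path, static_path = dataset_paths(dataset, base="data")
print(data_path)
print(static_path)
```

PurePosixPath is used so that the printed result does not depend on the operating system; pathlib.Path would be the usual choice when the files are actually opened.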

Examples:

When the path to a configuration file is given, the dataset directory, data file, and static file are returned:

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> ds = get_dataset(default_cfg_path)
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "data": "featured_windeurope_data.parquet",
    "root_dir": "kelmarsh_data_imputation",
    "static": "Kelmarsh_WT_static.csv"
}

If config_path is None, get_dataset() returns the path information of the default configuration set with set_default_cfg():

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> set_default_cfg(default_cfg_path)
>>> ds = get_dataset()
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "data": "featured_windeurope_data.parquet",
    "root_dir": "kelmarsh_data_imputation",
    "static": "Kelmarsh_WT_static.csv"
}

farmnet.data.wranglers.get_index_fmt(config_path: str | Path | None = None) → dict

Retrieve index format configuration.

This function extracts index-related configuration details from a given configuration file. If no file is provided, it uses the default configuration.

Parameters:

config_path (str | Path | None) – Path to the configuration file. If None, the default configuration is used.

Returns:

A dictionary containing index formatting details.

Return type:

dict

The returned dictionary contains:

  • name_mapping (tuple[str, str]): Mapping of source index name to target index name.

  • dt_format (str): Date/time format unit.

  • tz_mapping (tuple[str, str]): Mapping of source time zone to target time zone.
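
The relation between these entries and the raw "index" table of the configuration can be sketched as follows. The input is copied from the configuration dump in the set_default_cfg() example; build_index_fmt is a hypothetical helper, and the pairing of keys is inferred from the documented outputs rather than from farmnet's source:

```python
from __future__ import annotations

# "index" table as it appears in the documented configuration dump.
index_cfg = {
    "name": "datetime",
    "name-from-source": "# Date and time",
    "time-zone": "UTC",
    "time-zone-from-source": "UTC",
    "unit": "ns",
}


def build_index_fmt(index_cfg: dict) -> dict:
    """Hypothetical helper: reshape the table into (source, target) pairs."""
    return {
        "name_mapping": (index_cfg["name-from-source"], index_cfg["name"]),
        "dt_format": index_cfg["unit"],
        "tz_mapping": (index_cfg["time-zone-from-source"], index_cfg["time-zone"]),
    }


index_fmt = build_index_fmt(index_cfg)
source_name, target_name = index_fmt["name_mapping"]
print(f"{source_name!r} is renamed to {target_name!r}")
```

Note that json.dumps() serializes tuples as JSON arrays, which is why the tuple-valued entries appear in square brackets in the examples.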

Examples:

When the path to a configuration file is given, the index-related configuration details of that file are returned:

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> ds = get_index_fmt(default_cfg_path)
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "dt_format": "ns",
    "name_mapping": [
        "# Date and time",
        "datetime"
    ],
    "tz_mapping": [
        "UTC",
        "UTC"
    ]
}

If config_path is None, get_index_fmt() returns the index-related configuration details of the default configuration set with set_default_cfg():

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> set_default_cfg(default_cfg_path)
>>> ds = get_index_fmt()
>>> print(json.dumps(ds, indent=4, sort_keys=True, ensure_ascii=False))
{
    "dt_format": "ns",
    "name_mapping": [
        "# Date and time",
        "datetime"
    ],
    "tz_mapping": [
        "UTC",
        "UTC"
    ]
}

farmnet.data.wranglers.read_raw(fpath: Path | str, *, csv_fmt: dict | None = None, index_fmt: dict | None = None, **kwargs) → DataFrame

Read a raw data file.

Parameters:
  • fpath (Path or str) – Path to the data file.

  • csv_fmt (dict or None) – Dictionary of CSV format options to be passed to pandas.read_csv(). If not provided, the output of get_csv_fmt() is used.

  • index_fmt (dict or None) – Information about the index of the returned data. If not provided, the output of get_index_fmt() is used.

  • kwargs – Additional keyword arguments passed to pandas.read_csv().
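
The fallback behaviour of csv_fmt described above can be sketched as a small option-resolution step. Everything here is illustrative: default_csv_fmt stands in for get_csv_fmt(), resolve_read_options is a hypothetical helper, and the rule that explicit keyword arguments override the format dictionary is an assumption, not documented farmnet behaviour:

```python
from __future__ import annotations


def default_csv_fmt() -> dict:
    """Stand-in for get_csv_fmt(); returns the documented example values."""
    return {"encoding": "utf8", "sep": ",", "header": 0}


def resolve_read_options(csv_fmt: dict | None = None, **kwargs) -> dict:
    """Hypothetical helper: collect the keyword arguments for the CSV reader."""
    # Fall back to the default format only when no dictionary is provided.
    fmt = default_csv_fmt() if csv_fmt is None else csv_fmt
    # Assumption: explicit keyword arguments take precedence over fmt.
    return {**fmt, **kwargs}


defaults = resolve_read_options()
custom = resolve_read_options({"sep": ";"}, nrows=100)
print(defaults)
print(custom)
```

The resulting dictionary could then be unpacked into pandas.read_csv(), for which encoding, sep, header, and nrows are all valid parameters.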

Returns:

Formatted raw data.

Return type:

pandas.DataFrame

Examples:

>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> set_default_cfg(default_cfg_path)
>>> download_path = Path(getenv("DOWNLOAD_PATH", "./data"))
>>> dataset = get_dataset()
>>> data_path = download_path / dataset["data"]
>>> raw_data_path = download_path / "kelmarsh_raw.csv"
>>> df_raw = read_raw(raw_data_path)
>>> print(df_raw.to_string(max_cols=5, max_rows=10))
                           Wind speed (m/s)  Wind speed, Standard deviation (m/s)  ...  Tower Acceleration Y, StdDev (mm/ss)  Wind turbine ID
datetime                                                                           ...
2022-01-01 00:00:00+00:00          6.781222                              1.182439  ...                             11.422541              228
2022-01-01 00:10:00+00:00          6.936052                              1.287222  ...                             16.457248              228
2022-01-01 00:20:00+00:00          7.294642                              1.430000  ...                             16.063823              228
2022-01-01 00:30:00+00:00          8.080467                              1.023509  ...                             18.288907              228
2022-01-01 00:40:00+00:00          7.021328                              1.066915  ...                             22.059917              228
...                                     ...                                   ...  ...                                   ...              ...
2022-12-31 23:10:00+00:00          8.712688                              1.216442  ...                             28.254154              233
2022-12-31 23:20:00+00:00          9.149686                              1.182500  ...                             15.370069              233
2022-12-31 23:30:00+00:00          9.571797                              1.619526  ...                             13.412479              233
2022-12-31 23:40:00+00:00          9.549912                              1.504496  ...                             18.748812              233
2022-12-31 23:50:00+00:00          9.215081                              1.208763  ...                             19.858008              233
>>> df_raw.index.name
'datetime'
>>> len(df_raw)
315360

farmnet.data.wranglers.set_default_cfg(config_path: str | Path)

Set the default configuration file.

Parameters:

config_path (str | Path) – Path to the configuration file to use as the default.
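
The interplay between a stored default and the optional config_path argument of the getter functions can be pictured as follows. This is a minimal sketch that assumes the default is kept as a path in a module-level variable; set_default_path and resolve_config_path are hypothetical names, and farmnet's actual mechanism may differ:

```python
from __future__ import annotations

from pathlib import Path

# Module-level default; None means that no default has been configured yet.
_default_path: Path | None = None


def set_default_path(config_path: str | Path) -> None:
    """Hypothetical counterpart of set_default_cfg(): remember a default file."""
    global _default_path
    _default_path = Path(config_path)


def resolve_config_path(config_path: str | Path | None = None) -> Path:
    """Return the explicit path if given, otherwise the stored default."""
    if config_path is not None:
        return Path(config_path)
    if _default_path is None:
        raise RuntimeError("no default configuration has been set")
    return _default_path


set_default_path("examples/kelmarsh.toml")
print(resolve_config_path())              # falls back to the stored default
print(resolve_config_path("other.toml"))  # an explicit path is used as given
```

Raising an error when no default exists is a design choice of this sketch; the documentation does not say how farmnet behaves in that case.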

Examples:

After set_default_cfg() is called, the global configuration holds the contents of the given configuration file:

>>> import json
>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> rcConfig_test = _test_global_cfg(set_default_cfg, default_cfg_path)
>>> print(json.dumps(rcConfig_test, indent=4, sort_keys=True, ensure_ascii=False))
{
    "columns": [
        {
            "name": "wind_direction",
            "name-from-source": "Wind direction (°)"
        },
        {
            "name": "nacelle_direction",
            "name-from-source": "Nacelle position (°)"
        },
        {
            "name": "wind_speed",
            "name-from-source": "Wind speed (m/s)"
        },
        {
            "name": "power",
            "name-from-source": "Power (kW)"
        },
        {
            "name": "wt_id",
            "name-from-source": "Wind turbine ID"
        }
    ],
    "csv": {
        "encoding": "utf8",
        "header": 0,
        "sep": ","
    },
    "dataset": {
        "data": "featured_windeurope_data.parquet",
        "root_dir": "kelmarsh_data_imputation",
        "static": "Kelmarsh_WT_static.csv"
    },
    "index": {
        "name": "datetime",
        "name-from-source": "# Date and time",
        "time-zone": "UTC",
        "time-zone-from-source": "UTC",
        "unit": "ns"
    }
}

farmnet.data.wranglers.to_farmnet(df: DataFrame, *, column_mapping: dict) → DataFrame

Transform a dataframe containing raw data into a FarmNet dataframe for use with the FarmNet data pipeline.

The FarmNet dataframe is defined in the FarmNet data manifest and is used as a data interface for the FarmNet data pipeline.

Parameters:
  • df (pd.DataFrame) – DataFrame containing raw data.

  • column_mapping (dict) – Dictionary mapping raw column names to FarmNet column names.

Returns:

A transformed FarmNet-compatible DataFrame.

Return type:

pandas.DataFrame

Examples:

>>> default_cfg_path = Path(getenv("CONFIG_PATH", "examples/kelmarsh.toml"))
>>> set_default_cfg(default_cfg_path)
>>> dataset = get_dataset()
>>> download_path = Path(getenv("DOWNLOAD_PATH", "./data"))
>>> raw_data_path = download_path / "kelmarsh_raw.csv"
>>> df_raw = read_raw(raw_data_path)
>>> column_mapping = get_column_mapping()
>>> df_farmnet = to_farmnet(df_raw, column_mapping=column_mapping)
>>> print(df_farmnet.to_string(max_cols=6, max_rows=10))
                        wind_direction  nacelle_direction  wind_speed        power  wt_id
datetime
2022-01-01 00:00:00+00:00      185.795348         193.731354    6.781222   630.889598    228
2022-01-01 00:10:00+00:00      189.458687         193.731354    6.936052   809.339449    228
2022-01-01 00:20:00+00:00      188.648729         193.731354    7.294642   893.607333    228
2022-01-01 00:30:00+00:00      188.826550         193.731354    8.080467   995.583734    228
2022-01-01 00:40:00+00:00      191.252213         193.731354    7.021328   926.519441    228
...                                   ...                ...         ...          ...    ...
2022-12-31 23:10:00+00:00      210.193670         205.457916    8.712688  1447.101428    233
2022-12-31 23:20:00+00:00      208.465164         205.457916    9.149686  1572.766687    233
2022-12-31 23:30:00+00:00      213.539677         205.457916    9.571797  1653.457245    233
2022-12-31 23:40:00+00:00      213.684894         205.457916    9.549912  1670.531378    233
2022-12-31 23:50:00+00:00      209.309463         205.457916    9.215081  1563.665674    233
>>> id(df_farmnet) == id(df_raw)
False
>>> list(df_farmnet.columns) == list(column_mapping.values())
True
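
The properties checked at the end of the example (a new object is returned, and its columns follow the order of the mapping values) can be illustrated without pandas. The sketch below works on plain dictionaries with invented sample values; select_and_rename is a hypothetical stand-in and makes no claim about how to_farmnet() is implemented:

```python
from __future__ import annotations


def select_and_rename(rows: list[dict], column_mapping: dict[str, str]) -> list[dict]:
    """Hypothetical stand-in for to_farmnet() on plain dictionaries.

    Keeps only the mapped columns, renames them, and orders them as in
    column_mapping. The input rows are left untouched.
    """
    return [
        {target: row[source] for source, target in column_mapping.items()}
        for row in rows
    ]


# Invented sample rows; "Unmapped column" has no entry in the mapping.
raw_rows = [
    {"Wind speed (m/s)": 6.78, "Power (kW)": 630.9, "Unmapped column": 1.0},
    {"Wind speed (m/s)": 6.94, "Power (kW)": 809.3, "Unmapped column": 2.0},
]
column_mapping = {"Power (kW)": "power", "Wind speed (m/s)": "wind_speed"}

farmnet_rows = select_and_rename(raw_rows, column_mapping)
print(farmnet_rows[0])
```

With pandas, the same select-and-rename idea is commonly written as df[list(column_mapping)].rename(columns=column_mapping), which also returns a new DataFrame.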