Preprocessing Module

Functions for preprocessing

farmnet.data.preprocessing.add_time_cols(df) → DataFrame

Add time-based columns to a DataFrame indexed by datetime.

This function assumes that the input DataFrame has a DatetimeIndex and adds the following columns:

  • Year: The year extracted from the index.

  • Month: The month name (e.g., “January”), as an ordered categorical variable.

  • Day: The name of the weekday (e.g., “Monday”).

  • Hour: The hour of the day (0–23).

  • Minute: The minute of the hour (0–59).

Parameters:

df (pd.DataFrame) – A pandas DataFrame with a DatetimeIndex.

Returns:

The same DataFrame with additional time-based columns.

Return type:

pd.DataFrame

Raises:

AttributeError – If the DataFrame index does not support datetime attributes.

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import add_time_cols
>>> dates = pd.date_range("2024-01-01 12:34", periods=2, freq="D")
>>> df = pd.DataFrame({"Value": [10, 20]}, index=dates)
>>> df = add_time_cols(df)
>>> df["Year"].tolist()
[2024, 2024]
>>> df["Month"].tolist()
['January', 'January']
>>> df["Day"].tolist()
['Monday', 'Tuesday']
>>> df["Hour"].tolist()
[12, 12]
>>> df["Minute"].tolist()
[34, 34]

If the DataFrame does not have a DatetimeIndex, an AttributeError is raised:

>>> df = pd.DataFrame({"Value": [10, 20]}, index=[1, 2])
>>> df = add_time_cols(df)
Traceback (most recent call last):
...
AttributeError: 'Index' object has no attribute 'year'
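
The columns listed for add_time_cols can be derived directly from pandas' DatetimeIndex accessors. The following is a minimal sketch under assumptions, not the library's implementation: the name add_time_cols_sketch, the copy of the input, and the explicit month list are illustrative choices.

```python
import pandas as pd

# English month names in calendar order, used as the ordered categories.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def add_time_cols_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for add_time_cols (assumed behaviour)."""
    out = df.copy()  # assumption: work on a copy rather than in place
    idx = out.index
    # Accessing .year on a non-datetime index raises AttributeError.
    out["Year"] = idx.year
    out["Month"] = pd.Categorical(idx.month_name(), categories=MONTHS, ordered=True)
    out["Day"] = idx.day_name()
    out["Hour"] = idx.hour
    out["Minute"] = idx.minute
    return out
```

Because Month is an ordered categorical, sorting or grouping by it follows calendar order rather than alphabetical order.
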
farmnet.data.preprocessing.compose(*functions: Callable[[DataFrame], DataFrame]) → Callable[[DataFrame], DataFrame]

Composes multiple functions that each take a pandas DataFrame and return a pandas DataFrame.

This function combines a sequence of functions into a single function. The resulting function applies the functions from right to left (i.e., first applying the last function passed and then applying the next one to the result, and so on).

Parameters:

functions (ComposableFunction) – A sequence of functions to be composed. Each function should take a pandas DataFrame and return a pandas DataFrame.

Returns:

A new function that represents the composition of the input functions.

Return type:

ComposableFunction

Raises:

TypeError – If any of the input functions does not accept or return a pandas DataFrame.

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import compose, dropna, filter_constants, square_root
>>> df = pd.DataFrame({
...     'A': [1, 1, 1, 1],
...     'B': [1, 2, 1, 2],
...     'C': [3, 4, None, 3]
... })
>>> composed_func = compose(dropna, filter_constants)
>>> composed_df = composed_func(df)
>>> print(composed_df)
   B
0  1
1  2
2  1
3  2

If a function in the chain does not return a DataFrame, a TypeError is raised:

>>> composed_func = compose(square_root, dropna)
>>> composed_df = composed_func(df)
Traceback (most recent call last):
...
TypeError: Series.dropna() got an unexpected keyword argument 'thresh'
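
The right-to-left application order described for compose can be illustrated with a small stand-alone sketch. The name compose_sketch and the two helper functions are hypothetical; this only shows the ordering stated in the description and is not the library's code.

```python
from typing import Callable

import pandas as pd

FrameFunc = Callable[[pd.DataFrame], pd.DataFrame]


def compose_sketch(*functions: FrameFunc) -> FrameFunc:
    """Hypothetical right-to-left composition of DataFrame functions."""

    def composed(df: pd.DataFrame) -> pd.DataFrame:
        # The last function passed is applied first.
        for func in reversed(functions):
            df = func(df)
        return df

    return composed


def add_one(df: pd.DataFrame) -> pd.DataFrame:
    return df + 1


def double(df: pd.DataFrame) -> pd.DataFrame:
    return df * 2


# Right to left: double runs first, then add_one, so 1 becomes 1 * 2 + 1.
pipeline = compose_sketch(add_one, double)
result = pipeline(pd.DataFrame({"x": [1]}))
```

With this ordering, compose_sketch(f, g)(df) is equivalent to f(g(df)).
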
farmnet.data.preprocessing.dropna(df, axis=1, thresh=1.0) → DataFrame

Drop rows or columns whose share of non-missing values falls below a fractional threshold.

Extends pandas’ dropna() by accepting a fractional threshold (0.0 to 1.0) instead of an absolute count of non-NA values. The threshold is interpreted as a fraction of the axis length, so thresh=0.5 keeps a row or column only if at least half of its values are non-NA.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame to process

  • axis (int or str, optional) – Axis to drop missing values from (0 or ‘index’ for rows, 1 or ‘columns’ for columns), defaults to 1

  • thresh (float, optional) – Minimum fraction of non-NA values required (0.0 to 1.0), defaults to 1.0

Returns:

DataFrame with NA-containing rows/columns dropped according to threshold

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import dropna
>>> df = pd.DataFrame({'A': [1, 2, None], 'B': [None, None, None], 'C': [4, 5, 6]})
>>> df
     A     B  C
0  1.0  None  4
1  2.0  None  5
2  NaN  None  6
>>> dropna(df, axis=1, thresh=0.5)
     A  C
0  1.0  4
1  2.0  5
2  NaN  6
>>> dropna(df, axis=1, thresh=1)
   C
0  4
1  5
2  6
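
One way to turn the fractional threshold into the absolute count that pandas expects is sketched below. The name dropna_sketch and the choice to round the count up are assumptions made for illustration; the actual rounding used by the library is not stated in this reference.

```python
import math

import pandas as pd


def dropna_sketch(df: pd.DataFrame, axis=1, thresh: float = 1.0) -> pd.DataFrame:
    """Hypothetical fractional-threshold wrapper around DataFrame.dropna."""
    axis_num = 0 if axis in (0, "index") else 1
    # Dropping columns (axis=1) inspects each column, which has one entry
    # per row; dropping rows (axis=0) inspects one entry per column.
    length = df.shape[1 - axis_num]
    # Assumption: round the required non-NA count up.
    min_count = math.ceil(thresh * length)
    return df.dropna(axis=axis_num, thresh=min_count)


df = pd.DataFrame({"A": [1, 2, None], "B": [None, None, None], "C": [4, 5, 6]})
half = dropna_sketch(df, axis=1, thresh=0.5)   # keeps columns A and C
full = dropna_sketch(df, axis=1, thresh=1.0)   # keeps only column C
```
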
farmnet.data.preprocessing.filter_bin(data, ws_col: str, p_col: str, sigma: float, cut_in: float, cut_out: float, bins=40) → DataFrame

Filter data points based on wind speed binning and power standard deviation.

This function:

  1. Bins the data by wind speed into specified ranges.

  2. Calculates the mean and standard deviation of power for each bin.

  3. Filters out points where power deviates more than sigma standard deviations from the bin mean.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame containing wind speed and power data

  • ws_col (str) – Name of the column containing wind speed values

  • p_col (str) – Name of the column containing power values

  • sigma (float) – Number of standard deviations to use as threshold for filtering

  • cut_in (float) – Minimum wind speed to consider (lower bound of first bin)

  • cut_out (float) – Maximum wind speed to consider (upper bound of last bin)

  • bins (int, optional) – Number of bins to create between cut_in and cut_out, defaults to 40

Returns:

Filtered DataFrame containing only inlier points

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import filter_bin
>>> data = pd.DataFrame({
...     'wind_speed': [3.0, 4.5, 5.0, 5.5, 25.0],
...     'power': [100, 500, 600, 620, 50]
... })
>>> filtered = filter_bin(data, 'wind_speed', 'power', 2.0, 4.0, 25.0, bins=5)
>>> filtered
   wind_speed  power
1         4.5    500
2         5.0    600
3         5.5    620
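
The three steps of filter_bin (bin, compute per-bin statistics, keep inliers) can be sketched as follows. This is an illustrative sketch under assumptions, not the library's implementation: the name filter_bin_sketch, equal-width bins that exclude cut_in and include cut_out, and the use of the sample standard deviation are all choices made here.

```python
import numpy as np
import pandas as pd


def filter_bin_sketch(data, ws_col, p_col, sigma, cut_in, cut_out, bins=40):
    """Hypothetical sigma filter per wind-speed bin."""
    ws = data[ws_col]
    # Assumption: only points in (cut_in, cut_out] are binned at all.
    in_range = data[(ws > cut_in) & (ws <= cut_out)]
    # Step 1: equal-width bin edges between cut_in and cut_out.
    edges = np.linspace(cut_in, cut_out, bins + 1)
    codes = pd.cut(in_range[ws_col], edges, labels=False)
    # Step 2: per-bin mean and sample standard deviation of power.
    grouped = in_range[p_col].groupby(codes)
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    # Step 3: keep points within sigma standard deviations of the bin mean.
    # Bins with a single point have an undefined std and are dropped.
    inlier = (in_range[p_col] - mean).abs() <= sigma * std
    return in_range[inlier]


data = pd.DataFrame({
    "wind_speed": [3.0, 4.5, 5.0, 5.5, 25.0],
    "power": [100, 500, 600, 620, 50],
})
filtered = filter_bin_sketch(data, "wind_speed", "power", 2.0, 4.0, 25.0, bins=5)
```

Under these assumptions the point below cut_in and the isolated point at cut_out are both removed, leaving the three points that share the first bin.
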
farmnet.data.preprocessing.filter_constants(df) → DataFrame

Filter out constant or near-constant columns from a DataFrame.

This function identifies and removes columns where the standard deviation is extremely small relative to the magnitude of values (less than 1e-6), indicating the column contains nearly constant values.

Parameters:

df (pandas.DataFrame) – Input DataFrame to process

Returns:

DataFrame with constant/near-constant columns removed

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import filter_constants
>>> df = pd.DataFrame({
...     'A': [1, 1, 1, 1],
...     'B': [1, 2, 1, 2],
...     'C': [3, 3, 3, 3]
... })
>>> filtered = filter_constants(df)
>>> print(filtered.columns)
Index(['B'], dtype='object')

Note

A column is considered constant when the ratio of its standard deviation to the square root of its values is less than 1e-6.

Warning

An all-zero column such as [0, 0, 0, 0] is not considered constant and is therefore kept.
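
The relative-spread test and the all-zero caveat can be reproduced with a short sketch. It assumes that the denominator is the square root of the sum of squared values of each column; that reading, the name filter_constants_sketch, and the tol parameter are assumptions, since this reference does not spell out the exact formula.

```python
import numpy as np
import pandas as pd


def filter_constants_sketch(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Hypothetical near-constant column filter (assumed formula)."""
    std = df.std()
    # Assumption: scale by the Euclidean norm of each column.
    norm = np.sqrt((df ** 2).sum())
    ratio = std / norm
    # For an all-zero column the ratio is 0/0 = NaN; NaN < tol is False,
    # so the column is not flagged as constant and survives.
    is_constant = ratio < tol
    return df.loc[:, ~is_constant]


df = pd.DataFrame({
    "A": [1, 1, 1, 1],
    "B": [1, 2, 1, 2],
    "Z": [0, 0, 0, 0],
})
kept = filter_constants_sketch(df)
```

Under this assumption the behaviour in the warning falls out of the arithmetic: an undefined ratio never compares as smaller than the tolerance.
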

farmnet.data.preprocessing.filter_corr(df, thresh: float = 0.95, pre_choice=None) → DataFrame

Filters columns of a DataFrame based on a correlation threshold.

This function removes highly correlated columns from a pandas DataFrame. It computes the correlation matrix and removes columns whose absolute correlation with another column exceeds the specified threshold. Columns listed in pre_choice are always retained, even if they are highly correlated with other columns.

Parameters:
  • df (pd.DataFrame) – The input pandas DataFrame to be filtered.

  • thresh (float, optional) – The correlation threshold above which columns are considered highly correlated and will be discarded (default is 0.95).

  • pre_choice (list, optional) – A list of columns to always retain in the output DataFrame. If None, no columns are pre-selected (default is None).

Returns:

A pandas DataFrame with the filtered columns.

Return type:

pd.DataFrame

Raises:

ValueError – If df is not a pandas DataFrame.

Example

>>> import pandas as pd
>>> import numpy as np
>>> from farmnet.data.preprocessing import filter_corr
>>> df = pd.DataFrame({
...     'A': np.random.randn(100),
...     'B': np.random.randn(100),
...     'C': np.random.randn(100),
...     'D': np.random.randn(100)
... })
>>> df['B'] = df['A'] * 0.9 + df['B'] * 0.1  # Create high correlation between A and B
>>> df['C'] = df['A'] * 0.95 + df['C'] * 0.05  # High correlation between A and C
>>> filtered_df = filter_corr(df, thresh=0.9, pre_choice=['B'])
>>> filtered_df.columns
Index(['A', 'B', 'D'], dtype='object')
>>> filtered_df = filter_corr(df, thresh=0.9)
>>> filtered_df.columns
Index(['A', 'D'], dtype='object')
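
A common way to implement this kind of filter is to scan the upper triangle of the absolute correlation matrix and drop every column that is highly correlated with an earlier one. The sketch below follows that recipe; it is an assumption about the approach, not the library's code, and the name filter_corr_sketch and the seeded test data are illustrative.

```python
import numpy as np
import pandas as pd


def filter_corr_sketch(df: pd.DataFrame, thresh: float = 0.95, pre_choice=None):
    """Hypothetical correlation filter using the upper-triangle recipe."""
    corr = df.corr().abs()
    # Keep only entries above the diagonal so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates strongly with an earlier column.
    to_drop = {col for col in upper.columns if (upper[col] > thresh).any()}
    # Columns named in pre_choice are always retained.
    to_drop -= set(pre_choice or [])
    return df.drop(columns=[col for col in df.columns if col in to_drop])


rng = np.random.default_rng(0)  # seeded so the example is reproducible
a = rng.standard_normal(100)
df = pd.DataFrame({
    "A": a,
    "B": 0.9 * a + 0.1 * rng.standard_normal(100),
    "C": 0.95 * a + 0.05 * rng.standard_normal(100),
    "D": rng.standard_normal(100),
})
plain = filter_corr_sketch(df, thresh=0.9)
with_choice = filter_corr_sketch(df, thresh=0.9, pre_choice=["B"])
```

Seeding the generator avoids the run-to-run variation that unseeded np.random.randn data would introduce into the correlation values.
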
farmnet.data.preprocessing.filter_power(df, col: str, rated: float, thresh: float = 0.1) → DataFrame

Filter DataFrame to keep only rows where power values exceed a threshold fraction of rated power.

Rows whose power is at or below thresh × rated are removed.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing power measurements

  • col (str) – Name of the column containing power values to filter

  • rated (float) – Rated power value (reference value for threshold calculation)

  • thresh (float, optional) – Threshold fraction of rated power (0 to 1), defaults to 0.1

Returns:

Filtered DataFrame containing only rows above the power threshold

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import filter_power
>>> df = pd.DataFrame({'power': [0, 50, 100, 150, 200]})
>>> filtered = filter_power(df, 'power', rated=200, thresh=0.25)
>>> print(filtered)
   power
2    100
3    150
4    200
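
The threshold arithmetic is simple enough to show in a few lines. This is a minimal sketch assuming a strict greater-than comparison, which matches the example above where the row equal to the threshold (50) is dropped; the name filter_power_sketch is hypothetical.

```python
import pandas as pd


def filter_power_sketch(df: pd.DataFrame, col: str, rated: float, thresh: float = 0.1):
    """Hypothetical power filter: keep rows strictly above thresh * rated."""
    cutoff = thresh * rated  # e.g. 0.25 * 200 = 50.0
    return df[df[col] > cutoff]


df = pd.DataFrame({"power": [0, 50, 100, 150, 200]})
kept = filter_power_sketch(df, "power", rated=200, thresh=0.25)
```
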
farmnet.data.preprocessing.filter_regexp(df, regex: str = '^((?!Min).)*[^xn]$') → DataFrame

Filter DataFrame columns using a regular expression.

This function uses the provided regular expression to filter columns by name. By default, it excludes any column name that:

  • Contains the substring Min

  • Ends with the letter x or n

Parameters:
  • df (pd.DataFrame) – The DataFrame whose columns should be filtered.

  • regex (str) – A regular expression pattern to apply to column names.

Returns:

A DataFrame containing only the columns that match the regex pattern.

Return type:

pd.DataFrame

Raises:

re.error – If the regular expression is invalid.

Example

>>> import pandas as pd
>>> import numpy as np
>>> from farmnet.data.preprocessing import filter_regexp
>>> df = pd.DataFrame({
...     "Speed": [10, 20],
...     "MinPower": [1, 2],
...     "Max": [3, 4],
...     "Heightn": [5, 6],
...     "Weight": [7, 8]
... })
>>> filtered_df = filter_regexp(df)
>>> filtered_df.columns.tolist()
['Speed', 'Weight']
>>> filtered_df = filter_regexp(df, r'^M\w*')
>>> filtered_df.columns.tolist()
['MinPower', 'Max']
>>> filtered_df = filter_regexp(df, '(*)abc')
Traceback (most recent call last):
...
re.PatternError: nothing to repeat at position 1
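
To see how the default pattern works, it helps to apply it to column names directly with the standard re module. The sketch below is illustrative only: filter_regexp_sketch is a hypothetical name, and matching with re.search is an assumption about how the pattern is applied.

```python
import re

import pandas as pd

# (?!Min) rejects any position where "Min" starts; [^xn]$ rejects names
# whose final character is "x" or "n".
DEFAULT_REGEX = r"^((?!Min).)*[^xn]$"


def filter_regexp_sketch(df: pd.DataFrame, regex: str = DEFAULT_REGEX) -> pd.DataFrame:
    """Hypothetical column filter based on a regular expression."""
    pattern = re.compile(regex)  # raises re.error for an invalid pattern
    keep = [col for col in df.columns if pattern.search(str(col))]
    return df[keep]


df = pd.DataFrame({
    "Speed": [10, 20],
    "MinPower": [1, 2],
    "Max": [3, 4],
    "Heightn": [5, 6],
    "Weight": [7, 8],
})
default_cols = filter_regexp_sketch(df).columns.tolist()
m_cols = filter_regexp_sketch(df, r"^M\w*").columns.tolist()
```
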
farmnet.data.preprocessing.join_frames(frames: list[DataFrame]) → DataFrame

Joins a list of pandas DataFrames using an outer join on their indices.

The function performs an outer join on the list of DataFrames, merging them on their indices. Duplicate indices in the final result are removed.

Parameters:

frames (list[pd.DataFrame]) – List of pandas DataFrames to be joined.

Returns:

A single DataFrame resulting from the outer join of all input DataFrames, with duplicated indices removed.

Return type:

pd.DataFrame

Raises:

IndexError – If the input list is empty.

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import join_frames
>>> df1 = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
>>> df2 = pd.DataFrame({"B": [3, 4]}, index=["b", "c"])
>>> df3 = pd.DataFrame({"C": [5]}, index=["a"])
>>> join_frames([df1, df2, df3])
     A    B    C
a  1.0  NaN  5.0
b  2.0  3.0  NaN
c  NaN  4.0  NaN
>>> join_frames([])
Traceback (most recent call last):
...
IndexError: list index out of range
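
An outer join over a list of frames can be built by folding DataFrame.join across the list. The sketch below is one possible reading of the description, not the library's implementation: the name join_frames_sketch is hypothetical, and keeping the first of any duplicated index labels is an assumption.

```python
from functools import reduce

import pandas as pd


def join_frames_sketch(frames: list) -> pd.DataFrame:
    """Hypothetical outer join of several frames on their indices."""
    # frames[0] raises IndexError when the list is empty.
    first, rest = frames[0], frames[1:]
    joined = reduce(lambda left, right: left.join(right, how="outer"), rest, first)
    # Assumption: keep the first row for any duplicated index label.
    return joined[~joined.index.duplicated(keep="first")]


df1 = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"B": [3, 4]}, index=["b", "c"])
df3 = pd.DataFrame({"C": [5]}, index=["a"])
joined = join_frames_sketch([df1, df2, df3])
```
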
farmnet.data.preprocessing.log_process(func)

farmnet.data.preprocessing.remove_interval(df: DataFrame, timestamps: list[str | Timestamp], delta: str | Timedelta = '30D')

Removes rows from a DataFrame that fall within a specified time interval around given timestamps.

For each timestamp t in timestamps, rows whose index lies between t - delta and t + delta are removed. As the example below shows, rows that fall exactly on either end of the window are removed as well.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame from which rows are to be removed.

  • timestamps (list[str | pd.Timestamp]) – A list of timestamps around which to remove rows. Each timestamp will have a time window of size delta before and after it.

  • delta (str | pd.Timedelta, optional) – The time window size (default is “30D”, i.e., 30 days). The window is applied symmetrically before and after each timestamp.

Returns:

A new DataFrame with rows removed within the specified time intervals.

Return type:

pd.DataFrame

Raises:

ValueError – If df does not have a datetime index.

Example

>>> import pandas as pd
>>> from farmnet.data.preprocessing import remove_interval
>>> df = pd.DataFrame({
...     'value': range(10)
... }, index=pd.date_range('2025-01-01', periods=10, freq='D'))
>>> df
            value
2025-01-01      0
2025-01-02      1
2025-01-03      2
2025-01-04      3
2025-01-05      4
2025-01-06      5
2025-01-07      6
2025-01-08      7
2025-01-09      8
2025-01-10      9
>>> remove_interval(df, timestamps=["2025-01-05"], delta="2D")
            value
2025-01-01      0
2025-01-02      1
2025-01-08      7
2025-01-09      8
2025-01-10      9
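
The symmetric window can be expressed as a boolean mask over the index. The following sketch assumes inclusive window bounds, which is consistent with the example above; the name remove_interval_sketch and the exact error message are hypothetical.

```python
import numpy as np
import pandas as pd


def remove_interval_sketch(df: pd.DataFrame, timestamps, delta="30D") -> pd.DataFrame:
    """Hypothetical removal of rows within delta of each timestamp."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("df must have a datetime index")
    window = pd.Timedelta(delta)
    keep = np.ones(len(df), dtype=bool)
    for ts in timestamps:
        center = pd.Timestamp(ts)
        # Assumption: both ends of the window are inclusive.
        inside = (df.index >= center - window) & (df.index <= center + window)
        keep &= ~inside
    return df[keep]


df = pd.DataFrame(
    {"value": range(10)},
    index=pd.date_range("2025-01-01", periods=10, freq="D"),
)
trimmed = remove_interval_sketch(df, timestamps=["2025-01-05"], delta="2D")
```
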
farmnet.data.preprocessing.resample(df, period: str = 'D') → DataFrame

Resample time series data to specified frequency and compute mean values.

This function resamples a time-indexed DataFrame to a new frequency and calculates the mean of each numeric column for each new time period. The input DataFrame must have a DatetimeIndex.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with DatetimeIndex to be resampled

  • period (str, optional) – Resampling frequency string, defaults to “D” (daily). Common options:

      - “h” for hourly
      - “D” for daily
      - “W” for weekly
      - “M” for monthly
      - “Q” for quarterly
      - “A” or “Y” for yearly

    In pandas 2.2 and later, “M”, “Q”, “A”, and “Y” are deprecated in favor of “ME”, “QE”, and “YE”.

Returns:

Resampled DataFrame with mean values for each period

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> from farmnet.data.preprocessing import resample
>>> # Create sample time series data
>>> date_rng = pd.date_range(start='1/1/2020', end='1/10/2020', freq='h')
>>> df = pd.DataFrame(date_rng, columns=['date'])
>>> df['data'] = np.random.randn(len(date_rng))
>>> df = df.set_index('date')
>>>
>>> # Resample to daily means
>>> daily_means = resample(df)
>>> print(len(daily_means))
10
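
In plain pandas, resampling to period means is a one-line operation, which the sketch below wraps in a function. The name resample_sketch is hypothetical and the body is an assumption about the behaviour described, not the library's code.

```python
import numpy as np
import pandas as pd


def resample_sketch(df: pd.DataFrame, period: str = "D") -> pd.DataFrame:
    """Hypothetical mean resampling of a time-indexed DataFrame."""
    return df.resample(period).mean()


# 217 hourly samples from 2020-01-01 00:00 to 2020-01-10 00:00 inclusive.
index = pd.date_range("2020-01-01", periods=217, freq="60min")
df = pd.DataFrame({"data": np.arange(217, dtype=float)}, index=index)
daily = resample_sketch(df)
```

The final timestamp falls at midnight on the tenth day, so it forms a bin of its own and the result has ten rows rather than nine.
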
farmnet.data.preprocessing.square_root(x)

Compute the square root of the sum of squares of an array.

Given an array x, this function calculates:

\[\sqrt{\sum_{i} x_i^2}\]

Parameters:

x (array_like) – Input array (any shape) for which to compute the square root of the sum of squares.

Returns:

Euclidean norm of the input array.

Return type:

float

Example

>>> import numpy as np
>>> from farmnet.data.preprocessing import square_root
>>> square_root([3, 4])
np.float64(5.0)
>>> square_root(np.array([1, 1, 1, 1]))
np.float64(2.0)
>>> square_root([0])
np.float64(0.0)
>>> round(square_root([1, 2, 3]), 4)
np.float64(3.7417)
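
The formula above translates directly into NumPy. This is a minimal sketch with a hypothetical name, square_root_sketch; converting the input to a float array is an assumption made so that plain lists are accepted.

```python
import numpy as np


def square_root_sketch(x) -> float:
    """Hypothetical square root of the sum of squares over all elements."""
    arr = np.asarray(x, dtype=float)
    # Sum over every element regardless of shape, then take the root.
    return float(np.sqrt(np.sum(np.square(arr))))


norm_34 = square_root_sketch([3, 4])
norm_grid = square_root_sketch(np.ones((2, 2)))
```

For a one-dimensional input this agrees with np.linalg.norm; for higher-dimensional input it agrees with the norm of the flattened array.
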