Datasets utils Module

Functions for handling PyTorch Geometric datasets.

farmnet.data.datasets.utils.dataset_sample(dataset: InMemoryDataset, sample_size: int) → InMemoryDataset

Randomly samples a subset from the given dataset without replacement.

Parameters:
  • dataset – The input dataset to sample from, an instance of InMemoryDataset.

  • sample_size – The number of samples to select.

Returns:

A new InMemoryDataset containing the randomly selected samples.

Raises:

IndexError – If the sample size is greater than the dataset length.
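
The implementation is not reproduced here, but the copy(idx) hook stubbed out in the example below suggests the function draws indices and restricts the dataset to them. A minimal sketch under that assumption (the helper name and the use of numpy.random.choice are illustrative, not the library's actual code):

import numpy as np
from torch_geometric.data import InMemoryDataset

def dataset_sample_sketch(dataset: InMemoryDataset, sample_size: int) -> InMemoryDataset:
    # Illustrative only: mirrors the documented behaviour of dataset_sample.
    if sample_size > len(dataset):
        # Matches the documented IndexError for oversized sample requests.
        raise IndexError(f"sample_size {sample_size} exceeds dataset length {len(dataset)}")
    # Draw unique indices without replacement, then restrict the dataset to them.
    idx = np.random.choice(len(dataset), size=sample_size, replace=False)
    return dataset.copy(idx)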

Example

>>> import numpy as np
>>> from torch_geometric.data import InMemoryDataset
>>> from farmnet.data.datasets.utils import dataset_sample
>>> class DummyDataset(InMemoryDataset):
...     def __init__(self, length):
...         super().__init__()
...         self.data_list = [i for i in range(length)]
...     def __len__(self):
...         return len(self.data_list)
...     def copy(self, idx):
...         new_ds = DummyDataset(0)
...         new_ds.data_list = [self.data_list[i] for i in idx]
...         return new_ds
>>> np.random.seed(42)
>>> dataset = DummyDataset(10)
>>> sampled_dataset = dataset_sample(dataset, 5)
>>> len(sampled_dataset)
5
>>> all(item in dataset.data_list for item in sampled_dataset.data_list)
True

Warning

TODO replace DummyDataset with Kelmarsh

farmnet.data.datasets.utils.load_dataset(path: str | Path) → InMemoryDataset

Loads the KelmarshDataset from the specified path with predefined features and target.

Parameters:

path – The path to the dataset directory as a string or Path object.

Returns:

An instance of InMemoryDataset containing the loaded data.

Example

# >>> from farmnet.data.datasets.utils import load_dataset
# >>> from torch_geometric.data import InMemoryDataset
# >>> from pathlib import Path
# >>> # Assuming KelmarshDataset is correctly defined and available
# >>> dataset = load_dataset(Path("examples"))
# >>> isinstance(dataset, InMemoryDataset)
# True
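
Since this doctest depends on a local copy of the Kelmarsh data, a hedged usage sketch pairing load_dataset with train_test_split (documented below) might look as follows; the "examples" path is taken from the skipped example above and is illustrative only:

# >>> from pathlib import Path
# >>> from farmnet.data.datasets.utils import load_dataset, train_test_split
# >>> dataset = load_dataset(Path("examples"))
# >>> train_ds, test_ds = train_test_split(dataset, test_size=0.2, seed=0)
# >>> len(train_ds) + len(test_ds) == len(dataset)
# True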

farmnet.data.datasets.utils.train_test_split(dataset: InMemoryDataset, test_size: float = 0.2, seed: int = 0) → tuple[Any, Any]

Splits a dataset into training and testing subsets.

Parameters:
  • dataset – The input dataset to split, an instance of InMemoryDataset.

  • test_size – The proportion of the dataset to include in the test split (default is 0.2).

  • seed – Random seed for reproducibility (default is 0).

Returns:

A tuple containing two datasets (train_dataset, test_dataset).
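
The implementation is not shown, but the index_select(idx) hook stubbed out in the example below suggests the split is built from a seeded permutation of indices. A minimal sketch under that assumption (the helper name and the use of numpy.random.default_rng are illustrative):

import numpy as np
from torch_geometric.data import InMemoryDataset

def train_test_split_sketch(dataset: InMemoryDataset, test_size: float = 0.2, seed: int = 0):
    # Illustrative only: split by a seeded permutation of indices.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(dataset))
    n_test = int(round(test_size * len(dataset)))
    # index_select(idx) returns a new dataset restricted to the given indices.
    return dataset.index_select(idx[n_test:]), dataset.index_select(idx[:n_test])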

Example

>>> import numpy as np
>>> from torch_geometric.data import InMemoryDataset
>>> from farmnet.data.datasets.utils import train_test_split
>>> class DummyDataset(InMemoryDataset):
...     def __init__(self, length):
...         super().__init__()
...         self.data_list = [i for i in range(length)]
...     def __len__(self):
...         return len(self.data_list)
...     def index_select(self, idx):
...         new_ds = DummyDataset(0)
...         new_ds.data_list = [self.data_list[i] for i in idx]
...         return new_ds
>>> np.random.seed(0)
>>> dataset = DummyDataset(10)
>>> train_ds, test_ds = train_test_split(dataset, test_size=0.3, seed=42)
>>> len(train_ds)
7
>>> len(test_ds)
3

Warning

TODO replace DummyDataset with Kelmarsh