Datasets utils Module
Functions for handling PyTorch Geometric datasets.
- farmnet.data.datasets.utils.dataset_sample(dataset: InMemoryDataset, sample_size: int) InMemoryDataset[source]
Randomly samples a subset from the given dataset without replacement.
- Parameters:
dataset – The input dataset to sample from, an instance of InMemoryDataset.
sample_size – The number of samples to select.
- Returns:
A new InMemoryDataset containing the randomly selected samples.
- Raises:
IndexError – If the sample size is greater than the dataset length.
Example
>>> import numpy as np
>>> from torch_geometric.data import InMemoryDataset
>>> class DummyDataset(InMemoryDataset):
...     def __init__(self, length):
...         super().__init__()
...         self.data_list = [i for i in range(length)]
...     def __len__(self):
...         return len(self.data_list)
...     def copy(self, idx):
...         new_ds = DummyDataset(0)
...         new_ds.data_list = [self.data_list[i] for i in idx]
...         return new_ds
>>> np.random.seed(42)
>>> dataset = DummyDataset(10)
>>> sampled_dataset = dataset_sample(dataset, 5)
>>> len(sampled_dataset)
5
>>> all(item in dataset.data_list for item in sampled_dataset.data_list)
True
Warning
TODO replace DummyDataset with Kelmarsh
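The behavior documented above (seeded NumPy sampling without replacement, raising IndexError when the request exceeds the dataset length, and returning a subset via the dataset's copy(idx) method) can be sketched as follows. This is a minimal illustration, not necessarily the actual implementation in farmnet:

```python
import numpy as np


def dataset_sample(dataset, sample_size):
    """Sketch: randomly select `sample_size` items without replacement.

    Assumes `dataset` supports ``len()`` and a ``copy(idx)`` method that
    returns a new dataset restricted to the given indices, as
    ``torch_geometric.data.InMemoryDataset`` does.
    """
    if sample_size > len(dataset):
        raise IndexError("sample_size is greater than the dataset length")
    # Draw `sample_size` distinct indices from [0, len(dataset)).
    idx = np.random.choice(len(dataset), size=sample_size, replace=False)
    return dataset.copy(idx)
```

Because the sampling goes through NumPy's global random state, calling `np.random.seed(...)` beforehand (as in the example above) makes the selection reproducible.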
- farmnet.data.datasets.utils.load_dataset(path: str | Path) InMemoryDataset[source]
Loads the KelmarshDataset from the specified path with predefined features and target.
- Parameters:
path – The path to the dataset directory as a string or Path object.
- Returns:
An instance of InMemoryDataset containing the loaded data.
Example
# >>> from farmnet.data.datasets.
# >>> from pathlib import Path
# >>> # Assuming KelmarshDataset is correctly defined and available
# >>> dataset = load_dataset(Path("examples"))
# >>> isinstance(dataset, InMemoryDataset)
# True
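Since the signature accepts `str | Path`, callers can pass either form. A minimal sketch of the input normalization such a function typically performs; the name `load_dataset_sketch` is hypothetical, and the actual `KelmarshDataset` construction, its predefined features, and its target are not shown in these docs:

```python
from pathlib import Path


def load_dataset_sketch(path):
    """Sketch of load_dataset's input handling: accept str | Path alike."""
    # Path(Path(...)) is a no-op, so both argument types normalize cleanly.
    root = Path(path)
    # The real function would now construct the dataset, e.g. (hypothetical):
    # return KelmarshDataset(root=root, features=..., target=...)
    return root
```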
- farmnet.data.datasets.utils.train_test_split(dataset: InMemoryDataset, test_size: float = 0.2, seed: int = 0) tuple[Any, Any][source]
Splits a dataset into training and testing subsets.
- Parameters:
dataset – The input dataset to split, an instance of InMemoryDataset.
test_size – The proportion of the dataset to include in the test split (default is 0.2).
seed – Random seed for reproducibility (default is 0).
- Returns:
A tuple containing two datasets (train_dataset, test_dataset).
Example
>>> import numpy as np
>>> from torch_geometric.data import InMemoryDataset
>>> class DummyDataset(InMemoryDataset):
...     def __init__(self, length):
...         super().__init__()
...         self.data_list = [i for i in range(length)]
...     def __len__(self):
...         return len(self.data_list)
...     def index_select(self, idx):
...         new_ds = DummyDataset(0)
...         new_ds.data_list = [self.data_list[i] for i in idx]
...         return new_ds
>>> np.random.seed(0)
>>> dataset = DummyDataset(10)
>>> train_ds, test_ds = train_test_split(dataset, test_size=0.3, seed=42)
>>> len(train_ds)
7
>>> len(test_ds)
3
Warning
TODO replace DummyDataset with Kelmarsh
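The split documented above (a seeded shuffle of indices, with `test_size` as a proportion and subsets built via the dataset's index_select(idx) method) can be sketched as follows. This is an illustrative implementation under those assumptions, not necessarily farmnet's exact code:

```python
import numpy as np


def train_test_split(dataset, test_size=0.2, seed=0):
    """Sketch: split a dataset into (train, test) subsets.

    Assumes `dataset` supports ``len()`` and an ``index_select(idx)``
    method returning the subset at the given indices, as
    ``torch_geometric.data.InMemoryDataset`` does.
    """
    rng = np.random.default_rng(seed)
    n = len(dataset)
    n_test = int(n * test_size)          # proportion -> count (floor)
    perm = rng.permutation(n)            # seeded shuffle of all indices
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return dataset.index_select(train_idx), dataset.index_select(test_idx)
```

With `test_size=0.3` on a 10-item dataset this yields subsets of 7 and 3 items, matching the example; the same `seed` always reproduces the same partition.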