vak.prep.parametric_umap.parametric_umap.prep_parametric_umap_dataset
- vak.prep.parametric_umap.parametric_umap.prep_parametric_umap_dataset(data_dir: str | Path, purpose: str, output_dir: str | Path | None = None, audio_format: str | None = None, spect_params: dict | None = None, annot_format: str | None = None, annot_file: str | Path | None = None, labelset: set | None = None, context_s: float = 0.015, train_dur: int | None = None, val_dur: int | None = None, test_dur: int | None = None, train_set_durs: list[float] | None = None, num_replicates: int | None = None, spect_key: str = 's', timebins_key: str = 't')
Prepare datasets for neural network models that perform a dimensionality reduction task.
For general information on dataset preparation, see the docstring for
vak.prep.prep().
- Parameters:
data_dir (str, Path) – Path to directory with files from which to make dataset.
purpose (str) – Purpose of the dataset. One of {"train", "eval", "predict", "learncurve"}. These correspond to commands of the vak command-line interface.
output_dir (str) – Path to location where datasets should be saved. Default is None, in which case it defaults to data_dir.
audio_format (str) – Format of audio files. One of {"wav", "cbin"}. Default is None, but either audio_format or spect_format must be specified.
spect_params (dict, vak.config.SpectParams) – Parameters for creating spectrograms. Default is None.
annot_format (str) – Format of annotations. Any format that can be used with the crowsetta library is valid. Default is None.
labelset (str, list, set) – Set of unique labels for vocalizations. Strings or integers. Default is None. If not None, then any file whose annotation contains labels not found in labelset is skipped. labelset is converted to a Python set using vak.converters.labelset_to_set(). See help for that function for details on how to specify labelset.
train_dur (float) – Total duration of training set, in seconds. When creating a learning curve, training subsets of shorter duration will be drawn from this set. Default is None.
val_dur (float) – Total duration of validation set, in seconds. Default is None.
test_dur (float) – Total duration of test set, in seconds. Default is None.
train_set_durs (list) – List of int. Durations, in seconds, of subsets taken from the training data to create a learning curve, e.g. [5, 10, 15, 20].
num_replicates (int) – Number of times to replicate training for each training set duration, to better estimate metrics for a training set of that size. Each replicate uses a different randomly drawn subset of the training data (but of the same duration).
spect_key (str) – Key for accessing spectrogram in files. Default is "s".
timebins_key (str) – Key for accessing vector of time bins in files. Default is "t".
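Two behaviors described in the parameter list can be sketched in plain Python: skipping files whose annotations contain labels outside labelset, and pairing each entry of train_set_durs with num_replicates replicates. This is only an illustration under stated assumptions, not vak's implementation. The helper names, the file names, and the 1-based replicate numbering are all hypothetical.

```python
from itertools import product


def has_only_known_labels(annot_labels, labelset):
    """Return True if every label in one file's annotation is in labelset."""
    return set(annot_labels) <= set(labelset)


def learncurve_plan(train_dur, train_set_durs, num_replicates):
    """List the (duration, replicate) pairs a learning curve would need."""
    # Subsets are drawn from the training set, so none can be longer than it.
    too_long = [dur for dur in train_set_durs if dur > train_dur]
    if too_long:
        raise ValueError(
            f"subset durations {too_long} exceed train_dur={train_dur}"
        )
    # Replicate numbers start at 1 here; that is an illustrative choice.
    return list(product(train_set_durs, range(1, num_replicates + 1)))


# Hypothetical annotations: file name -> labels found in that file.
annots = {"bird1.wav": ["a", "b"], "bird2.wav": ["a", "x"]}
kept = [
    name
    for name, labels in annots.items()
    if has_only_known_labels(labels, labelset={"a", "b", "c"})
]

plan = learncurve_plan(train_dur=30, train_set_durs=[5, 10], num_replicates=2)
```

Here bird2.wav is dropped because its annotation contains the label "x", and plan holds one entry per subset duration and replicate, so its length is len(train_set_durs) * num_replicates.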
- Returns:
dataset_df (pandas.DataFrame) – DataFrame that represents the dataset.
dataset_path (pathlib.Path) – Path to the csv file saved from dataset_df.
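A call to this function might be assembled as in the sketch below. The parameter names and the purpose values come from the signature and parameter list above. The helper functions, the directory path, the label values, and the durations are hypothetical. Unpacking the result as a (dataset_df, dataset_path) pair is an assumption based on the Returns section. spect_params is left at its default here, although a real call that starts from audio files may need it.

```python
from pathlib import Path

# Values listed for the purpose parameter.
VALID_PURPOSES = {"train", "eval", "predict", "learncurve"}


def build_prep_kwargs(data_dir, purpose, **overrides):
    """Collect keyword arguments for prep_parametric_umap_dataset (hypothetical helper)."""
    if purpose not in VALID_PURPOSES:
        raise ValueError(f"purpose must be one of {sorted(VALID_PURPOSES)}")
    kwargs = {
        "data_dir": Path(data_dir),
        "purpose": purpose,
        "output_dir": None,  # None falls back to data_dir, per the docs above
        "spect_key": "s",
        "timebins_key": "t",
    }
    kwargs.update(overrides)
    return kwargs


def run_prep(kwargs):
    """Call vak with the collected arguments (requires vak to be installed)."""
    # Imported lazily so the rest of this sketch works without vak.
    from vak.prep.parametric_umap.parametric_umap import (
        prep_parametric_umap_dataset,
    )

    # Assumption: the two documented return values come back as a pair.
    dataset_df, dataset_path = prep_parametric_umap_dataset(**kwargs)
    return dataset_df, dataset_path


kwargs = build_prep_kwargs(
    "path/to/data",  # hypothetical directory of audio and annotation files
    "train",
    audio_format="cbin",
    annot_format="notmat",
    labelset={"a", "b", "c"},  # hypothetical labels
    train_dur=50,
    val_dur=15,
    test_dur=30,
)
```

Passing kwargs to run_prep would then perform the actual preparation. Keeping the argument assembly separate from the call makes it possible to check the configuration before any files are touched.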