vak.prep.frame_classification.make_splits.make_splits
- vak.prep.frame_classification.make_splits.make_splits(dataset_df: DataFrame, dataset_path: str | Path, input_type: str, purpose: str, labelmap: dict, audio_format: str | None = None, spect_key: str = 's', timebins_key: str = 't', freqbins_key: str = 'f') -> DataFrame
Make each split of a frame classification dataset.
This function takes a pandas.DataFrame returned by vak.prep.spectrogram_dataset.prep_spectrogram_dataset() or vak.prep.audio_dataset.prep_audio_dataset(), after it has been assigned a ‘split’ column, and then copies, moves, or generates the required files as appropriate for each split.
For each unique ‘split’ in the
pandas.DataFrame, a directory is made inside dataset_path. At a high level, all files needed for working with that split will be in that directory. E.g., the train directory inside dataset_path would have all the files for every row in dataset_df for which dataset_df['split'] == 'train'.
The inputs to the neural network model are moved or copied into the split directory, or generated if necessary. If the
input_type is ‘audio’, then the audio files are copied from their original directory. If the input_type is ‘spect’, and the spectrogram files are already in dataset_path, they are moved into the split directory (under the assumption they were generated by vak.prep.spectrogram_dataset.audio_helper). If they are npz files, but they are not in dataset_path, then they are validated to make sure they have the appropriate keys, and then copied into the split directory. This could be the case if the files were generated by another program. If they are mat files, they will be converted to npz with the default keys for arrays, and then saved in a new npz file in the split directory. This step is required so that all datasets prepared by vak are in a “normalized” or “canonicalized” format.
In addition to copying or moving the audio or spectrogram files that are inputs to the neural network model, other npy files are made for each split and saved in the corresponding directory. This function creates one npy file for each row in
dataset_df. It has the extension ‘.frame_labels.npy’, and contains a vector where each element is the target label that the network should predict for the corresponding frame. Taken together, the audio or spectrogram file in each row along with its corresponding frame labels are the data for each sample \((x, y)\) in the dataset, where \(x_t\) supplies the “frames”, and \(y_t\) is the frame labels.
This function also creates two additional npy files for each split. These npy files are “indexing” vectors that are used by
vak.datasets.frame_classification.WindowDataset and vak.datasets.frame_classification.FramesDataset. These vectors make it possible to work with files, to avoid loading the entire dataset into memory, and to avoid working with memory-mapped arrays. The first is the sample_ids vector, which represents the “ID” of any sample \((x, y)\) in the split. We use these IDs to load the array files corresponding to the samples. For a split with \(m\) samples, this will be an array of length \(T\), the total number of frames across all samples, with elements \(i \in (0, 1, ..., m - 1)\) indicating which frames correspond to which sample \(m_i\): \((0, 0, 0, ..., 1, 1, ..., m - 1, m - 1)\). The second vector is the inds_in_sample vector. This vector is the same length as sample_ids, but its values represent the indices of frames within each sample \(x_t\). For a dataset with \(T\) total frames across all samples, where \(t_i\) indicates the number of frames in each \(x_i\), this vector will look like \((0, 1, ..., t_0 - 1, 0, 1, ..., t_1 - 1, ..., t_{m-1} - 1)\).
- Parameters:
dataset_df (pandas.DataFrame) – A pandas.DataFrame returned by vak.io.dataframe.from_files() with a 'split' column added, as a result of calling vak.io.dataframe.from_files() or because it was added “manually” by calling vak.core.prep.prep_helper.add_split_col() (as is done for ‘predict’ when the entire DataFrame belongs to this “split”).
dataset_path (pathlib.Path) – Path to directory that represents the dataset.
input_type (str) – The type of input to the neural network model. One of {‘audio’, ‘spect’}.
purpose (str) – A string indicating what the dataset will be used for. One of {‘train’, ‘eval’, ‘predict’, ‘learncurve’}. Determined by vak.core.prep.prep() using the TOML configuration file.
labelmap (dict) – A dict that maps a set of human-readable string labels to the integer classes predicted by a neural network model. As returned by vak.labels.to_map().
audio_format (str) – A string representing the format of audio files. One of vak.common.constants.VALID_AUDIO_FORMATS.
spect_key (str) – Key for accessing spectrogram in files. Default is ‘s’.
timebins_key (str) – Key for accessing vector of time bins in files. Default is ‘t’.
freqbins_key (str) – Key for accessing vector of frequency bins in files. Default is ‘f’.
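To make the two “indexing” vectors described above more concrete, here is a minimal NumPy sketch. It is not vak's implementation: the helper name make_index_vectors and the per-sample frame counts are hypothetical, chosen only to illustrate how sample_ids and inds_in_sample relate to the number of frames in each sample.

```python
import numpy as np


def make_index_vectors(n_frames_per_sample):
    """Build sample_ids and inds_in_sample from per-sample frame counts.

    Hypothetical helper for illustration only; not part of vak's API.
    """
    n_frames = np.asarray(n_frames_per_sample)
    # sample_ids: repeat each sample's ID once for every frame in that sample
    sample_ids = np.repeat(np.arange(n_frames.size), n_frames)
    # inds_in_sample: 0-based frame indices that restart at 0 for every sample
    inds_in_sample = np.concatenate([np.arange(n) for n in n_frames])
    return sample_ids, inds_in_sample


# Three samples with 3, 2, and 4 frames, so T = 9 total frames
sample_ids, inds_in_sample = make_index_vectors([3, 2, 4])
print(sample_ids)      # [0 0 0 1 1 2 2 2 2]
print(inds_in_sample)  # [0 1 2 0 1 0 1 2 3]
```

Under these assumptions, looking up position k in both vectors tells a dataset class which sample's file to load (sample_ids[k]) and which frame inside that file to start from (inds_in_sample[k]), without holding every array in memory.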
- Returns:
dataset_df_out – The dataset_df with splits sorted by increasing frequency of labels (see dataset_arrays()), and with columns added containing the npy files for each row.
- Return type: