Datasets

Download and store datasets

MedMNIST Datasets

download_medmnist

 download_medmnist (dataset:str, output_dir:str='.',
                    download_only:bool=False, save_images:bool=True)

Downloads the specified MedMNIST dataset and saves its training, validation, and test splits into the specified output directory. Images are saved as .png for 2D data and as multi-page .tiff for 3D data, organized into folders named after their labels.

Args:
- dataset: The MedMNIST dataset name (e.g., ‘pathmnist’, ‘bloodmnist’).
- output_dir: Path where the images will be saved.
- download_only: If True, only download the dataset, without processing or saving images.
- save_images: If True, save the images in the specified output directory.

Returns:
- None. Images are saved in the specified output directory if save_images is True.

	Type	Default	Details
dataset	str		The name of the MedMNIST dataset (e.g., ‘pathmnist’, ‘bloodmnist’).
output_dir	str	.	The path to the directory where the datasets will be saved.
download_only	bool	False	If True, only download the dataset into the output directory without processing.
save_images	bool	True	If True, save the images into the output directory as .png (2D datasets) or multipage .tiff (3D datasets) files.
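
To make the on-disk organization described above more concrete, the sketch below builds the path one saved image might end up at. It is only an illustration: the helper name `target_path`, the per-split subfolder, and the zero-padded file name are assumptions, not the layout that download_medmnist is confirmed to produce. Only the .png/.tiff choice and the label-named folders come from the description above.

```python
from pathlib import Path


def target_path(output_dir, split, label, index, ndim):
    """Return a hypothetical save path for one image.

    Assumed layout (not confirmed by the library): <output_dir>/<split>/<label>/<index>.<ext>
    """
    # 2D images are saved as .png, 3D volumes as multi-page .tiff
    if ndim == 2:
        suffix = ".png"
    elif ndim == 3:
        suffix = ".tiff"
    else:
        raise ValueError(f"expected 2D or 3D data, got ndim={ndim}")
    # Images are grouped into folders named after their labels
    return Path(output_dir) / split / str(label) / f"{index:05d}{suffix}"


example = target_path("data/bloodmnist", "train", label=3, index=12, ndim=2)
print(example.as_posix())  # data/bloodmnist/train/3/00012.png
```

Grouping files by label folder is a common convention because many image loaders can infer class labels directly from directory names.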

Download data via Pooch

download_dataset

 download_dataset (base_url, expected_checksums, file_names, output_dir,
                   processor=None)

Download a dataset using Pooch and save it to the specified output directory.

Parameters:
- base_url (str): The base URL from which the files will be downloaded.
- expected_checksums (dict): A dictionary mapping file names to their expected checksums.
- file_names (dict): A dictionary mapping task identifiers to file names.
- output_dir (str): The directory where the downloaded files will be saved.
- processor (callable, optional): A function to process the downloaded data. Defaults to None.
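
The role of expected_checksums can be illustrated without any network access. The sketch below re-checks files already present in an output directory against a dictionary of expected hashes, using only the standard library. It is not how download_dataset or Pooch is implemented: the helper names `sha256_of` and `verify_checksums` are hypothetical, and it assumes the checksums are plain SHA-256 hex digests.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large image archives need not fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(output_dir, expected_checksums):
    """Return the names of files that are missing or whose hash differs.

    Assumes expected_checksums maps file names to SHA-256 hex digests.
    """
    failed = []
    for name, expected in expected_checksums.items():
        path = Path(output_dir) / name
        if not path.is_file() or sha256_of(path) != expected.lower():
            failed.append(name)
    return failed
```

An empty list from `verify_checksums` means every listed file is present and matches its expected digest; any name returned would need to be downloaded again.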

Download data via Quilt/T4

Allen Institute Cell Science (AICS)

aics_pipeline

 aics_pipeline (n_images_to_download=40, image_save_dir=None)

image_target_paths, data_manifest = aics_pipeline(1, "../_data/aics")

Loading manifest: 100%|██████████| 77165/77165 [00:01<00:00, 44.1k/s]

print(image_target_paths)
data_manifest #.to_csv('aics_dataset.csv')

[]

	ProteinDisplayName	StructureSegmentationAlgorithmVersion	WorkflowId	NucMembSegmentationAlgorithm	CellIndex	Gene	WellId	StructureShortName	NucMembSegmentationAlgorithmVersion	WellName	...	Clone	Col	StructureDisplayName	DataSetId	ChannelNumber638	ChannelNumberBrightfield	PlateId	StructEducationName	SourceReadPath	FeatureExplorerURL
4131	Tom20	51	1	Matlab nucleus/membrane segmentation	1	TOMM20	24822	Mitochondria	1.3.0	E6	...	27	5	Mitochondria	3	1	6	3500001004	NaN	fovs/6677e50c_3500001004_100X_20170623_5-Scene...	https://cfe.allencell.org/?selectedPoint[0]=18...

1 rows × 47 columns

Dataset Manifest

Create a manifest of all files in CSV format.

manifest2csv

 manifest2csv (paths, data_manifest, signal, target, train_fraction=0.8,
               data_save_path_train='./train.csv',
               data_save_path_test='./test.csv')
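
The signature above suggests that records are divided into a training and a test CSV according to train_fraction. A minimal sketch of that idea follows, using only the standard library. It is not the manifest2csv implementation: the function name `split_and_save`, the `seed` parameter, the shuffling step, and the representation of rows as dictionaries are all assumptions made for illustration.

```python
import csv
import random


def split_and_save(rows, fieldnames, train_fraction=0.8,
                   train_path="./train.csv", test_path="./test.csv", seed=0):
    """Shuffle rows, split them by train_fraction, and write two CSV files.

    Returns the number of rows written to the train and test files.
    """
    rows = list(rows)
    # Shuffling with a fixed seed is an assumption; it keeps the split reproducible
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_fraction)
    splits = {train_path: rows[:n_train], test_path: rows[n_train:]}
    for path, subset in splits.items():
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
    return n_train, len(rows) - n_train
```

For example, calling `split_and_save(rows, ["signal", "target"])` with ten rows, each a dictionary holding a hypothetical signal and target image path, would write eight rows to ./train.csv and two to ./test.csv.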