Data_Engineering: Medical Imaging Cleanup Pipeline
Standardize diverse medical imaging datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory targets one dataset.
Companion repo to DRDMsig/Omini3D; it produces the standardized data that OmniMorph trains on.
Supported Datasets
| Subdirectory | Dataset | Modality |
|---|---|---|
| `AbdomenAtlas/` | AbdomenAtlas | CT |
| `AbdomenCT1k/` | AbdomenCT-1K | CT |
| `brats2019_clean/` | BraTS 2019 | MRI (multi-sequence) |
| `brats2020_clean/` | BraTS 2020 | MRI (multi-sequence) |
| `brats2021_clean/` | BraTS 2021 | MRI (multi-sequence) |
| `kaggle_osic_clean/` | Kaggle OSIC Pulmonary Fibrosis | CT |
| `MnM2_clean/` | M&Ms-2 | Cardiac MRI |
| `MnMs_clean/` | M&Ms | Cardiac MRI |
| `OAISIS_clean/` | OASIS-1 / OASIS-2 | Brain MRI |
| `OAI_ZIB_clean/` | OAI-ZIB (knee) | MRI |
| `PSMA_clean/` | PSMA-FDG PET-CT (longitudinal) | PET + CT |
| `all/` | Cross-dataset utilities (artifact plane removal) | N/A |
Each cleaned dataset writes:
- Resampled & clamped `.nii.gz` images / segmentations
- Per-dataset `nifti_mappings.json`
- `failed_files.json` listing files the cleaner could not process
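The two JSON outputs suggest a "keep going on failure" loop. The sketch below illustrates that pattern only; it is not the repo's code. The helper name `run_cleaner`, the `process` callback, and the JSON layout (lists of records with `file` / `error` keys) are all assumptions, and the real files may be structured differently.

```python
import json
from pathlib import Path


def run_cleaner(files, process, output_dir):
    """Apply `process` to each file and record successes and failures as JSON.

    `process` is a hypothetical per-file callback that returns one metadata
    record (a dict) or raises if the file cannot be cleaned.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    mappings, failed = [], []
    for path in files:
        try:
            mappings.append(process(path))
        except Exception as exc:
            # One unreadable file should not abort the whole dataset run.
            failed.append({"file": str(path), "error": str(exc)})
    (output_dir / "nifti_mappings.json").write_text(json.dumps(mappings, indent=2))
    (output_dir / "failed_files.json").write_text(json.dumps(failed, indent=2))
    return mappings, failed
```

Writing both files even when `failed` is empty keeps downstream checks simple: an empty list means a clean run.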
Repository Layout
<dataset>_clean/
├── dataclean_<dataset>.py        # main cleanup script (use highest version: _v2.py, _v3.py, ...)
├── util.py                       # shared helpers (copied per dir, not imported)
├── config_format.json            # metadata schema for `meta_data` validation
└── (optional) sample/, demo/     # tiny example NIfTI files for sanity checks
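Picking the highest-versioned script can be automated. The sketch below is not part of the repo. It assumes script names follow the `dataclean_<dataset>.py` / `dataclean_<dataset>_vN.py` pattern shown in the layout and treats an unsuffixed script as version 1. The helper names `script_version` and `latest_script` are hypothetical.

```python
import re

# Matches dataclean_<dataset>.py and dataclean_<dataset>_vN.py (assumed naming pattern).
_SCRIPT_RE = re.compile(r"^dataclean_(?P<name>.+?)(?:_v(?P<ver>\d+))?\.py$")


def script_version(filename: str) -> int:
    """Return the version number encoded in a script name (unsuffixed counts as 1)."""
    match = _SCRIPT_RE.match(filename)
    if match is None:
        raise ValueError(f"not a cleanup script: {filename}")
    return int(match.group("ver") or 1)


def latest_script(filenames):
    """Pick the cleanup script with the highest version from a list of file names."""
    candidates = [f for f in filenames if _SCRIPT_RE.match(f)]
    if not candidates:
        raise FileNotFoundError("no dataclean_*.py script found")
    # Compare versions numerically so that _v10 sorts above _v2.
    return max(candidates, key=script_version)
```

For example, `latest_script(os.listdir("AbdomenAtlas"))` would return the newest script name in that directory, provided the naming assumption holds.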
Usage
cd AbdomenAtlas/
python dataclean_abdomen_atlas_v2.py \
--target_path /path/to/raw/AbdomenAtlas \
--output_dir /path/to/output/AbdomenAtlas_clean
All scripts share the --target_path / --output_dir interface. Versioned scripts (_v2.py, _v3.py) supersede older versions; use the highest version unless investigating regressions.
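The shared interface could be expressed with `argparse` roughly as follows. This is a minimal sketch, not the scripts' actual parser: only the two flag names come from this README. Marking both flags as required, converting them to `Path`, and the help strings are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two flags every cleanup script shares."""
    parser = argparse.ArgumentParser(description="Clean one raw imaging dataset.")
    parser.add_argument("--target_path", type=Path, required=True,
                        help="directory holding the raw dataset")
    parser.add_argument("--output_dir", type=Path, required=True,
                        help="directory that receives the cleaned output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"reading from {args.target_path}, writing to {args.output_dir}")
```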
Pipeline (per dataset)
- Load raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, NRRD).
- Extract metadata from headers, CSV files, or DICOM tags.
- Resample to isotropic spacing (`get_unisize_resampler` in `util.py`).
- Clamp intensities. CT: `[-300, 300]` HU; MRI: per-dataset windows.
- Process segmentation labels with identical resampling (nearest-neighbor).
- Validate image/label dimensions agree (`assert image.GetSize() == label.GetSize()`).
- Write standardized `.nii.gz` and append to `nifti_mappings.json`.
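The arithmetic behind the resampling, clamping, and size-check steps can be shown in plain Python, without SimpleITK. This is an illustration under stated assumptions, not the repo's implementation: the helper names are hypothetical, and taking the smallest input spacing as the isotropic target is an assumption (the actual target used by `get_unisize_resampler` is not documented here).

```python
def isotropic_geometry(size, spacing, target=None):
    """Return (new_size, new_spacing) covering the same physical extent.

    Assumption: when no target is given, use the smallest input spacing.
    """
    if target is None:
        target = min(spacing)
    # Physical extent per axis is size * spacing; divide by the new spacing.
    new_size = [max(1, round(n * s / target)) for n, s in zip(size, spacing)]
    return new_size, [float(target)] * len(size)


def clamp_voxels(voxels, lo, hi):
    """Clip every intensity into the closed window [lo, hi]."""
    return [min(max(v, lo), hi) for v in voxels]


def check_same_size(image_size, label_size):
    """Mirror the pipeline's image/label size check on plain size tuples."""
    if tuple(image_size) != tuple(label_size):
        raise ValueError(f"size mismatch: {image_size} vs {label_size}")


CT_WINDOW = (-300, 300)  # HU window quoted in the pipeline description

size, spacing = isotropic_geometry([256, 256, 40], [1.0, 1.0, 2.5])
check_same_size(size, size)  # image and label are resampled onto one grid
print(size, spacing)                                 # [256, 256, 100] [1.0, 1.0, 1.0]
print(clamp_voxels([-1000, 42, 1200], *CT_WINDOW))   # [-300, 42, 300]
```

Labels use nearest-neighbor interpolation in the pipeline because interpolating between integer class IDs would create label values that do not correspond to any class.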
Shared util.py API
| Function / class | Purpose |
|---|---|
| `meta_data` | Validates metadata against `config_format.json`; required fields: `Modality`, `OriImg_path`, `Spacing_mm`, `Size`, `Dataset_name`. Normalizes ambiguous terminology via synonym dictionaries. |
| `get_unisize_resampler(image)` | Builds a SimpleITK resampler for isotropic spacing; returns `None` if already isotropic. |
| `clamp_image(image, lo, hi)` | HU/intensity clamping via `sitk.ClampImageFilter`. |
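The kind of check `meta_data` performs can be sketched as below. Only the five required field names come from the table above. The function name `validate_metadata` and the contents of `MODALITY_SYNONYMS` are hypothetical, and the real helper validates against `config_format.json`, which this sketch does not read.

```python
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")

# Hypothetical synonym table; the repo's actual dictionaries are not shown here.
MODALITY_SYNONYMS = {"ct": "CT", "mr": "MRI", "mri": "MRI", "pt": "PET", "pet": "PET"}


def validate_metadata(record: dict) -> dict:
    """Check that required fields exist and normalize the modality name."""
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise KeyError(f"missing required metadata fields: {missing}")
    cleaned = dict(record)  # leave the caller's dict untouched
    lookup = str(record["Modality"]).strip().lower()
    # Unknown modality strings pass through unchanged rather than being guessed.
    cleaned["Modality"] = MODALITY_SYNONYMS.get(lookup, record["Modality"])
    return cleaned
```

Failing loudly on a missing field, rather than filling in a default, keeps incomplete records out of `nifti_mappings.json`.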
Dependencies
pip install SimpleITK pandas numpy tqdm openpyxl
(No requirements.txt; install manually.)
What's Included / Excluded
- Included: cleanup scripts, `util.py`, `config_format.json`, demographic CSVs.
- Included: a handful of tiny demo / sample `.nii.gz` files in `PSMA_clean/{sample,demo}/`.
- Excluded: raw datasets (download from each dataset's official source).
- Excluded: run logs from prior cleanup runs (`*.log`).
- Excluded: intermediate test outputs (`MnM2_clean/test/`).
License
MIT; see project root.