Data_Engineering: Medical Imaging Cleanup Pipeline

Standardize diverse medical imaging datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory targets one dataset.

Companion repo to DRDMsig/Omini3D. It produces the standardized data that OmniMorph trains on.

Supported Datasets

| Subdirectory | Dataset | Modality |
|---|---|---|
| AbdomenAtlas/ | AbdomenAtlas | CT |
| AbdomenCT1k/ | AbdomenCT-1K | CT |
| brats2019_clean/ | BraTS 2019 | MRI (multi-sequence) |
| brats2020_clean/ | BraTS 2020 | MRI (multi-sequence) |
| brats2021_clean/ | BraTS 2021 | MRI (multi-sequence) |
| kaggle_osic_clean/ | Kaggle OSIC Pulmonary Fibrosis | CT |
| MnM2_clean/ | M&Ms-2 | Cardiac MRI |
| MnMs_clean/ | M&Ms | Cardiac MRI |
| OAISIS_clean/ | OASIS-1 / OASIS-2 | Brain MRI |
| OAI_ZIB_clean/ | OAI-ZIB (knee) | MRI |
| PSMA_clean/ | PSMA-FDG PET-CT (longitudinal) | PET + CT |
| all/ | Cross-dataset utilities (artifact plane removal) | n/a |

Each cleaned dataset writes:

  • Resampled & clamped .nii.gz images / segmentations
  • Per-dataset nifti_mappings.json
  • failed_files.json listing files the cleaner could not process
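
The schema of these two JSON files is not documented here. As a minimal sketch, assuming nifti_mappings.json holds a list of per-image records and failed_files.json a list of path/error pairs (both layouts and the helper name write_reports are hypothetical), a cleaner could write them like this:

```python
import json
from pathlib import Path


def write_reports(output_dir, mappings, failures):
    """Write the two per-dataset JSON reports into output_dir.

    The record layouts are assumptions for illustration, not the
    repository's actual schema.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    mapping_path = out / "nifti_mappings.json"
    failed_path = out / "failed_files.json"

    # One JSON document per file, indented so diffs between runs stay readable.
    mapping_path.write_text(json.dumps(mappings, indent=2))
    failed_path.write_text(json.dumps(failures, indent=2))
    return mapping_path, failed_path
```

Writing both files at the end of a run, rather than appending per image, keeps each file valid JSON even if the run is interrupted partway through; whether the actual scripts do this is not stated.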

Repository Layout

<dataset>_clean/
├── dataclean_<dataset>.py      # main cleanup script (use highest version: _v2.py, _v3.py, ...)
├── util.py                      # shared helpers (copied per dir, not imported)
├── config_format.json           # metadata schema for `meta_data` validation
└── (optional) sample/, demo/    # tiny example NIfTI files for sanity checks

Usage

cd AbdomenAtlas/
python dataclean_abdomen_atlas_v2.py \
    --target_path /path/to/raw/AbdomenAtlas \
    --output_dir  /path/to/output/AbdomenAtlas_clean

All scripts share the --target_path / --output_dir interface. Versioned scripts (_v2.py, _v3.py) supersede older versions; use the highest version unless investigating regressions.
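
The shared interface could be declared with argparse roughly as follows. This is a sketch under the assumption that both flags are required strings; the parse_args helper and the help strings are hypothetical, and individual scripts may accept further options.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags every cleanup script is described as accepting."""
    parser = argparse.ArgumentParser(description="Clean one raw imaging dataset.")
    parser.add_argument(
        "--target_path",
        required=True,
        help="Root directory of the raw dataset.",
    )
    parser.add_argument(
        "--output_dir",
        required=True,
        help="Directory for cleaned .nii.gz files and JSON reports.",
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Accepting an explicit argv makes the parser easy to exercise from a test or a notebook without touching sys.argv.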

Pipeline (per dataset)

  1. Load raw data (DICOM via sitk.ImageSeriesReader, NIfTI via sitk.ReadImage, NRRD).
  2. Extract metadata from headers, CSV files, or DICOM tags.
  3. Resample to isotropic spacing (get_unisize_resampler in util.py).
  4. Clamp intensities. CT: [-300, 300] HU; MRI: per-dataset windows.
  5. Resample segmentation labels onto the same grid as the image, using nearest-neighbor interpolation so label values are not blended.
  6. Validate image/label dimensions agree (assert image.GetSize() == label.GetSize()).
  7. Write standardized .nii.gz and append to nifti_mappings.json.
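
Steps 3 to 6 can be sketched as follows. This is an illustration under assumptions, not the repository's code: the 1.0 mm target spacing, the linear interpolator for images, and the helper names isotropic_size, resample_to_isotropic, and resample_pair are hypothetical. SimpleITK is required by the pipeline; the import is guarded here only so the size arithmetic can be tried on its own.

```python
try:
    import SimpleITK as sitk
except ImportError:  # the resampling helpers below need SimpleITK installed
    sitk = None


def isotropic_size(size, spacing, target_mm=1.0):
    """Voxel counts that keep the physical extent at isotropic spacing."""
    return [max(1, int(round(n * s / target_mm))) for n, s in zip(size, spacing)]


def resample_to_isotropic(image, target_mm, interpolator):
    """Resample a SimpleITK image onto an isotropic grid with the same origin."""
    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing([target_mm] * image.GetDimension())
    resampler.SetSize(isotropic_size(image.GetSize(), image.GetSpacing(), target_mm))
    resampler.SetOutputOrigin(image.GetOrigin())
    resampler.SetOutputDirection(image.GetDirection())
    resampler.SetInterpolator(interpolator)
    return resampler.Execute(image)


def resample_pair(image, label, target_mm=1.0, lo=-300.0, hi=300.0):
    """Resample image and label, clamp the image, and check sizes agree."""
    image_iso = resample_to_isotropic(image, target_mm, sitk.sitkLinear)

    # Clamp intensities to [lo, hi]; the defaults mirror the CT window above.
    clamp = sitk.ClampImageFilter()
    clamp.SetLowerBound(float(lo))
    clamp.SetUpperBound(float(hi))
    image_iso = clamp.Execute(image_iso)

    # Nearest-neighbor keeps label values discrete.
    label_iso = resample_to_isotropic(label, target_mm, sitk.sitkNearestNeighbor)

    assert image_iso.GetSize() == label_iso.GetSize()
    return image_iso, label_iso
```

For example, a 512 x 512 x 100 volume at 0.5 x 0.5 x 2.0 mm spacing maps to 256 x 256 x 200 voxels at 1 mm. The size assertion only holds when the image and label share the same input geometry, which is what step 6 is checking.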

Shared util.py API

| Function / class | Purpose |
|---|---|
| meta_data | Validates metadata against config_format.json; required fields: Modality, OriImg_path, Spacing_mm, Size, Dataset_name. Normalizes ambiguous terminology via synonym dictionaries. |
| get_unisize_resampler(image) | Builds a SimpleITK resampler for isotropic spacing; returns None if already isotropic. |
| clamp_image(image, lo, hi) | HU/intensity clamping via sitk.ClampImageFilter. |
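
As a rough illustration of the kind of check meta_data performs, the sketch below verifies the five required fields and normalizes a modality synonym. The synonym table, the function name validate_meta, and the error behavior are hypothetical; the real class reads its schema from config_format.json.

```python
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")

# Hypothetical synonym table; the real dictionaries live in util.py.
MODALITY_SYNONYMS = {"mr": "MRI", "mri": "MRI", "ct": "CT", "pt": "PET", "pet": "PET"}


def validate_meta(record):
    """Return a copy of record with a normalized Modality.

    Raises ValueError if any required field is absent.
    """
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    cleaned = dict(record)  # shallow copy so the caller's dict is untouched
    modality = str(record["Modality"]).strip()
    # Unknown modality strings pass through unchanged.
    cleaned["Modality"] = MODALITY_SYNONYMS.get(modality.lower(), modality)
    return cleaned
```

Failing fast on a missing field lets a cleaner record the file in failed_files.json instead of writing an incomplete entry to nifti_mappings.json.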

Dependencies

pip install SimpleITK pandas numpy tqdm openpyxl

(No requirements.txt; install manually.)

What's Included / Excluded

  • ✅ Cleanup scripts, util.py, config_format.json, demographic CSVs.
  • ✅ A handful of tiny demo / sample .nii.gz files in PSMA_clean/{sample,demo}/.
  • ❌ Raw datasets (download from each dataset's official source).
  • ❌ Run logs from prior cleanup runs (*.log).
  • ❌ Intermediate test outputs (MnM2_clean/test/).

License

MIT; see project root.
