
10 Best Free Healthcare Datasets for Machine Learning

This blog post was written by Inderjit Singh as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that's accepted by our reviewers.

Introduction

Open-source datasets are critical for Machine Learning models. They can be used either to establish a baseline benchmark for a new architecture or to train an existing architecture and build a fundamental understanding of the problem. Since training from scratch requires a significant number of training examples, an open-source dataset of considerable size can help bootstrap your training process and avoid the cost of data annotation. Free datasets of this kind are hard to find in healthcare, because the data contains a significant amount of PII (Personally Identifiable Information) and is subject to regulatory constraints that make gathering it particularly expensive. In this blog post, we list the top 10 healthcare datasets, selected for their authenticity, ease of use, and completeness, that can be used for a wide variety of Machine Learning implementations.

We categorized these datasets by Machine Learning application area: computer vision (e.g., 3D imaging, CT scans, X-rays) and NLP, followed by tabular and time series datasets.

Computer Vision and Natural Language Processing (NLP)

1. MedPix

MedPix is a free-to-access healthcare database for Machine Learning, consisting of medical images, teaching cases, and clinical topics. It is suitable for use cases that combine Computer Vision and NLP. The content is organized by disease location (the organ system to which a disease belongs) and patient profile, among other criteria. Image Classification and Image Captioning components are also available if you need them.

Details:
– Total number of cases: 12,000
– Total number of topics: 9,000
– Total number of images: 59,000
– Detailed description paper: Accessible here.
– Publisher: National Library of Medicine.

2. The Cancer Imaging Archive (TCIA)

TCIA does not technically qualify as a single dataset, since it is a large archive of a wide variety of cancer-related image datasets. The archive is public and free to download. The data is organized into collections, which are categorized along the dimensions listed below.

Details:
– Total number of collections: 172
– Collections are categorized by:
  – Common disease (e.g., lung cancer)
  – Image modality or type (e.g., MRI, CT scan, Digital Histopathology/DH)
  – Research focus
– Publisher: National Cancer Institute

3. COVID-19 X-ray Dataset

This dataset contains AP (Anterior-Posterior) and PA (Posterior-Anterior) chest X-ray images with pixel-level polygon lung segmentations. Each image shows both lungs along with their segmentation mask polygons, including the posterior region behind the heart. Each image is tagged with the type of pneumonia (viral, bacterial, or fungal) or as healthy, where a healthy tag means that no pneumonia is present in the image. If the patient has been tested for COVID-19, additional tags such as age, gender, temperature, and incubation status are available.

Details:
– Total number of images: 6,500
– Total number of COVID-positive patients: 517
– File types: png, zip, txt
– Dataset formats available: COCO, VOC, or Darwin JSON
– Image resolution: largest 5600×4700, smallest 156×156
– Unlabelled masks: 63 (recommended to be discarded for x-ray analysis)
– Publisher: Darwin Labs
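
Since the annotations can be exported in COCO format, a quick sanity check is to separate images that have lung polygons from those that do not, and to measure polygon areas. The following is a minimal sketch that assumes the standard COCO JSON layout (`images`, `annotations`, and a flat `segmentation` coordinate list). The actual export from this dataset may differ, and the sample below is hand-made rather than taken from the dataset.

```python
import json


def polygon_area(flat_xy):
    """Shoelace area of a COCO-style polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


def split_labelled(coco):
    """Return (labelled, unlabelled) file names, based on polygon presence."""
    with_polygons = {
        ann["image_id"]
        for ann in coco.get("annotations", [])
        if ann.get("segmentation")
    }
    labelled, unlabelled = [], []
    for img in coco.get("images", []):
        target = labelled if img["id"] in with_polygons else unlabelled
        target.append(img["file_name"])
    return labelled, unlabelled


# Hand-made sample in the standard COCO layout (not real data).
sample = json.loads("""
{
  "images": [
    {"id": 1, "file_name": "a.png", "width": 156, "height": 156},
    {"id": 2, "file_name": "b.png", "width": 156, "height": 156}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "segmentation": [[0, 0, 10, 0, 10, 20, 0, 20]]}
  ],
  "categories": [{"id": 1, "name": "lung"}]
}
""")

labelled, unlabelled = split_labelled(sample)
first_polygon = sample["annotations"][0]["segmentation"][0]
print("labelled:", labelled)
print("unlabelled (candidates to discard):", unlabelled)
print("area of first polygon in pixels:", polygon_area(first_polygon))
```

Images that end up in the unlabelled list correspond to the kind of entries the details above recommend discarding before X-ray analysis.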



4. COVID-19 Image Dataset

This dataset contains cleaned images of COVID-19-positive patients, along with images of other viral pneumonia cases and normal chest X-rays that are neither COVID-positive nor diagnosed with any other pneumonia.

Details:
– Total number of images: 317
– Total number of COVID-positive images: 137
– EDA and basic working code: Accessible here
– Total Dataset size: 169.64 MB
– Classes: Covid, Normal, Viral Pneumonia
– File Type: jpeg
– Publisher: University of Montreal

5. CT Medical Images

This dataset was created to allow a wide variety of methods to be tested for examining trends in CT (Computed Tomography) images. It is a subset of images from The Cancer Imaging Archive, consisting of the middle slice of every CT image that was taken with valid tags such as age, modality, and contrast.

Details:
– Total number of images: 475
– Unique patients: 69
– Unique tags: 3
– Basic EDA: Accessible here
– File Type: CSV, tiff, npz
– Total Dataset size: 458.15 MB
– Publisher: National Cancer Institute
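
To illustrate the selection rule described above (keep one middle slice per CT scan, and only for scans with valid tags), here is a minimal sketch. The record layout and the field names `scan_id`, `slice_index`, `age`, `modality`, and `contrast` are hypothetical and chosen for illustration only; this is not the procedure the dataset authors actually ran.

```python
from collections import defaultdict

# Hypothetical tag names; the description only mentions age, modality, contrast.
REQUIRED_TAGS = ("age", "modality", "contrast")


def middle_slices(records):
    """Return {scan_id: middle slice record} for scans whose tags are all set."""
    by_scan = defaultdict(list)
    for rec in records:
        by_scan[rec["scan_id"]].append(rec)

    picked = {}
    for scan_id, slices in by_scan.items():
        # Skip the whole scan if any required tag is missing on any slice.
        if any(s.get(tag) is None for s in slices for tag in REQUIRED_TAGS):
            continue
        ordered = sorted(slices, key=lambda s: s["slice_index"])
        # For an even number of slices this takes the upper of the two middles.
        picked[scan_id] = ordered[len(ordered) // 2]
    return picked


def make_scan(scan_id, n_slices, age, modality="CT", contrast=True):
    """Build made-up slice records for one scan (illustration only)."""
    return [
        {"scan_id": scan_id, "slice_index": i, "age": age,
         "modality": modality, "contrast": contrast}
        for i in range(n_slices)
    ]


records = (
    make_scan("scan_a", 5, age=61)
    + make_scan("scan_b", 4, age=None)          # missing age: excluded
    + make_scan("scan_c", 4, age=47, contrast=False)
)
selected = middle_slices(records)
for scan_id, rec in sorted(selected.items()):
    print(scan_id, "-> slice", rec["slice_index"])
```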

6. OASIS dataset

The OASIS (Open Access Series of Imaging Studies) project is focused on making neuroimaging datasets freely available to the scientific community. There are four variations of this dataset: OASIS-1, OASIS-2, OASIS-3, and OASIS-4.

Details:
– Total number of subjects: 416
– Age range of subjects: 18 – 96
– Subjects diagnosed with AD (Alzheimer’s disease): 100
– Subjects with images within 90 days of the initial session: 20
– Missing Rows: Yes
– Dataset size compressed: 15.8 GB
– Dataset size uncompressed: 50 GB
– Detailed paper: Accessible here
– Publisher: Howard Hughes Medical Institute 

7. Musculoskeletal Radiographs (MURA)

MURA is one of the largest publicly available datasets for distinguishing a normal X-ray study from an abnormal one.

Details:
– Total number of studies: 14,863
– Total number of patients: 12,173
– Total number of multi-view radiographic images: 40,561
– Detailed paper: Accessible here
– Publisher: Stanford ML Group

Tabular and Time Series (Non-CV)

8. Merck Molecular Health Activity Challenge

This dataset is designed to advance the practice of Machine Learning implementations in the field of drug discovery through simulated molecular interactions.

Details:
– Database size: 123.5 MB
– Missing data present: Yes
– File types: zip, CSV
– Publisher:

9. 1000 Genomes Project

This is one of the biggest publicly accessible genome libraries. The dataset was created through international collaboration and is hosted on AWS for public use.

Details:
– Total number of individuals: 2500
– Unique Populations: 26
– Total Variants: 80 million
– File types: BAM, CRAM, VCF, BAS
– Documentation Details: Accessible here
– Detailed paper: Accessible here
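
Of the file types listed above, VCF is the plain-text one: lines starting with `#` are headers, and each remaining line is a tab-separated variant record whose first eight columns are fixed by the VCF specification. The sketch below reads those fixed columns using only the standard library. The sample records are made up, real files from the project are large and usually compressed, and for serious work a dedicated VCF library is likely a better choice.

```python
import io
from collections import Counter

# The eight fixed columns defined by the VCF specification.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def iter_variants(handle):
    """Yield one dict per variant line, skipping header and blank lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        # Multiple alternate alleles are comma-separated.
        record["ALT"] = record["ALT"].split(",")
        yield record


# Made-up records for illustration (not real 1000 Genomes data).
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["1", "1000", "rs_example1", "A", "G", "50", "PASS", "AF=0.25"]),
    "\t".join(["1", "2000", "rs_example2", "C", "T,G", "99", "PASS", "AF=0.10,0.05"]),
    "\t".join(["2", "1500", ".", "G", "GA", "20", "q10", "AF=0.01"]),
])

variants = list(iter_variants(io.StringIO(SAMPLE_VCF)))
per_chrom = Counter(v["CHROM"] for v in variants)
print("variants per chromosome:", dict(per_chrom))
```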

10. Inpatient Rehab General Dataset

This database is made available by the U.S. government, with contributions from various healthcare organizations. The datasets are meant to support analytics that give insight into the service side of hospital care. This specific dataset describes the characteristics of the inpatient rehabilitation facilities listed on the medicare.gov website.

Details:
– Total number of rows: 1196
– Data Type: Tabular Data
– Publisher: Centers for Medicare & Medicaid Services (CMS)
– Missing data present: Yes
– File Type: CSV
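
Because the CSV contains missing data, it is worth counting the gaps per column before modeling. Here is a minimal sketch using only the standard library. The column names and the set of strings treated as "missing" are assumptions for illustration, so check the actual file for the markers it uses.

```python
import csv
import io

# Strings treated as missing; an assumption, adjust to the real file.
MISSING_MARKERS = {"", "NA", "N/A", "Not Available"}


def missing_per_column(handle):
    """Return (row_count, {column: number of missing cells})."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in reader.fieldnames or []}
    rows = 0
    for row in reader:
        rows += 1
        for name in counts:
            value = row.get(name)
            # Short rows yield None for the trailing columns.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return rows, counts


# Made-up rows with hypothetical column names (not the real CMS schema).
SAMPLE_CSV = """facility_name,state,num_beds
Alpha Rehab,TX,40
Beta Rehab,,Not Available
Gamma Rehab,CA,
"""

row_count, missing = missing_per_column(io.StringIO(SAMPLE_CSV))
print("rows:", row_count)
for column, n in missing.items():
    print(f"{column}: {n} missing")
```

To run this on the downloaded file, replace the `io.StringIO(...)` argument with `open(path, newline="")`, where `path` points to the CSV.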

Bonus: Data Aggregators

This list is a selected subset of all the open-source and free medical datasets. A number of other open-source health datasets can be found by exploring aggregators like:
– OpenNEURO
– Kaggle
– UCI Machine Learning Repo
