Generate Healthcare Synthetic Data in Python
Healthcare data is among the most sensitive in existence — HIPAA, GDPR, and a dozen other regulations govern who can access real patient records. Yet developers building EHR systems, researchers training clinical ML models, and teams building healthcare analytics tools all need realistic patient data that behaves like the real thing. Misata generates fully synthetic healthcare data: patients with statistically accurate blood type distributions, doctors with realistic specialty assignments, and appointments with realistic no-show rates and duration distributions.
No real patient records are ever used or exposed. Every name, date of birth, and diagnosis is generated from statistical priors — realistic enough to power your analytics queries, safe enough to share in any environment.
import misata
tables = misata.generate("A hospital with 500 patients and 50 doctors", rows=500, seed=42)
print(list(tables.keys())) # ['doctors', 'patients', 'appointments']
print(tables["patients"][["blood_type", "diagnosis"]].head())
What Misata generates
Three tables: doctors, patients, and appointments. Appointments reference both a patient and a doctor, enforcing complete referential integrity. Patient demographics match real-world chronic care population distributions.
Tables and columns
| Table | Key columns |
|---|---|
| doctors | doctor_id, name, specialty, department, years_experience, rating |
| patients | patient_id, name, date_of_birth, blood_type, gender, diagnosis, insurance_provider |
| appointments | appointment_id, patient_id, doctor_id, scheduled_at, duration_minutes, status, notes |
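The foreign-key relationship between the three tables can be checked with plain pandas. The sketch below is an illustration, not Misata's own validation logic: the small DataFrames are hand-built stand-ins that reuse the key column names listed above, and `find_orphans` is a hypothetical helper name.

```python
import pandas as pd

# Hand-built stand-ins for the generated tables (assumption: real output
# would come from misata.generate and use the same key column names).
doctors = pd.DataFrame({"doctor_id": [1, 2], "specialty": ["cardiology", "pediatrics"]})
patients = pd.DataFrame({"patient_id": [10, 11, 12], "blood_type": ["O+", "A+", "B+"]})
appointments = pd.DataFrame({
    "appointment_id": [100, 101, 102, 103],
    "patient_id": [10, 11, 12, 10],
    "doctor_id": [1, 2, 1, 2],
    "status": ["completed", "no_show", "completed", "cancelled"],
})


def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows of `child` whose `key` value has no match in `parent`."""
    return child[~child[key].isin(parent[key])]


orphan_patients = find_orphans(appointments, patients, "patient_id")
orphan_doctors = find_orphans(appointments, doctors, "doctor_id")
print(len(orphan_patients), len(orphan_doctors))

# With both foreign keys intact, appointments can be joined into one frame.
# validate="many_to_one" raises if a key is duplicated in the parent table.
joined = (
    appointments
    .merge(patients, on="patient_id", how="left", validate="many_to_one")
    .merge(doctors, on="doctor_id", how="left", validate="many_to_one")
)
```

Swapping the stand-ins for `tables["doctors"]`, `tables["patients"]`, and `tables["appointments"]` should give the same check against generated data, assuming the column names match the table.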
Realistic distributions
- Blood types match real ABO/Rh frequencies: O+ 37.4%, A+ 35.7%, B+ 8.5%, AB+ 3.4%, and negative variants — not uniform random
- Patient ages are centered on a chronic-care population (μ=52, σ=18), not uniformly distributed from 0–100
- No-show rate is ~15%, matching published hospital no-show statistics
- Doctor specialties are drawn from a realistic distribution: internal medicine, cardiology, orthopedics, general surgery, pediatrics, and more
- Appointment duration is lognormal with a median of ~25 minutes: shorter for follow-ups, longer for new patient visits
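To make these priors concrete, here is a minimal NumPy sketch of how columns with the stated shapes could be sampled. It is not Misata's implementation. Two choices are assumptions made for illustration: the lognormal spread `sigma=0.4` (the text only gives the median), and the negative blood types being lumped into one bucket holding the remaining 15%.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Blood types: the four positive frequencies quoted above; the remainder
# (about 15%) is lumped into one "negative" bucket as a simplification.
blood_labels = ["O+", "A+", "B+", "AB+", "negative"]
blood_probs = np.array([0.374, 0.357, 0.085, 0.034])
blood_probs = np.append(blood_probs, 1.0 - blood_probs.sum())
blood_type = rng.choice(blood_labels, size=n, p=blood_probs)

# Ages: normal with mean 52 and standard deviation 18, clipped to 0-100.
age = np.clip(rng.normal(loc=52, scale=18, size=n), 0, 100)

# Durations: a lognormal's median is exp(mean of the underlying normal),
# so mean=log(25) gives a median of 25 minutes. sigma=0.4 is an assumption.
duration_minutes = rng.lognormal(mean=np.log(25), sigma=0.4, size=n)

# No-shows: independent Bernoulli draws at the 15% rate.
no_show = rng.random(n) < 0.15

print("O+ share:", round((blood_type == "O+").mean(), 3))
print("median duration:", round(float(np.median(duration_minutes)), 1))
print("no-show rate:", round(no_show.mean(), 3))
```

The lognormal is right-skewed, so the mean duration sits above the 25-minute median. This matches the intuition that a few long new-patient visits pull the average up.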
Quick start
import misata
tables = misata.generate("A hospital with 500 patients and 50 doctors", rows=500, seed=42)
# Blood type distribution matches real ABO/Rh frequencies
print(tables["patients"]["blood_type"].value_counts(normalize=True).head())
# O+ 0.374
# A+ 0.357
# B+ 0.085
# ...
# Appointment status breakdown
print(tables["appointments"]["status"].value_counts(normalize=True))
# completed 0.72
# no_show 0.15
# cancelled 0.08
# scheduled 0.05
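Rather than eyeballing printed shares, you can compare a column against its target distribution programmatically. The following is a small sketch under stated assumptions: `max_share_deviation` is a hypothetical helper, not part of Misata, and the status column is a hand-built stand-in that reproduces the shares shown above.

```python
import pandas as pd


def max_share_deviation(column: pd.Series, expected: dict) -> float:
    """Largest absolute gap between observed and expected category shares."""
    observed = column.value_counts(normalize=True)
    return float(max(abs(observed.get(cat, 0.0) - share) for cat, share in expected.items()))


# Stand-in status column built to match the shares quoted above (assumption:
# real data would be tables["appointments"]["status"]).
status = pd.Series(
    ["completed"] * 72 + ["no_show"] * 15 + ["cancelled"] * 8 + ["scheduled"] * 5
)
expected = {"completed": 0.72, "no_show": 0.15, "cancelled": 0.08, "scheduled": 0.05}

deviation = max_share_deviation(status, expected)
print(f"max deviation from target shares: {deviation:.3f}")
```

On generated data, some sampling noise is expected, so a tolerance of a few percentage points is a more realistic threshold than exact equality.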
Common use cases
- EHR system development — populate a test database with patients, appointments, and doctor schedules before your healthcare app goes live
- Clinical ML model training — generate training data for readmission prediction, no-show prediction, or diagnosis classification with realistic demographic distributions
- Healthcare analytics dashboards — build utilization reports, specialty throughput charts, and appointment funnel analyses on realistic data
- HIPAA-compliant data sharing — replace real patient exports with statistically equivalent synthetic data for vendor integration testing
- Appointment scheduling algorithm testing — validate your optimization logic against thousands of appointments across multiple specialties
- Medical billing integration testing — generate complete patient-appointment-billing pipelines without exposing real insurance information
Advanced: patient volume curves
Model seasonal appointment patterns — flu season spikes, elective surgery drops in summer:
import misata
import pandas as pd

tables = misata.generate(
    "Hospital with 2k patients — flu season surge November through February, "
    "elective surgery dip in August, steady growth overall",
    rows=2000,
    seed=42,
)

# Monthly appointment volume follows the seasonal pattern
appts = tables["appointments"].copy()
appts["month"] = pd.to_datetime(appts["scheduled_at"]).dt.month
print(appts.groupby("month").size())
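The monthly counts can also be reduced to two numbers that say whether the requested shape is present: the uplift in flu-season months and the quietest month. The sketch below runs on invented stand-in counts, not Misata output, so it is self-contained. The same last few lines should work on a real `appointments` table with a `scheduled_at` column.

```python
import pandas as pd

# Stand-in monthly counts shaped like the prompt above (assumption: these
# numbers are invented for illustration, not produced by Misata).
monthly_counts = {1: 130, 2: 125, 3: 100, 4: 98, 5: 97, 6: 95,
                  7: 92, 8: 80, 9: 99, 10: 104, 11: 120, 12: 135}
scheduled = [f"2024-{m:02d}-15" for m, n in monthly_counts.items() for _ in range(n)]
appts = pd.DataFrame({"scheduled_at": scheduled})

appts["month"] = pd.to_datetime(appts["scheduled_at"]).dt.month
volume = appts.groupby("month").size()

# Compare average volume in flu-season months against the rest of the year.
flu_months = [11, 12, 1, 2]
flu_mean = volume.loc[flu_months].mean()
rest_mean = volume.drop(flu_months).mean()

print(f"flu-season uplift: {flu_mean / rest_mean:.2f}x")
print("quietest month:", volume.idxmin())
```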
Advanced: locale-aware generation
# Indian hospital — Indian names, regional diagnoses, INR billing
tables = misata.generate("Indian multi-specialty hospital with 1k patients", rows=1000)
# German clinic — German names, German insurance providers
tables = misata.generate("German private clinic with 300 patients", rows=300)
Advanced: quality-guaranteed generation
tables = misata.generate(
    "Hospital with 1000 patients",
    min_quality_score=85,
    smart_correlations=True,  # auto-correlates age↔diagnosis frequency
    rows=1000,
    seed=42,
)
HIPAA-safe by design
All patient data is generated, never sampled from real records:
- Names are generated from locale-appropriate name distributions
- Dates of birth are statistically derived, not from real people
- Diagnoses are drawn from ICD-10 category distributions, not real patient charts
- Insurance provider names are synthetic
Safe for development, staging, ML training, and vendor demos without a BAA or privacy review.