Generate Healthcare Synthetic Data in Python

Healthcare data is among the most sensitive data there is: HIPAA, GDPR, and other regulations govern who can access real patient records. Yet developers building EHR systems, researchers training clinical ML models, and teams building healthcare analytics tools all need patient data that behaves like the real thing. Misata generates fully synthetic healthcare data: patients with blood types drawn from real-world frequencies, doctors with realistic specialty assignments, and appointments with realistic no-show rates and duration distributions.

No real patient records are ever used or exposed. Every name, date of birth, and diagnosis is generated from statistical priors — realistic enough to power your analytics queries, safe enough to share in any environment.

import misata

tables = misata.generate("A hospital with 500 patients and 50 doctors", rows=500, seed=42)
print(list(tables.keys()))   # ['doctors', 'patients', 'appointments']
print(tables["patients"][["blood_type", "diagnosis"]].head())

What Misata generates

Three tables: doctors, patients, and appointments. Each appointment references one patient and one doctor, and every reference resolves to an existing row, so referential integrity holds across all three tables. Patient demographics match real-world chronic-care population distributions.
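
One way to confirm that every foreign key in appointments has a matching parent row is a simple orphan check. The sketch below is illustrative only: `orphaned_keys` is a hypothetical helper, not part of Misata, and the small lists are hand-written stand-ins for the ID columns of generated tables.

```python
def orphaned_keys(child_keys, parent_keys):
    """Return the sorted child keys that have no matching parent key."""
    parents = set(parent_keys)
    return sorted({key for key in child_keys if key not in parents})


# Hand-written stand-ins for generated ID columns (not real Misata output).
patient_ids = [1, 2, 3]
doctor_ids = [10, 11]
appointment_patient_ids = [1, 1, 3, 2]
appointment_doctor_ids = [10, 11, 11, 10]

print(orphaned_keys(appointment_patient_ids, patient_ids))  # []
print(orphaned_keys(appointment_doctor_ids, doctor_ids))    # []
print(orphaned_keys([1, 4, 9, 4], patient_ids))             # [4, 9]
```

Because the helper accepts any iterable, pandas columns such as `tables["appointments"]["patient_id"]` and `tables["patients"]["patient_id"]` should work as arguments. An empty result means no orphaned rows.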

Tables and columns

Table	Key columns
`doctors`	`doctor_id`, `name`, `specialty`, `department`, `years_experience`, `rating`
`patients`	`patient_id`, `name`, `date_of_birth`, `blood_type`, `gender`, `diagnosis`, `insurance_provider`
`appointments`	`appointment_id`, `patient_id`, `doctor_id`, `scheduled_at`, `duration_minutes`, `status`, `notes`

Realistic distributions

Blood types follow real ABO/Rh frequencies rather than a uniform draw: O+ 37.4%, A+ 35.7%, B+ 8.5%, AB+ 3.4%, plus the negative variants
Patient ages are centered on a chronic-care population (μ=52, σ=18 years) rather than spread uniformly over 0–100
The no-show rate is ~15%, in line with published hospital no-show statistics
Doctor specialties are drawn from a realistic distribution: internal medicine, cardiology, orthopedics, general surgery, pediatrics, and more
Appointment duration is lognormal with a median of ~25 minutes: shorter for follow-ups, longer for new-patient visits
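
To make these priors concrete, here is a minimal standard-library sketch of sampling from distributions of this shape. It is not Misata's implementation: the function names, the lognormal shape parameter `sigma=0.5`, and the clipping of ages to 0–100 are assumptions made for illustration. Only μ=52, σ=18, and the ~25-minute median come from the list above. A lognormal distribution has median exp(mu), so a 25-minute median corresponds to mu = ln(25).

```python
import math
import random
import statistics


def sample_age(rng, mean=52.0, sd=18.0):
    """Draw an age from a normal prior, clipped to 0-100 (the clipping is an assumption)."""
    return min(100.0, max(0.0, rng.gauss(mean, sd)))


def sample_duration_minutes(rng, median=25.0, sigma=0.5):
    """Draw a lognormal duration; a lognormal's median is exp(mu), so mu = ln(median)."""
    return rng.lognormvariate(math.log(median), sigma)


rng = random.Random(42)  # fixed seed so repeated runs give the same draws
ages = [sample_age(rng) for _ in range(20_000)]
durations = [sample_duration_minutes(rng) for _ in range(20_000)]

print(round(statistics.mean(ages), 1))         # close to 52
print(round(statistics.median(durations), 1))  # close to 25
```

The lognormal's long right tail is what makes it a plausible fit for visit lengths: most draws cluster near the median, while a few run much longer.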

Quick start

import misata

tables = misata.generate("A hospital with 500 patients and 50 doctors", rows=500, seed=42)

# Blood type distribution matches real ABO/Rh frequencies
print(tables["patients"]["blood_type"].value_counts(normalize=True).head())
# O+     0.374
# A+     0.357
# B+     0.085
# ...

# Appointment status breakdown (normalize=True returns shares instead of raw counts)
print(tables["appointments"]["status"].value_counts(normalize=True))
# completed    0.72
# no_show      0.15
# cancelled    0.08
# scheduled    0.05
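
A quick way to quantify how closely a generated column tracks its target frequencies is to compare observed shares against the expected ones. The helper below, `max_share_gap`, is a hypothetical name used for illustration, and the sample list is hand-built rather than Misata output. Only the four positive blood types whose frequencies are quoted in this document are checked.

```python
from collections import Counter

# Target shares quoted in this document for the positive blood types.
EXPECTED_SHARES = {"O+": 0.374, "A+": 0.357, "B+": 0.085, "AB+": 0.034}


def max_share_gap(values, expected):
    """Return the largest absolute gap between observed and expected shares."""
    values = list(values)
    counts = Counter(values)
    total = len(values)
    return max(abs(counts[label] / total - share) for label, share in expected.items())


# Hand-built sample of 1,000 labels; "other" stands in for the negative types.
sample = ["O+"] * 380 + ["A+"] * 350 + ["B+"] * 90 + ["AB+"] * 30 + ["other"] * 150
print(round(max_share_gap(sample, EXPECTED_SHARES), 3))  # 0.007
```

With generated data, the same call can take `tables["patients"]["blood_type"]` as its first argument. Choose a tolerance that reflects your row count, since sampling noise shrinks roughly with the square root of the number of rows.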

Common use cases

EHR system development — populate a test database with patients, appointments, and doctor schedules before your healthcare app goes live
Clinical ML model training — generate training data for readmission prediction, no-show prediction, or diagnosis classification with realistic demographic distributions
Healthcare analytics dashboards — build utilization reports, specialty throughput charts, and appointment funnel analyses on realistic data
HIPAA-compliant data sharing — replace real patient exports with statistically equivalent synthetic data for vendor integration testing
Appointment scheduling algorithm testing — validate your optimization logic against thousands of appointments across multiple specialties
Medical billing integration testing — generate complete patient-appointment-billing pipelines without exposing real insurance information

Advanced: patient volume curves

Model seasonal appointment patterns — flu season spikes, elective surgery drops in summer:

tables = misata.generate(
    "Hospital with 2k patients — flu season surge November through February, "
    "elective surgery dip in August, steady growth overall",
    rows=2000,
    seed=42,
)

# Monthly appointment volume follows the seasonal pattern
import pandas as pd
appts = tables["appointments"].copy()
appts["month"] = pd.to_datetime(appts["scheduled_at"]).dt.month
print(appts.groupby("month").size())
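
To judge whether the seasonal shape is present without eyeballing raw counts, one option is a seasonal index: each month's volume divided by the average monthly volume. This is a standard-library sketch with a hypothetical helper name (`seasonal_index`) and hand-written ISO timestamps standing in for `scheduled_at` values. The timestamp format Misata actually emits is not specified here, so ISO formatting is an assumption.

```python
from collections import Counter
from datetime import datetime


def seasonal_index(timestamps):
    """Map month number -> (appointments in that month) / (mean appointments per month)."""
    counts = Counter(datetime.fromisoformat(ts).month for ts in timestamps)
    mean_per_month = sum(counts.values()) / 12  # months with no rows count as zero
    return {month: counts.get(month, 0) / mean_per_month for month in range(1, 13)}


# Hand-written stand-in data: 30 January rows, 6 August rows, 12 in every other month.
stamps = []
for month in range(1, 13):
    n = 30 if month == 1 else 6 if month == 8 else 12
    stamps += [f"2024-{month:02d}-15T09:00:00"] * n

index = seasonal_index(stamps)
print(round(index[1], 2), round(index[8], 2))  # 2.31 0.46
```

Values above 1 mark busier-than-average months and values below 1 mark quieter ones. For a pandas column, converting with `.astype(str)` first is one way to feed it in, assuming the resulting strings are ISO-formatted.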

Advanced: locale-aware generation

# Indian hospital — Indian names, regional diagnoses, INR billing
tables = misata.generate("Indian multi-specialty hospital with 1k patients", rows=1000)

# German clinic — German names, German insurance providers
tables = misata.generate("German private clinic with 300 patients", rows=300)

Advanced: quality-guaranteed generation

tables = misata.generate(
    "Hospital with 1000 patients",
    min_quality_score=85,
    smart_correlations=True,  # auto-correlates age↔diagnosis frequency
    rows=1000,
    seed=42,
)
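
If age and diagnosis are correlated as the `smart_correlations` comment suggests, average age should differ by diagnosis. The sketch below shows one simple way to inspect that using only the standard library. The helper name, the diagnosis labels, and the sample records are all made up for illustration. They are not Misata output, and nothing here describes how Misata computes its quality score.

```python
from collections import defaultdict
from statistics import mean


def mean_age_by_diagnosis(records):
    """Group (age, diagnosis) pairs and return the mean age for each diagnosis."""
    ages = defaultdict(list)
    for age, diagnosis in records:
        ages[diagnosis].append(age)
    return {diagnosis: mean(values) for diagnosis, values in ages.items()}


# Made-up records: (age in years, diagnosis label).
records = [
    (71, "hypertension"), (65, "hypertension"), (68, "hypertension"),
    (24, "asthma"), (31, "asthma"), (35, "asthma"),
]
result = mean_age_by_diagnosis(records)
print(result)  # hypertension averages 68, asthma averages 30
```

The `patients` table stores `date_of_birth` rather than age, so with generated data the age would need to be derived first. Near-identical means across diagnoses would suggest the correlation is not present.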

HIPAA-safe by design

All patient data is generated, never sampled from real records:

Names are generated from locale-appropriate name distributions
Dates of birth are statistically derived, not from real people
Diagnoses are drawn from ICD-10 category distributions, not real patient charts
Insurance provider names are synthetic

Safe for development, staging, ML training, and vendor demos without a BAA or privacy review.