Multi-Table Synthetic Data in Python
Most synthetic data tools generate rows one table at a time. That breaks the moment you need two tables to reference each other. Misata is designed specifically for multi-table relational generation.
The problem with single-table generators
# Faker approach — you write FK logic by hand
import random
from faker import Faker
fake = Faker()
customer_ids = list(range(1, 1001))
customers = [{"id": i, "email": fake.email()} for i in customer_ids]
# Nothing stops order.customer_id from referencing a non-existent customer
orders = [{"id": j, "customer_id": random.randint(1, 10000)} for j in range(5000)]
# ^ broken — 10x the valid range, no referential integrity enforced
Every relationship is your problem. Misata handles this automatically.
One-liner multi-table generation
import misata
# Story → multiple DataFrames with guaranteed FK integrity
tables = misata.generate("An ecommerce store with 1000 customers and orders.", seed=42)
customers = tables["customers"] # 1,000 rows
orders = tables["orders"] # 5,000+ rows
# FK integrity is automatic — no orphans
assert (~orders["customer_id"].isin(customers["customer_id"])).sum() == 0
Three-table chains
tables = misata.generate(
"A fintech company with 2000 customers and banking transactions.",
seed=42,
)
# customers → accounts → transactions (two FK hops)
customers = tables["customers"] # 2,000 rows
accounts = tables["accounts"] # ~4,000 rows
transactions = tables["transactions"] # ~20,000 rows
# Both FK edges hold
assert (~accounts["customer_id"].isin(customers["customer_id"])).sum() == 0
assert (~transactions["account_id"].isin(accounts["account_id"])).sum() == 0
How it works
Misata generates tables in topological dependency order:
- Parent tables are generated first (e.g.
customers). - Primary key pools are collected.
- Child tables sample FK columns from the parent pool — every value is valid by construction.
- The process repeats depth-first down the relationship graph.
Circular dependencies are detected before generation starts and raise a clear error.
1M-row relational dataset
tables = misata.generate(
"A large retail company with 50000 customers, 5000 products, and 1 million orders.",
seed=42,
)
# Generates in ~2 seconds
# regions: 10 rows
# categories: 20 rows
# customers: 50,000 rows
# products: 5,000 rows
# orders: 1,000,000 rows
# All FK edges intact
Inspecting the schema first
schema = misata.parse("A hospital with 500 patients and doctors.")
print(schema.summary())
# Schema: HealthcareDataset
# Domain: healthcare
# Tables (3)
# doctors 25 rows [doctor_id, name, specialty, department]
# patients 500 rows [patient_id, name, age, blood_type, ...]
# appointments 1500 rows [appointment_id, patient_id, doctor_id, type, ...]
# Relationships (2)
# patients.patient_id → appointments.patient_id
# doctors.doctor_id → appointments.doctor_id
Building a schema manually
When you need precise control over every column:
from misata import SchemaConfig, Table, Column, Relationship, DataSimulator
import pandas as pd
config = SchemaConfig(
name="Retail Dataset",
seed=42,
tables=[
Table(name="customers", row_count=1_000),
Table(name="orders", row_count=5_000),
Table(name="order_items", row_count=15_000),
],
columns={
"customers": [
Column(name="customer_id", type="int",
distribution_params={"distribution": "uniform", "min": 1, "max": 1000},
unique=True),
Column(name="email", type="text",
distribution_params={"text_type": "email"}),
Column(name="signup_date", type="date",
distribution_params={"start": "2022-01-01", "end": "2024-12-31"}),
],
"orders": [
Column(name="order_id", type="int",
distribution_params={"distribution": "uniform", "min": 1, "max": 5000},
unique=True),
Column(name="customer_id", type="foreign_key", distribution_params={}),
Column(name="order_date", type="date",
distribution_params={"start": "2023-01-01", "end": "2024-12-31"}),
Column(name="total", type="float",
distribution_params={"distribution": "lognormal", "mean": 4.5, "std": 0.8}),
],
"order_items": [
Column(name="item_id", type="int",
distribution_params={"distribution": "uniform", "min": 1, "max": 15000},
unique=True),
Column(name="order_id", type="foreign_key", distribution_params={}),
Column(name="quantity", type="int",
distribution_params={"distribution": "uniform", "min": 1, "max": 5}),
Column(name="unit_price", type="float",
distribution_params={"distribution": "lognormal", "mean": 3.5, "std": 0.6}),
],
},
relationships=[
Relationship(parent_table="customers", child_table="orders",
parent_key="customer_id", child_key="customer_id"),
Relationship(parent_table="orders", child_table="order_items",
parent_key="order_id", child_key="order_id"),
],
)
sim = DataSimulator(config)
tables = {name: pd.concat([tables.get(name, pd.DataFrame()), batch], ignore_index=True)
if name in tables else batch
for name, batch in sim.generate_all()}
Performance
| Dataset size | Tables | Generation time |
|---|---|---|
| 10,000 rows | 2 | < 0.1 s |
| 100,000 rows | 3 | < 0.5 s |
| 1,000,000 rows | 5 | ~1.5 s |