Python Synthetic Data Generator

Misata is a Python synthetic data generator for teams who need more than random rows.

If you are searching for a library that can generate realistic multi-table test data or scenario-based business data from a plain-English description, this page explains what Misata does and how to get started.

Quick start (no config required)

pip install misata

# then, in Python:
import misata

# One line — story → dict of DataFrames
tables = misata.generate("A SaaS company with 5000 users and monthly subscriptions.", seed=42)

print(tables["users"].head())
#    user_id                 email           name signup_date
# 0        1  tricia23@example.com  Patricia Müller  2023-04-12
# ...

print(tables["subscriptions"].head())
#    subscription_id  user_id     plan     mrr     status  start_date
# 0                1        1  starter   49.00     active  2023-04-15
# ...

What Misata generates

Misata understands 7 business domains out of the box:

Domain       Example prompt                   Tables
SaaS         "5k users, 20% churn"            users, subscriptions, invoices
Ecommerce    "10k orders, seasonal peak"      customers, orders
Fintech      "2k customers, fraud detection"  customers, accounts, transactions
Healthcare   "500 patients and doctors"       doctors, patients, appointments
Marketplace  "sellers and buyers"             sellers, buyers, listings, transactions
Logistics    "1000 shipments across routes"   drivers, routes, shipments
Pharma       "clinical trials"                patients, trials, compounds, outcomes

Multi-table with referential integrity

Every foreign key value in a child table references an existing parent ID. This is enforced during generation rather than left to chance:

tables = misata.generate("A fintech company with 2000 customers and banking transactions.", seed=42)

customers    = tables["customers"]     # 2,000 rows
accounts     = tables["accounts"]      # ~4,000 rows
transactions = tables["transactions"]  # ~20,000 rows

# No orphan rows — FK integrity is automatic
orphans = (~transactions["account_id"].isin(accounts["account_id"])).sum()
assert orphans == 0  # always passes
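
One common way to get this guarantee is to draw every foreign key from the list of parent IDs that already exist, so an orphan cannot be produced in the first place. The following is a minimal standard-library sketch of that idea, not Misata's actual implementation; `make_children` and the miniature customers/accounts data are hypothetical.

```python
import random

def make_children(parent_ids, n_children, seed=0):
    """Build child rows whose foreign key is always drawn from parent_ids."""
    rng = random.Random(seed)
    return [
        {"child_id": i + 1, "parent_id": rng.choice(parent_ids)}
        for i in range(n_children)
    ]

# Hypothetical miniature of a customers -> accounts relationship
customer_ids = list(range(1, 11))           # 10 parents
accounts = make_children(customer_ids, 25)  # 25 children

# The same orphan check as above, written with plain sets
valid = set(customer_ids)
orphans = [row for row in accounts if row["parent_id"] not in valid]
print(len(orphans))  # 0, because every key was sampled from the parent list
```

Sampling from existing keys makes integrity a property of construction, so the check serves as a regression test rather than a filter.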

Inspect the schema before generating

schema = misata.parse("An ecommerce store with 10k orders")
print(schema.summary())
# Schema: EcommerceDataset
# Domain: ecommerce
# Tables (2)
#   customers  10,000 rows  [customer_id, email, name, signup_date]
#   orders     60,000 rows  [order_id, customer_id, order_date, amount, status]
# Relationships (1)
#   customers.customer_id → orders.customer_id

# Tweak, then generate
schema.seed = 42
tables = misata.generate_from_schema(schema)

Exact aggregate targets

Misata can pin monthly aggregates so that the generated rows sum exactly to the targets you describe:

schema = misata.parse(
    "A SaaS company with 1000 users. "
    "MRR rises from $50k in January to $200k in December with a dip in September.",
    rows=1000,
)
tables = misata.generate_from_schema(schema)

# All 12 monthly MRR targets hit exactly — to the cent
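
To illustrate how row-level values can be made to sum to a target exactly, here is a sketch using largest-remainder rounding in integer cents. It is an assumption about one workable technique, not a description of Misata's internals; `pin_to_target` and the sample parameters are hypothetical.

```python
import random

def pin_to_target(weights, target_cents):
    """Split target_cents across rows in proportion to integer weights.

    Uses largest-remainder rounding, so the result sums to target_cents exactly.
    """
    total = sum(weights)
    # Exact integer floor of each row's proportional share
    shares = [target_cents * w // total for w in weights]
    # Cents lost to flooring; always fewer than len(weights)
    shortfall = target_cents - sum(shares)
    # Give one extra cent to the rows with the largest remainders
    by_remainder = sorted(
        range(len(weights)),
        key=lambda i: target_cents * weights[i] % total,
        reverse=True,
    )
    for i in by_remainder[:shortfall]:
        shares[i] += 1
    return shares

rng = random.Random(42)
# Right-skewed raw amounts, stored as whole cents (at least 1 cent each)
raw = [max(1, round(rng.lognormvariate(3.5, 0.6) * 100)) for _ in range(1000)]
january = pin_to_target(raw, 50_000_00)  # $50k expressed in cents
print(sum(january) == 50_000_00)  # True
```

Working in integer cents avoids floating-point drift, which is what makes an exact equality check meaningful.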

Domain-realistic distributions

Misata ships calibrated priors so you don't have to configure them:

  • Credit scores — normal distribution centred on real FICO statistics (mean ≈ 680–720, std ≈ 75)
  • MRR — log-normal, because real SaaS revenue is right-skewed
  • Transaction types — Zipf distribution, because a few types typically account for most transactions
  • Blood types — exact ABO/Rh frequencies (O+ 38%, A+ 34%, …)
  • Monetary amounts — log-normal with realistic min/max bounds
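
For readers unfamiliar with these shapes, the sketch below samples a log-normal and a Zipf-like categorical using only the standard library. The parameters and the transaction type names are illustrative assumptions, not Misata's calibrated priors.

```python
import random
from collections import Counter

rng = random.Random(42)

# Log-normal amounts: most values are small, a few are large (right-skewed).
# mu=4.0 and sigma=0.8 are illustrative, not Misata's calibrated values.
amounts = [rng.lognormvariate(4.0, 0.8) for _ in range(10_000)]
mean = sum(amounts) / len(amounts)
median = sorted(amounts)[len(amounts) // 2]
print(mean > median)  # True for a right-skewed sample

# Zipf-like categorical: the weight of the k-th ranked type is 1 / k**s.
types = ["purchase", "transfer", "withdrawal", "refund", "fee"]  # hypothetical names
s = 1.2
weights = [1 / (rank ** s) for rank in range(1, len(types) + 1)]
draws = rng.choices(types, weights=weights, k=10_000)
print(Counter(draws).most_common(1)[0][0])  # purchase, the rank-1 type
```

A mean above the median is the usual signature of right skew, and the rank-1 category taking the largest share is what the Zipf weighting is meant to reproduce.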

LLM-powered generation (optional)

When the rule-based parser isn't specific enough, hand off to an LLM:

import misata
from misata import LLMSchemaGenerator

gen    = LLMSchemaGenerator(provider="groq")   # or "openai"
schema = gen.generate_from_story("A B2B marketplace with vendor tiers, SLA contracts, and quarterly invoices")
tables = misata.generate_from_schema(schema)

Requires GROQ_API_KEY or OPENAI_API_KEY. Retries automatically on rate limits.
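
If you call an LLM provider yourself and want similar behaviour, retrying on rate limits is usually done with exponential backoff. This is a generic sketch and makes no claim about how Misata implements its retries; `RateLimitError`, `with_backoff`, and `flaky` are hypothetical names.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever error a provider raises when rate-limited."""

def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call() and retry on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # 1s, 2s, 4s, ... plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))

# Demo: a call that is rate-limited twice, then succeeds
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("slow down")
    return "schema"

result = with_backoff(flaky, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, attempts["n"])  # schema 3
```

Passing `sleep` as a parameter keeps the helper testable without real delays.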

Database seeding

import misata
from misata import seed_database

tables = misata.generate("A SaaS company with 1000 users.", seed=42)
report = seed_database(tables, "postgresql://user:pass@localhost/mydb", create=True)
print(report.total_rows)  # 6,000+
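
When seeding a database with foreign key constraints by hand, parent tables have to be loaded before their children. The sketch below shows that ordering with an in-memory SQLite database and `graphlib` (Python 3.9+). It is an illustration under assumed table layouts, not a description of what `seed_database` does internally; the miniature tables and the `depends_on` mapping are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical miniature dataset: table name -> list of row tuples
tables = {
    "subscriptions": [(1, 1, "starter"), (2, 2, "pro")],
    "users": [(1, "a@example.com"), (2, "b@example.com")],
}
# Each table mapped to the set of parent tables it references
depends_on = {"users": set(), "subscriptions": {"users"}}

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE subscriptions ("
    "subscription_id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(user_id), "
    "plan TEXT)"
)

# static_order() yields parents before the tables that depend on them
load_order = list(TopologicalSorter(depends_on).static_order())
for name in load_order:
    rows = tables[name]
    placeholders = ", ".join("?" for _ in rows[0])
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
conn.commit()

print(load_order)  # ['users', 'subscriptions']
```

Loading in dependency order means the constraints can stay enabled during the insert, so a bad key fails immediately instead of surfacing later.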

Why Misata instead of Faker or SDV

  • vs Faker: Faker generates standalone fake values. Misata generates related tables that reference each other correctly, with business constraints and distribution control.
  • vs SDV: SDV requires real training data. Misata generates from scratch — no data, no model, no privacy risk.

See faker-vs-sdv-vs-misata.md for a full comparison with side-by-side code.