Plain-English Generation

The fastest path to multi-table synthetic data: write one sentence, get back a dict of DataFrames with referential integrity, realistic distributions, and locale-accurate values.

import misata

tables = misata.generate(
    "A fintech startup with 10k customers, 3% fraud rate, and IBAN accounts",
    rows=10_000,
    seed=42,
)
# Returns: {"customers": DataFrame, "accounts": DataFrame, "transactions": DataFrame}

What the parser extracts

Misata's StoryParser reads the story and infers four things before a single row is generated:

| Signal | Example phrase | What happens |
|---|---|---|
| Domain | "fintech", "saas", "ecommerce" | Selects the domain schema (tables, columns, FK relationships) |
| Scale | "10k customers", "500 employees" | Sets row counts; child tables scale proportionally |
| Locale | "German company", "Brazilian fintech" | Applies country-accurate names, salaries, national IDs, phone prefixes |
| Growth curves | "MRR from $50k in Jan to $200k in Dec" | Shapes numeric distributions to match exact monthly targets |

Preview before generating

Use preview() to confirm what Misata understood before committing to a large generation:

import misata

report = misata.preview(
    "A SaaS company with 5k users, MRR from $50k in Jan to $200k in Dec"
)

print(report.domain)            # "saas"
print(report.domain_confidence) # "high"
print(report.matched_keywords)  # ["saas", "mrr"]
print(report.scale_params)      # {"users": 5000}
print(report.locale)            # None (no locale detected)
print(report.table_preview)
# [{"name": "users", "rows": 5000, "columns": 12},
#  {"name": "subscriptions", "rows": 5000, "columns": 8}]
print(report.warnings)          # [] — clean detection

print(report.summary())
# ✓ Domain: saas  [high]  matched: saas, mrr
# ✓ Scale: users=5,000
# ✓ Events: 2 detected
#
#   Will generate 2 table(s), 10,000 total rows:
#     users          5,000 rows  (12 columns)
#     subscriptions  5,000 rows  (8 columns)

preview() calls no generators and produces no data — it is pure inspection.

DetectionReport fields

| Field | Type | Description |
|---|---|---|
| domain | str \| None | Detected domain code or None |
| domain_confidence | str | "high" (≥2 keywords), "low" (1 keyword), "none" |
| matched_keywords | list[str] | Keywords from the winning domain that appeared in the story |
| near_misses | dict[str, list[str]] | Other domains whose keywords also appeared |
| scale_params | dict[str, int] | Parsed numeric scale signals |
| temporal_events | list[dict] | Growth, churn, crash events detected |
| locale | str \| None | Auto-detected locale code (e.g. "de_DE") |
| table_preview | list[dict] | [{name, rows, columns}] for every table |
| total_rows | int | Sum of all table row counts |
| warnings | list[str] | Fallback / ambiguity warnings |
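
The domain_confidence thresholds above reduce to a count of matched keywords. A minimal sketch of that mapping follows; `confidence_label` is a hypothetical helper written for illustration and is not part of Misata's API.

```python
def confidence_label(matched_keywords):
    """Map a keyword-hit count to the confidence labels listed above."""
    hits = len(matched_keywords)
    if hits >= 2:
        return "high"   # two or more keywords matched
    if hits == 1:
        return "low"    # exactly one keyword matched
    return "none"       # nothing matched


print(confidence_label(["saas", "mrr"]))  # high
```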

Domain detection — how it scores

Detection is scored, not first-match. For each domain:

  • +5 if the literal domain name appears in the story (e.g. "fintech" → fintech domain gets +5)
  • +1 per matched keyword

The highest-scoring domain wins. This means "a fintech company with churn" correctly detects as fintech even though "churn" is a SaaS keyword — "fintech" earns +5 and beats the single SaaS keyword hit.

If a story is ambiguous, the near_misses field tells you which other domains also matched.

report = misata.preview("A fintech company with crypto wallets and 5k users")
print(report.domain)        # "fintech"  (+5 for "fintech" literal)
print(report.near_misses)   # {"crypto": ["crypto", "wallet"]}
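
The scoring rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not Misata's implementation: `DOMAIN_KEYWORDS` is a hypothetical table abbreviated from this guide's domain list, `score_domains` is a made-up helper, keywords are matched at the start of a word, and the literal domain name is assumed to count both for the +5 bonus and as an ordinary keyword. Tie-breaking is not covered.

```python
import re

# Hypothetical keyword table, abbreviated from the domain list in this guide.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "arr", "churn"],
    "fintech": ["fintech", "payments", "banking", "fraud", "wallet"],
    "crypto": ["crypto", "blockchain", "ethereum", "defi", "wallet"],
}


def score_domains(story):
    """Return {domain: (score, matched_keywords)} for every domain with a hit."""
    text = story.lower()
    results = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        # Match at a word start so "arr" does not fire inside "carry".
        matched = [kw for kw in keywords if re.search(r"\b" + re.escape(kw), text)]
        score = len(matched)                  # +1 per matched keyword
        if re.search(r"\b" + re.escape(domain), text):
            score += 5                        # +5 for the literal domain name
        if score:
            results[domain] = (score, matched)
    return results


scores = score_domains("a fintech company with churn")
winner = max(scores, key=lambda d: scores[d][0])
near_misses = {d: kws for d, (_, kws) in scores.items() if d != winner}
print(winner, scores[winner][0])  # fintech 6
print(near_misses)                # {'saas': ['churn']}
```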

Disambiguation tip

Name the domain explicitly. The +5 literal-name bonus outweighs scattered keyword hits from other domains:

# Ambiguous
misata.generate("A platform with subscription payments and crypto wallets")

# Unambiguous — fintech wins because the word "fintech" scores +5
misata.generate("A fintech platform with subscription payments and crypto wallets")

Scale extraction

All of these forms are recognised:

1000 users       → users: 1000
5k users         → users: 5000
1.5M customers   → users: 1500000
200 employees    → users: 200
500 doctors      → users: 500
10k orders       → orders: 10000
50k transactions → transactions: 50000

Child tables scale proportionally based on the domain's FK cardinality ratios. A SaaS company with 5k users automatically produces ~5k subscriptions and ~20k invoices (4× ratio).
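
A rough sketch of both steps, parsing a scale phrase and scaling child tables, is shown below. The regular expression, `extract_scale`, and `scale_children` are assumptions made for illustration; the ratios come from the SaaS example above. Unlike the listing above, the sketch keeps the noun as written rather than mapping customers, employees, or doctors to `users`.

```python
import re

MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

# A number, an optional k/M suffix, then the entity noun.
# The lookbehind skips money amounts such as "$50k".
SCALE_RE = re.compile(r"(?<![$\w.])(\d+(?:\.\d+)?)([km]?)\s+([a-z_]+)", re.IGNORECASE)


def extract_scale(story):
    """Return {noun: row_count} for every '<number><suffix> <noun>' phrase."""
    return {
        noun.lower(): round(float(number) * MULTIPLIERS[suffix.lower()])
        for number, suffix, noun in SCALE_RE.findall(story)
    }


def scale_children(parent_rows, ratios):
    """Scale child tables from the parent row count by a per-table ratio."""
    return {table: round(parent_rows * ratio) for table, ratio in ratios.items()}


print(extract_scale("A SaaS company with 5k users and 1.5M events"))
# {'users': 5000, 'events': 1500000}
print(scale_children(5000, {"subscriptions": 1, "invoices": 4}))
# {'subscriptions': 5000, 'invoices': 20000}
```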


Narrative growth curves

This is Misata's core differentiator: natural language maps to exact per-month targets that shape the generated data. Specify them in any order; Misata interpolates between control points.

Monthly anchors

# From–to with interpolation
misata.generate("SaaS company — MRR from $50k in January to $200k in December")

# Multiple control points
misata.generate("SaaS mrr $50k in Jan, $90k in June, $200k in December")

# Mixed: anchors + qualitative modifiers
misata.generate("SaaS mrr $50k in Jan, peak in November, $200k in Dec")
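
To make the interpolation between control points concrete, here is a minimal sketch that assumes linear interpolation. The guide does not say which scheme Misata uses, and `interpolate_monthly` is a hypothetical helper rather than a Misata function.

```python
def interpolate_monthly(anchors):
    """Linearly interpolate 12 monthly targets from {month: value} anchors.

    Months before the first anchor or after the last one hold the nearest
    anchor value.
    """
    points = sorted(anchors.items())
    targets = []
    for month in range(1, 13):
        if month <= points[0][0]:
            targets.append(float(points[0][1]))
        elif month >= points[-1][0]:
            targets.append(float(points[-1][1]))
        else:
            # Find the pair of anchors that brackets this month.
            for (m0, v0), (m1, v1) in zip(points, points[1:]):
                if m0 <= month <= m1:
                    t = (month - m0) / (m1 - m0)
                    targets.append(v0 + t * (v1 - v0))
                    break
    return targets


# "$50k in Jan, $90k in June, $200k in December"
targets = interpolate_monthly({1: 50_000, 6: 90_000, 12: 200_000})
print([round(v) for v in targets])
# [50000, 58000, 66000, 74000, 82000, 90000, 108333, 126667, 145000, 163333, 181667, 200000]
```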

Quarterly patterns

Quarter keywords expand to all three constituent months:

# "Q4 spike" → months 10, 11, 12 all boosted by 1.3×
misata.generate("Ecommerce orders — Q4 spike, Q1 slump")

# "strong Q4" → months 10, 11, 12 lifted by 1.15×
misata.generate("SaaS revenue — strong Q4, flat Q2")

# Quarter-level anchors
misata.generate("SaaS mrr — $100k in Q1, $150k in Q2, $200k in Q3, $250k in Q4")

| Pattern | Months affected | Factor |
|---|---|---|
| Q1 dip / slump | Jan, Feb, Mar | 0.7× |
| Q2 flat | Apr, May, Jun | 1.0× |
| Q3 peak / spike | Jul, Aug, Sep | 1.25–1.3× |
| Q4 push / strong | Oct, Nov, Dec | 1.15–1.2× |
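
The quarter expansion can be pictured as a lookup from quarter to months. The sketch below is illustrative only; `QUARTER_MONTHS` and `quarter_factors` are hypothetical names, and months without a pattern are assumed to keep a neutral factor of 1.0.

```python
QUARTER_MONTHS = {"q1": (1, 2, 3), "q2": (4, 5, 6), "q3": (7, 8, 9), "q4": (10, 11, 12)}


def quarter_factors(patterns):
    """Expand {quarter: factor} into {month: factor}; other months stay at 1.0."""
    factors = {month: 1.0 for month in range(1, 13)}
    for quarter, factor in patterns.items():
        for month in QUARTER_MONTHS[quarter.lower()]:
            factors[month] = factor
    return factors


# "Q4 spike, Q1 slump" with the factors from the table above
factors = quarter_factors({"Q4": 1.3, "Q1": 0.7})
print(factors[2], factors[5], factors[11])  # 0.7 1.0 1.3
```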

Named seasonal events

misata.generate("Ecommerce orders — Black Friday spike, Christmas peak")
misata.generate("EdTech enrollments — back to school surge")
misata.generate("SaaS signups — New Year spike, summer slump")

| Event phrase | Month | Factor |
|---|---|---|
| Black Friday | November | 1.55× |
| Cyber Monday / Cyber Week | November | 1.4–1.45× |
| Christmas / Xmas | December | 1.4× |
| Holiday season / Festive season | December | 1.3–1.35× |
| New Year | January | 1.25× |
| Valentine | February | 1.2× |
| Tax season | April | 1.2× |
| Back to school | August | 1.2× |
| Summer slump / Slow summer | July + August | 0.75× each |
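
This guide's Tips note that named events and quarter patterns stack multiplicatively. A small sketch of what that implies for a flat baseline follows. `stack_factors` and the baseline of 1,000 orders are assumptions for illustration, and the factors are copied from the tables in this guide.

```python
from math import prod

# Factors taken from the tables in this guide; month numbers are 1-12.
Q4_SPIKE = {10: 1.3, 11: 1.3, 12: 1.3}
BLACK_FRIDAY = {11: 1.55}
CHRISTMAS = {12: 1.4}


def stack_factors(*layers):
    """Multiply per-month factors from every layer; missing months count as 1.0."""
    return {
        month: prod(layer.get(month, 1.0) for layer in layers)
        for month in range(1, 13)
    }


combined = stack_factors(Q4_SPIKE, BLACK_FRIDAY, CHRISTMAS)
baseline = 1_000  # hypothetical flat monthly order count
orders = {month: round(baseline * factor) for month, factor in combined.items()}
print(orders[9], orders[10], orders[11], orders[12])  # 1000 1300 2015 1820
```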

Relative multipliers

When you know the end-state but not the absolute numbers, use a multiplier:

# Pure multiplier — Misata derives a sensible baseline and scales it
misata.generate("SaaS startup — MRR 10x growth over the year")
misata.generate("Fintech transaction volume doubled over the year")
misata.generate("Ecommerce GMV tripled in one year")

# Multiplier + one anchor — uses the anchor as the pivot
# Jan is pinned at $50k; Dec is derived as $100k (2× Jan)
misata.generate("SaaS mrr $50k in January, doubled by December")

# Halved (decline story)
misata.generate("SaaS revenue halved after the pivot")

| Word form | Factor |
|---|---|
| halved | 0.5× |
| doubled / 2x | 2× |
| tripled / 3x | 3× |
| quadrupled / 4x | 4× |
| 5x / 10x | 5× / 10× |
| grew 300% | 4× (1 + 3.0) |
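
The word-to-factor mapping above, including the percentage rule (grew 300% means 1 + 3.0 = 4×), can be sketched as follows. `growth_factor` and its regular expressions are hypothetical and only cover the forms listed in the table.

```python
import re

WORD_FACTORS = {"halved": 0.5, "doubled": 2.0, "tripled": 3.0, "quadrupled": 4.0}


def growth_factor(phrase):
    """Return the multiplier implied by a growth phrase, or None if none is found."""
    text = phrase.lower()
    for word, factor in WORD_FACTORS.items():
        if word in text:
            return factor
    times = re.search(r"\b(\d+(?:\.\d+)?)x\b", text)       # "10x"
    if times:
        return float(times.group(1))
    percent = re.search(r"grew\s+(\d+(?:\.\d+)?)\s*%", text)  # "grew 300%"
    if percent:
        return 1 + float(percent.group(1)) / 100
    return None


print(growth_factor("MRR 10x growth over the year"))  # 10.0
print(growth_factor("revenue grew 300%"))             # 4.0

# Multiplier + one anchor: January pinned at $50k, December derived from it.
january = 50_000
december = january * growth_factor("doubled by December")
print(december)  # 100000.0
```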

Qualitative month modifiers

misata.generate("SaaS mrr — dip in March, peak in November")
misata.generate("Ecommerce orders — slump in January, boom in December")

| Keyword | Factor |
|---|---|
| crash | 0.5× |
| dip / drop / slump | 0.7–0.72× |
| decline | 0.75× |
| slow / low | 0.8× |
| flat | 1.0× |
| strong / push | 1.15–1.2× |
| high | 1.2× |
| peak | 1.25× |
| boom / spike / surge | 1.3× |
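
One way to read "keyword in month" phrases is a regular expression over the keywords and month names. The sketch below is an assumption about the approach, not Misata's parser: `month_modifiers` is hypothetical, and where the table gives a range (for example 0.7–0.72×) a single value from that range is picked arbitrarily.

```python
import re

# Factors from the table above; where the table gives a range, one value is picked.
KEYWORD_FACTORS = {
    "crash": 0.5, "dip": 0.7, "drop": 0.7, "slump": 0.7, "decline": 0.75,
    "slow": 0.8, "low": 0.8, "flat": 1.0, "strong": 1.15, "push": 1.2,
    "high": 1.2, "peak": 1.25, "boom": 1.3, "spike": 1.3, "surge": 1.3,
}
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MODIFIER_RE = re.compile(
    r"\b(" + "|".join(KEYWORD_FACTORS) + r")\s+in\s+(" + "|".join(MONTHS) + r")",
    re.IGNORECASE,
)


def month_modifiers(story):
    """Return {month_number: factor} for each '<keyword> in <month>' phrase."""
    return {
        MONTHS.index(month.lower()) + 1: KEYWORD_FACTORS[keyword.lower()]
        for keyword, month in MODIFIER_RE.findall(story)
    }


print(month_modifiers("SaaS mrr, dip in March, peak in November"))
# {3: 0.7, 11: 1.25}
```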

Trigger tokens

A curve is only built when the story contains at least one of these signal words:

revenue, sales, mrr, arr, gmv, amount, orders, bookings, transactions, volume, churn, growth, peak, dip, spike, surge, drop, decline, slump, boom, doubled, tripled, halved, black friday, christmas, summer slump, q1, q2, q3, q4
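
A story can be checked against this list before relying on a curve. The sketch below assumes whole-word, case-insensitive matching, which the guide does not specify; `has_curve_trigger` is a hypothetical helper.

```python
import re

TRIGGER_TOKENS = [
    "revenue", "sales", "mrr", "arr", "gmv", "amount", "orders", "bookings",
    "transactions", "volume", "churn", "growth", "peak", "dip", "spike",
    "surge", "drop", "decline", "slump", "boom", "doubled", "tripled",
    "halved", "black friday", "christmas", "summer slump",
    "q1", "q2", "q3", "q4",
]
TRIGGER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(tok) for tok in TRIGGER_TOKENS) + r")\b",
    re.IGNORECASE,
)


def has_curve_trigger(story):
    """True when the story contains at least one whole-word trigger token."""
    return TRIGGER_RE.search(story) is not None


print(has_curve_trigger("A SaaS company with 5k users"))            # False
print(has_curve_trigger("An ecommerce store, Black Friday spike"))  # True
```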


All 18 domains

| Domain | Trigger keywords | Tables |
|---|---|---|
| saas | saas, subscription, mrr, arr, churn | users, subscriptions, invoices |
| ecommerce | ecommerce, orders, store, retail, cart | customers, products, orders, order_items |
| fintech | fintech, payments, banking, fraud, wallet | customers, accounts, transactions |
| healthcare | healthcare, patients, doctors, clinic, hospital | doctors, patients, appointments |
| marketplace | marketplace, sellers, buyers, listings, freelance | sellers, buyers, listings, orders |
| logistics | logistics, shipping, drivers, fleet, routes | drivers, vehicles, routes, shipments |
| hr | hr, employees, payroll, workforce, headcount | departments, employees, payroll |
| social | social media, instagram, tiktok, followers, feed | users, posts, follows, reactions, comments |
| realestate | real estate, housing, mortgage, listings | agents, properties, transactions |
| pharma | pharma, clinical, trials, research | researchers, projects, trials, timesheets |
| fooddelivery | food delivery, restaurants, takeout, doordash | restaurants, customers, couriers, orders, order_items |
| edtech | edtech, courses, students, enrollments, lms | instructors, courses, students, enrollments, quiz_attempts |
| gaming | gaming, players, leaderboard, esports, matches | players, matches, sessions, achievements |
| crm | crm, contacts, deals, pipeline, salesforce | companies, contacts, deals, activities |
| crypto | crypto, blockchain, ethereum, defi, wallet | wallets, tokens, transactions, token_prices |
| insurance | insurance, policy, claims, premium | customers, policies, claims, payments |
| travel | travel, hotel, flights, bookings, airbnb | users, hotels, flights, bookings, reviews |
| streaming | streaming, netflix, subscribers, watch history | subscribers, content, watch_history, ratings |

Detailed domain reference with column listings →


Step-by-step: inspect then generate

import misata

# Step 1 — preview (zero rows generated)
report = misata.preview("A fintech with 5k customers, Black Friday spike", rows=5000)
if report.domain_confidence == "none":
    print("⚠ No domain detected")
    print(report.warnings)

# Step 2 — inspect full schema
schema = misata.parse("A fintech with 5k customers, Black Friday spike", rows=5000)
print(schema.summary())
# Tables: customers, accounts, transactions
# Outcome curves: 1 (transactions.amount, monthly)

# Step 3 — generate
tables = misata.generate_from_schema(schema, seed=42)
print(tables["transactions"].head())

Tips

Be explicit about scale: "5k users" is always clearer than "a medium-sized company".

Name the domain: "A fintech company with..." always wins over a story that only uses secondary keywords.

Combine anchors freely: Monthly anchors, quarter patterns, named events, and multipliers can all appear in the same story. Named events and quarter patterns stack multiplicatively.

Use seed for reproducibility: Same seed + same story = byte-identical output every time.

Switch to LLM for open-ended stories: If your story doesn't fit any of the 18 domains, LLMSchemaGenerator can interpret it using a large language model:

import misata
from misata import LLMSchemaGenerator

# Build a schema with an LLM, then generate tables from it
gen = LLMSchemaGenerator(provider="groq")
schema = gen.generate_from_story("A B2B API platform with rate limits and invoicing")
tables = misata.generate_from_schema(schema)