Quick Start
Install
pip install misata
Optional extras:
pip install "misata[mcp]" # MCP server — use Misata from Claude / Cursor
pip install "misata[llm]" # Groq / OpenAI / Claude / Gemini / Ollama schema generation
pip install "misata[documents]" # PDF output via weasyprint
pip install "misata[advanced]" # SDV/CTGAN statistical synthesis
Your first dataset
Write one sentence and get back a dict[str, pd.DataFrame] of tables with referential integrity, realistic distributions, and locale-accurate values.
import misata
tables = misata.generate(
    "A SaaS company with 5k users, monthly subscriptions, and 20% churn",
    seed=42,
)
for name, df in tables.items():
    print(f"{name}: {len(df):,} rows")
    print(df.head(3))
    print()
# users: 5,000 rows
# subscriptions: 5,000 rows
# invoices: 20,000 rows
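To spot-check the referential-integrity claim on your own output, a small helper can look for child keys that have no matching parent. This is a minimal sketch: orphaned_keys() is a hypothetical helper, not part of Misata, and the toy lists stand in for key columns of two generated tables.

```python
from typing import Hashable, Iterable


def orphaned_keys(child_keys: Iterable[Hashable],
                  parent_keys: Iterable[Hashable]) -> set:
    """Return child foreign-key values that have no matching parent key."""
    parents = set(parent_keys)
    return {key for key in child_keys if key not in parents}


# Toy data standing in for key columns of two generated tables.
user_ids = [1, 2, 3]
subscription_user_ids = [1, 1, 3, 4]  # 4 points at a user that does not exist

print(orphaned_keys(subscription_user_ids, user_ids))  # {4}
```

With real output you would pass the key columns, for example orphaned_keys(tables["subscriptions"]["user_id"], tables["users"]["id"]); those column names are assumptions about the generated schema. An empty set means every child row resolves to a parent.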
Preview before generating
Use preview() to confirm what Misata understood before committing to a large generation run. It calls no generators and produces no rows.
import misata
report = misata.preview(
    "A fintech startup with 10k customers, 3% fraud rate, and IBAN accounts"
)
print(report.domain) # "fintech"
print(report.domain_confidence) # "high"
print(report.matched_keywords) # ["fintech", "fraud"]
print(report.scale_params) # {"users": 10000}
print(report.locale) # None
print(report.table_preview)
# [{"name": "customers", "rows": 10000, "columns": 9},
# {"name": "accounts", "rows": 10000, "columns": 6},
# {"name": "transactions", "rows": 80000, "columns": 8}]
print(report.warnings) # [] — clean detection
print(report.summary())
# ✓ Domain: fintech [high] matched: fintech, fraud
# ✓ Scale: users=10,000
# ✓ Events: 0 detected
#
# Will generate 3 table(s), 100,000 total rows:
# customers 10,000 rows (9 columns)
# accounts 10,000 rows (6 columns)
# transactions 80,000 rows (8 columns)
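One way to use the report is as a gate in front of a large run. The sketch below is an assumption-laden illustration: ReportStub is a hypothetical stand-in that carries only three of the documented report fields (domain_confidence, total_rows, warnings), safe_to_generate() and its max_rows threshold are not part of Misata, and the warning text is made up.

```python
from dataclasses import dataclass, field


@dataclass
class ReportStub:
    """Hypothetical stand-in carrying three documented report fields."""
    domain_confidence: str
    total_rows: int
    warnings: list = field(default_factory=list)


def safe_to_generate(report, max_rows: int = 1_000_000) -> bool:
    """Proceed only on a confident, warning-free detection of bounded size."""
    return (
        report.domain_confidence == "high"
        and not report.warnings
        and report.total_rows <= max_rows
    )


clean = ReportStub(domain_confidence="high", total_rows=100_000)
vague = ReportStub(domain_confidence="low", total_rows=100_000,
                   warnings=["domain fallback"])  # made-up warning text

print(safe_to_generate(clean))  # True
print(safe_to_generate(vague))  # False
```

Because the function only reads attributes, the object returned by misata.preview() could be passed in place of the stub, assuming it exposes those fields as attributes the way the example above does.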
DetectionReport fields
| Field | Type | Description |
|---|---|---|
| domain | str \| None | Detected domain code or None |
| domain_confidence | str | "high" (≥2 keywords), "low" (1 keyword), "none" |
| matched_keywords | list[str] | Keywords that fired for the winning domain |
| near_misses | dict[str, list[str]] | Other domains whose keywords also appeared |
| scale_params | dict[str, int] | Parsed numeric scale signals |
| temporal_events | list[dict] | Growth, churn, crash events detected |
| locale | str \| None | Auto-detected locale code (e.g. "de_DE") |
| table_preview | list[dict] | [{name, rows, columns}] for each table |
| total_rows | int | Sum of all table row counts |
| warnings | list[str] | Fallback and ambiguity warnings |
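The confidence rule in the table (two or more keywords is "high", one is "low") can be pictured with a toy keyword matcher. This is an illustrative sketch only: detect_domain() is a hypothetical function, the whole-word regex matching is an assumption rather than Misata's actual detection logic, and the keyword lists are copied from three rows of the domain table further down.

```python
import re

# Trigger keywords copied from three rows of the domain table.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "mrr", "arr", "churn"],
    "fintech": ["fintech", "payments", "fraud"],
    "ecommerce": ["ecommerce", "orders", "retail"],
}


def detect_domain(story: str) -> dict:
    """Pick the domain with the most keyword hits and grade the confidence."""
    text = story.lower()
    hits = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        matched = [kw for kw in keywords
                   if re.search(r"\b" + re.escape(kw) + r"\b", text)]
        if matched:
            hits[domain] = matched
    if not hits:
        return {"domain": None, "domain_confidence": "none",
                "matched_keywords": [], "near_misses": {}}
    winner = max(hits, key=lambda d: len(hits[d]))
    return {
        "domain": winner,
        "domain_confidence": "high" if len(hits[winner]) >= 2 else "low",
        "matched_keywords": hits[winner],
        # Every other domain that matched at least one keyword.
        "near_misses": {d: kws for d, kws in hits.items() if d != winner},
    }


result = detect_domain(
    "A fintech startup with 10k customers, 3% fraud rate, and IBAN accounts"
)
print(result["domain"], result["domain_confidence"], result["matched_keywords"])
# fintech high ['fintech', 'fraud']
```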
Narrative growth curves
Describe a growth trajectory in plain English — Misata builds exact per-month targets and shapes generated data to match.
# Ecommerce with seasonal story
tables = misata.generate(
    "Ecommerce store with 10k customers — "
    "revenue from $200k in Jan to $350k in Sep, "
    "Black Friday spike, Christmas peak, Q1 slump after holidays",
    rows=10_000, seed=42
)
# SaaS hockey-stick
tables = misata.generate(
    "SaaS startup with 2k users — MRR $5k in January, 10x growth over the year, "
    "strong Q4 push",
    rows=2000, seed=42
)
# B2B with summer slump
tables = misata.generate(
    "B2B SaaS platform with 1k enterprise customers — "
    "ARR $500k in Jan, doubled by December, summer slump",
    rows=1000, seed=42
)
All five pattern types compose freely:
| Pattern type | Example phrase | What it does |
|---|---|---|
| Monthly anchor | "MRR $50k in January" | Pins exact value for that month |
| Quarter modifier | "Q4 spike" | Boosts Oct/Nov/Dec by 1.3× |
| Named event | "Black Friday spike" | November +1.55× |
| Multiplier | "doubled by December" | End value = 2× start |
| From–to | "from $50k to $200k" | Linear interpolation across the year |
Full narrative patterns reference →
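The arithmetic behind two of these patterns can be worked through by hand. The sketch below is an illustration under stated assumptions, not Misata's implementation: linear_targets() and apply_boosts() are hypothetical helpers, the from–to interpolation is assumed to run from January to December, and the Q4 boost is assumed to multiply the interpolated values. How Misata reconciles a pinned anchor with a boost on the same month is not specified here.

```python
def linear_targets(start: float, end: float, months: int = 12) -> list[float]:
    """Interpolate linearly from the first month's value to the last month's."""
    step = (end - start) / (months - 1)
    return [start + step * i for i in range(months)]


def apply_boosts(targets: list[float], boosts: dict[int, float]) -> list[float]:
    """Multiply selected months (1 = January) by their boost factor."""
    return [value * boosts.get(month, 1.0)
            for month, value in enumerate(targets, start=1)]


# "from $50k to $200k" across the year, then a "Q4 spike" of 1.3x.
base = linear_targets(50_000, 200_000)
shaped = apply_boosts(base, {10: 1.3, 11: 1.3, 12: 1.3})

print(round(base[0]), round(base[-1]))  # 50000 200000
print(round(shaped[-1]))                # 260000
```

A "doubled by December" phrase fits the same helper: linear_targets(start, 2 * start) gives an end value of twice the start.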
All 18 domains
Misata ships with schemas for 18 industry verticals:
| Domain | Trigger keywords | Tables |
|---|---|---|
| saas | saas, mrr, arr, churn | users, subscriptions, invoices |
| ecommerce | ecommerce, orders, retail | customers, products, orders, order_items |
| fintech | fintech, payments, fraud | customers, accounts, transactions |
| healthcare | healthcare, patients, clinic | doctors, patients, appointments |
| marketplace | marketplace, sellers, listings | sellers, buyers, listings, orders |
| logistics | logistics, shipping, fleet | drivers, vehicles, routes, shipments |
| hr | hr, employees, payroll | departments, employees, payroll |
| social | social media, followers, feed | users, posts, follows, reactions, comments |
| realestate | real estate, mortgage | agents, properties, transactions |
| pharma | pharma, clinical, trials | researchers, projects, trials, timesheets |
| fooddelivery | food delivery, restaurants | restaurants, customers, couriers, orders |
| edtech | edtech, courses, students | instructors, courses, students, enrollments |
| gaming | gaming, players, leaderboard | players, matches, sessions, achievements |
| crm | crm, contacts, deals | companies, contacts, deals, activities |
| crypto | crypto, blockchain, defi | wallets, tokens, transactions, token_prices |
| insurance | insurance, policy, claims | customers, policies, claims, payments |
| travel | travel, hotel, flights | users, hotels, flights, bookings, reviews |
| streaming | streaming, netflix, subscribers | subscribers, content, watch_history, ratings |
Full domain reference with column listings →
Inspect the schema first
schema = misata.parse("A fintech company with 10k customers")
print(schema.summary())
# Tables: customers, accounts, transactions
# Relationships: customers → accounts → transactions
# Rows: 10,000 / 25,000 / 80,000
tables = misata.generate_from_schema(schema, seed=42)
Locale support
Mention a country or nationality in the description to get locale-accurate names, addresses, national IDs, phone prefixes, and currency-appropriate values:
# German names, IBAN accounts, European address format
tables = misata.generate("A German fintech with 5k customers", seed=42)
# Brazilian locale — CPF national IDs, BRL salaries
tables = misata.generate("A Brazilian HR system with 200 employees", seed=42)
# UK healthcare — NHS numbers, British names
tables = misata.generate("A UK healthcare provider with 1k patients", seed=42)
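If you want to sanity-check locale-specific values such as IBANs, the standard mod-97 check-digit test is easy to run yourself. This is a minimal sketch: iban_is_valid() is a hypothetical helper, not a Misata function, and it verifies only the check digits, not country-specific lengths or bank codes.

```python
def iban_is_valid(iban: str) -> bool:
    """Check an IBAN's mod-97 check digits; per-country length is not checked."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Letters map to two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_is_valid("DE89 3704 0044 0532 0130 00"))  # True
print(iban_is_valid("DE88 3704 0044 0532 0130 00"))  # False
```

Applied to generated data, this could be mapped over whichever column holds IBANs; the column name depends on the schema and is not assumed here.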
Export
misata.to_csv(tables, "data/")
misata.to_parquet(tables, "data/")
misata.to_duckdb(tables, "data/dataset.duckdb")
misata.to_jsonl(tables, "data/")
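After exporting, a quick row count per file confirms that nothing was truncated on disk. The sketch below assumes, without confirmation from this page, that to_csv() writes one <table>.csv file per table with a header row; csv_row_counts() is a hypothetical helper, and the demo writes its own tiny file so it runs without Misata.

```python
import csv
import tempfile
from pathlib import Path


def csv_row_counts(directory: str) -> dict[str, int]:
    """Map each CSV file stem in a directory to its number of data rows."""
    counts = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            counts[path.stem] = sum(1 for _ in reader)
    return counts


# Self-contained demo: write a tiny CSV, then count its rows.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "users.csv").write_text("id,name\n1,Ada\n2,Grace\n",
                                      encoding="utf-8")
    print(csv_row_counts(tmp))  # {'users': 2}
```

Comparing the result against {name: len(df) for name, df in tables.items()} would flag any mismatch between memory and disk.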
CLI
# Generate from a story
misata generate --story "A marketplace with 10k sellers" --rows 10000
# Generate from YAML schema
misata init # scaffold misata.yaml
misata generate # reads misata.yaml automatically
# Profile a CSV file
misata validate customers.csv
# Preview what would be generated (no rows)
misata preview --story "A SaaS company with 5k users"
Use from an AI assistant
If you have misata[mcp] installed, wire it into Claude Desktop, Cursor, or Windsurf and describe datasets in plain English:
"Generate a fintech fraud dataset with 10k customers."
"Build me SaaS subscription data with MRR growing from $50k to $200k over the year."