Mimic Mode
Mimic mode takes a real CSV file (or DataFrame), analyzes every column's statistical fingerprint, and produces a fresh synthetic dataset that matches the original's structure — without reusing a single real value.
It's the fastest path from "I have sensitive production data" to "I have safe, shareable synthetic data".
Quickstart
import misata
# From a CSV file
tables = misata.mimic("customers.csv")
# Scale to a different size
tables = misata.mimic("customers.csv", rows=100_000)
# From a DataFrame you already have
import pandas as pd
df = pd.read_csv("orders.csv")
synthetic = misata.mimic(df, rows=50_000)
CLI
# Same number of rows as the source
misata mimic customers.csv
# Scale up to 100 k rows, write to ./synthetic/
misata mimic orders.csv --rows 100000 --output ./synthetic
# Reproducible output
misata mimic data.csv --seed 42
Multi-table
Pass a list of files. Each becomes its own synthetic table, named after the file stem:
tables = misata.mimic(["customers.csv", "orders.csv", "products.csv"])
# tables["customers"], tables["orders"], tables["products"]
How it works
For each column, Misata's DataProfiler runs a five-step analysis:
| Step | What it does |
|---|---|
| Type detection | Identifies boolean, date, integer, float, or text |
| Distribution fitting | Fits lognormal (right-skewed), normal, or uniform — whichever matches the data |
| Cardinality check | Low-cardinality columns become categoricals with real frequency weights |
| Semantic inference | Detects email, name, city, country, latitude, URL, phone, etc. |
| Range capture | Records min/max for numerics, start/end for dates |
None of the original values are stored or emitted. The profiler only retains statistical parameters.
Distribution fitting logic
if all values > 0 AND skew > 1.0:
→ lognormal (mu, sigma from log-space moments)
elif cardinality < 5% of rows AND unique values < 200:
→ categorical with observed frequencies
else:
→ normal (mean, std)
Supported column types
| Source column | What Misata generates |
|---|---|
email |
Realistic email addresses (never real ones) |
first_name / last_name |
Diverse synthetic names |
city, country, state |
Real place names from vocab |
latitude / longitude |
Coordinates clustered around real cities |
postal_code / zip |
Format-correct postal codes |
| Dates | Dates within the observed range |
| Booleans | Same true/false ratio |
| Low-cardinality text | Category distribution preserved |
| High-cardinality text | Semantic type inferred and regenerated |
Reproducibility
# Same seed → identical output every run
tables = misata.mimic("customers.csv", rows=10_000, seed=42)
When to use mimic vs. generate
| Situation | Use |
|---|---|
| You have a real schema to copy | mimic() |
| You want to describe a dataset from scratch | generate() |
| You need relational FK integrity across tables | generate() with relationships |
| You need to match an existing DB's distributions | mimic() with multiple CSVs |