Validate & Preview
Misata provides three inspection tools, none of which generates synthetic data:
| Tool | Purpose |
|---|---|
| preview() | Inspect what a plain-English story would generate before committing |
| validate_csv() | Profile a CSV for quality issues and null rates |
| validate_schema() | Check a misata.yaml against the JSON Schema |
preview() — Inspect story detection
preview() parses a plain-English story and returns everything Misata understood: domain, scale, locale, and the table layout — without generating a single row. Use it before any large generate() call to catch ambiguity or misconfiguration early.
import misata
report = misata.preview(
"A SaaS company with 5k users, MRR from $50k in Jan to $200k in Dec"
)
print(report.domain) # "saas"
print(report.domain_confidence) # "high"
print(report.matched_keywords) # ["saas", "mrr"]
print(report.scale_params) # {"users": 5000}
print(report.locale) # None
print(report.table_preview)
# [{"name": "users", "rows": 5000, "columns": 12},
# {"name": "subscriptions", "rows": 5000, "columns": 8},
# {"name": "invoices", "rows": 20000, "columns": 6}]
print(report.temporal_events) # [{"type": "growth", "value": None}]
print(report.warnings) # []
print(report.summary())
# ✓ Domain: saas [high] matched: saas, mrr
# ✓ Scale: users=5,000
# ✓ Events: 2 detected
#
# Will generate 3 table(s), 30,000 total rows:
# users 5,000 rows (12 columns)
# subscriptions 5,000 rows (8 columns)
# invoices 20,000 rows (6 columns)
DetectionReport reference
| Field | Type | Description |
|---|---|---|
| domain | str \| None | Detected domain code ("saas", "fintech", …) or None if no domain matched |
| domain_confidence | str | "high" if ≥2 keywords matched, "low" if 1 keyword, "none" if nothing matched |
| matched_keywords | list[str] | The specific keywords that fired for the winning domain |
| near_misses | dict[str, list[str]] | Other domains whose keywords also appeared; useful for diagnosing ambiguity |
| scale_params | dict[str, int] | Parsed numeric scale signals (e.g. {"users": 5000}) |
| temporal_events | list[dict] | Growth, churn, crash events detected (used to build outcome curves) |
| locale | str \| None | Auto-detected locale code ("de_DE", "pt_BR", …), or None |
| table_preview | list[dict] | [{name, rows, columns}] for each table that would be generated |
| total_rows | int | Sum of row counts across all tables |
| warnings | list[str] | Fallback and ambiguity warnings |
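As a quick illustration of how total_rows relates to table_preview, the sketch below sums the per-table row counts from the example report shown earlier. It works on a plain list of dicts rather than a live DetectionReport, so it runs without Misata installed; the variable names are illustrative only.

```python
# Stand-in for report.table_preview, copied from the preview() example above.
table_preview = [
    {"name": "users", "rows": 5000, "columns": 12},
    {"name": "subscriptions", "rows": 5000, "columns": 8},
    {"name": "invoices", "rows": 20000, "columns": 6},
]

# total_rows is documented as the sum of row counts across all tables.
total_rows = sum(table["rows"] for table in table_preview)

# Find the largest table to see where most of the generated volume would go.
largest = max(table_preview, key=lambda table: table["rows"])

print(f"{total_rows:,} rows across {len(table_preview)} tables")
print(f"largest: {largest['name']} ({largest['rows']:,} rows)")
```

Checking the largest table first is usually the fastest way to spot a scale parameter that was parsed differently than intended.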
Interpreting confidence levels
report = misata.preview("A platform with crypto wallets and subscription payments")
print(report.domain_confidence) # "low" — ambiguous
print(report.near_misses) # {"crypto": ["crypto", "wallet"], "saas": ["subscription"]}
print(report.warnings)
# ["Domain 'fintech' matched on 1 keyword — consider naming the domain explicitly"]
# Fix: name the domain so it earns the +5 literal bonus
report2 = misata.preview("A fintech platform with crypto wallets and subscription payments")
print(report2.domain_confidence) # "high"
print(report2.domain) # "fintech"
Detection scoring:
- +5 if the literal domain name appears in the story
- +1 per matched keyword
The highest-scoring domain wins. Naming the domain explicitly ("fintech", "saas", etc.) always produces a "high" confidence result.
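The scoring rule above can be sketched in a few lines. This is a minimal illustration, not Misata's implementation: the DOMAIN_KEYWORDS table, the function names, and the substring matching are all assumptions made for the example, and Misata's real keyword lists are not shown here.

```python
# Hypothetical keyword table, invented for this sketch.
DOMAIN_KEYWORDS = {
    "saas": ["subscription", "mrr", "churn"],
    "fintech": ["payment", "wallet", "loan"],
    "ecommerce": ["order", "cart", "checkout"],
}

def score_domains(story):
    """Score each domain: +5 for the literal domain name, +1 per matched keyword."""
    text = story.lower()
    scores = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        # Simplification: plain substring matching, so "payment" also matches "payments".
        matched = [kw for kw in keywords if kw in text]
        score = (5 if domain in text else 0) + len(matched)
        if score > 0:
            scores[domain] = score
    return scores

def detect(story):
    """Return (winning domain or None, all non-zero scores)."""
    scores = score_domains(story)
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores

print(detect("A platform with wallets and subscription payments"))
print(detect("A fintech platform with wallets and subscription payments"))
```

In this sketch, adding the word "fintech" raises that domain's score from 2 to 7, which is why naming the domain settles close calls.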
Workflow: preview then generate
import misata
story = "A fintech with 5k customers, Black Friday spike"
# Step 1 — inspect for free
report = misata.preview(story, rows=5000)
if report.domain_confidence == "none":
    print("No domain detected:", report.warnings)
    raise SystemExit(1)
# Step 2 — optional: inspect full schema (tables, columns, FK relationships)
schema = misata.parse(story, rows=5000)
print(schema.summary())
# Step 3 — generate
tables = misata.generate_from_schema(schema, seed=42)
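One way to make the inspection step reusable is a small gate that looks at the documented report fields before any generation happens. The helper below is a hypothetical sketch: check_before_generate and max_rows are not part of Misata, and a SimpleNamespace stands in for the object returned by misata.preview() so the example runs on its own.

```python
from types import SimpleNamespace

def check_before_generate(report, max_rows=1_000_000):
    """Return a list of reasons not to generate; an empty list means go ahead."""
    problems = []
    if report.domain_confidence == "none":
        problems.append("no domain detected")
    if report.total_rows > max_rows:
        problems.append(
            f"{report.total_rows:,} rows exceeds budget of {max_rows:,}"
        )
    # Design choice for this sketch: treat every warning as blocking.
    problems.extend(report.warnings)
    return problems

# Stand-in report; only the documented fields used above are filled in.
fake_report = SimpleNamespace(
    domain_confidence="high", total_rows=30_000, warnings=[]
)
print(check_before_generate(fake_report))
print(check_before_generate(fake_report, max_rows=10_000))
```

Treating warnings as blocking is deliberately strict; a looser gate could log them and continue.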
CLI
validate_csv() — CSV quality profiling
Profile any CSV file for null rates, type inference, range statistics, and uniqueness violations. Returns a ValidationReport with a 0–100 quality score.
CLI
# Profile a CSV — type inference, null rates, range stats
misata validate customers.csv
# Check against a YAML schema
misata validate orders.csv --schema misata.yaml
# Parse a story and validate the CSV against it
misata validate orders.csv --story "A SaaS company with an orders table"
Example output:
Validating 'customers' — 5,000 rows × 8 columns
──────────────────────────────────────────────────────────────────────────
Column         Type         Nulls   Range / Values             Notes
──────────────────────────────────────────────────────────────────────────
customer_id    int          0.0%    1 → 5000                   unique · schema ✓
email          text         0.2%    4,989 unique               0.2% nulls · schema ✓
plan           categorical  0.0%    free, pro, enterprise      schema ✓
mrr            float        0.0%    0.0 → 2,399.8              schema ✓
signup_date    date         0.0%    2022-01-03 → 2024-12-29    unique
credit_score   int          1.1%    582 → 848                  1.1% nulls
is_active      boolean      0.0%
country        text         0.0%    94 unique
──────────────────────────────────────────────────────────────────────────
Quality score: 94/100
1 issue(s) found:
· credit_score: 1.1% nulls — column may be mostly empty
Python API
import misata
# Profile only
report = misata.validate_csv("customers.csv")
print(report.score) # 0–100
print(report.issues) # list of strings
print(report) # formatted table
# With a YAML schema — checks types, required columns, value constraints
schema = misata.load_yaml_schema("misata.yaml")
report = misata.validate_csv("customers.csv", schema=schema, table_name="customers")
# From a DataFrame directly
import pandas as pd
df = pd.read_csv("customers.csv")
report = misata.validate_csv(df, table_name="customers")
ValidationReport attributes
| Attribute | Type | Description |
|---|---|---|
| score | int | Quality score 0–100. Deductions for nulls >50%, type mismatches, uniqueness violations |
| rows | int | Number of rows profiled |
| columns | list[dict] | Per-column stats: name, inferred type, null rate, range, notes |
| issues | list[str] | Human-readable list of detected problems |
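To make the per-column stats concrete, here is a minimal standard-library sketch of the kind of profiling described above: null rate, distinct count, and a uniqueness flag per column. It is an illustration only and does not reproduce Misata's type inference or scoring; profile_csv and the sample data are hypothetical.

```python
import csv
import io

def profile_csv(text):
    """Compute per-column null rate and distinct-value count from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    stats = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        # Treat empty strings and missing cells as nulls.
        present = [v for v in values if v not in ("", None)]
        stats[column] = {
            "null_rate": 1 - len(present) / len(values) if values else 0.0,
            "distinct": len(set(present)),
            "unique": len(set(present)) == len(present),
        }
    return stats

sample = "customer_id,plan\n1,free\n2,pro\n3,\n4,free\n"
stats = profile_csv(sample)
for name, column_stats in stats.items():
    print(name, column_stats)
```

Here customer_id comes out fully populated and unique, while plan has one empty cell out of four and repeated values.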
validate_schema() — YAML schema validation
Misata ships a JSON Schema (misata/_schemas/misata_schema.json) that describes the full misata.yaml format. Calling validate_schema() runs two layers of validation:
- Structural — JSON Schema checks (required fields, types, allowed values)
- Semantic — domain-aware checks (FK targets exist, distribution params are in range, formula columns reference existing columns)
CLI
Python API
import misata
result = misata.validate_schema("misata.yaml")
if result.valid:
    print("Schema is valid")
else:
    for error in result.errors:
        print(error)
# Example errors:
# tables[0].columns[2]: 'distribution' must be one of: uniform, normal, lognormal, ...
# tables[1].columns[0]: formula references column 'gross_pay' which is not defined
Editor auto-complete
Add the $schema pointer to your misata.yaml to get in-editor validation and auto-complete in VS Code, PyCharm, and any editor that supports JSON Schema:
# misata.yaml
$schema: "https://rasinmuhammed.github.io/misata/schema/misata_schema.json"
tables:
  - name: users
    rows: 5000
    columns:
      - name: user_id
        type: int
        unique: true
      - name: email
        type: email
      - name: plan
        type: categorical
        values: [free, pro, enterprise]
VS Code will highlight unknown keys, warn on invalid distribution names, and auto-complete column type names as you type.
What the schema checks
| Check | Example error |
|---|---|
| Required fields present | tables[0]: 'name' is required |
| Column types valid | type must be one of: int, float, text, email, … |
| Distribution names valid | distribution must be one of: uniform, normal, lognormal, … |
| FK targets exist | foreign_key 'orders.customer_id' references unknown table 'users' |
| Formula columns reference existing columns | formula 'gross_pay * rate' references undefined column 'rate' |
| Distribution params in range | normal mean must be a number |
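The two semantic checks in the table (FK targets and formula references) could look roughly like the sketch below. It operates on an already-parsed schema dict and is not Misata's code: the `references` and `formula` keys, the error wording, and the function name are assumptions for illustration, and the actual misata.yaml keys may differ.

```python
import re

def semantic_errors(schema):
    """Check FK targets and formula references in a parsed schema dict."""
    errors = []
    table_names = {table["name"] for table in schema["tables"]}
    for table in schema["tables"]:
        column_names = {col["name"] for col in table["columns"]}
        for col in table["columns"]:
            where = f"{table['name']}.{col['name']}"
            # Hypothetical key: 'references' holds "table.column".
            ref = col.get("references")
            if ref and ref.split(".")[0] not in table_names:
                target = ref.split(".")[0]
                errors.append(f"{where}: references unknown table '{target}'")
            # Hypothetical key: 'formula' holds an arithmetic expression.
            # Simplification: every identifier is assumed to be a column name.
            for name in re.findall(r"[A-Za-z_]\w*", col.get("formula", "")):
                if name not in column_names:
                    errors.append(
                        f"{where}: formula references undefined column '{name}'"
                    )
    return errors

schema = {
    "tables": [
        {
            "name": "orders",
            "columns": [
                {"name": "customer_id", "references": "users.id"},
                {"name": "gross_pay"},
                {"name": "net_pay", "formula": "gross_pay * rate"},
            ],
        }
    ]
}
errors = semantic_errors(schema)
for error in errors:
    print(error)
```

Checks like these need the whole document in view, which is why they sit in a separate layer from the structural JSON Schema pass.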
Inspecting narrative curves
When a story contains growth patterns, parse() returns OutcomeCurve objects that show the exact monthly targets that will shape generated data:
import misata
schema = misata.parse(
    "SaaS mrr from $50k in Jan to $200k in Dec, Q3 slump, Black Friday spike",
    rows=5000
)
for oc in schema.outcome_curves:
    print(f"Curve: {oc.table}.{oc.column}")
    for pt in oc.curve_points:
        print(f"  Month {pt['month']:2d}: ${pt['target_value']:,.0f}")
Output:
Curve: subscriptions.mrr
Month 1: $50,000
Month 2: $63,636
Month 3: $77,273
Month 4: $90,909
Month 5: $104,545
Month 6: $118,182
Month 7: $82,727 ← Q3 slump (×0.72)
Month 8: $82,727 ← Q3 slump (×0.72)
Month 9: $82,727 ← Q3 slump (×0.72)
Month 10: $145,455
Month 11: $163,636 ← Black Friday (×1.55 on interpolated value)
Month 12: $200,000
This lets you verify the curve before generating 100k rows. For more, see the full narrative patterns guide.
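The values for months 1 through 6 and month 12 in the output above are consistent with a straight line from $50k to $200k. The sketch below reproduces only that baseline; linear_targets is a hypothetical helper, and the slump and spike adjustments are left out because the way Misata applies them is not described here.

```python
def linear_targets(start, end, months=12):
    """Evenly spaced monthly targets from start (month 1) to end (last month)."""
    step = (end - start) / (months - 1)
    return [start + step * i for i in range(months)]

targets = linear_targets(50_000, 200_000)
for month, value in enumerate(targets, start=1):
    print(f"  Month {month:2d}: ${value:,.0f}")
```

Comparing this baseline against the printed curve points is a quick way to see which months an event modifier has touched.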