Generate Synthetic Data from SQL DDL
misata.from_ddl() parses SQL CREATE TABLE statements and converts them into a SchemaConfig ready for data generation. Paste your real schema DDL and get synthetic data that matches your database structure — column types, foreign key relationships, and NOT NULL constraints all preserved.
Quick start
import misata
schema = misata.from_ddl("""
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
created_at TIMESTAMP
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT REFERENCES users(id),
amount DECIMAL(10, 2),
status VARCHAR(50),
placed_at TIMESTAMP
);
""")
tables = misata.generate_from_schema(schema)
print(tables["users"].head())
print(tables["orders"].head())
# FK integrity guaranteed
assert tables["orders"]["user_id"].isin(tables["users"]["id"]).all()
Supported SQL dialects
- PostgreSQL —
SERIAL,BIGSERIAL,TIMESTAMP WITH TIME ZONE, schema-qualified names,"quoted"identifiers - MySQL —
INT AUTO_INCREMENT,DATETIME,TINYINT - SQLite —
INTEGER PRIMARY KEY,TEXT,REAL - Generic ANSI SQL — standard
CREATE TABLEwithREFERENCESclauses
SQL type mapping
| SQL types | Misata type |
|---|---|
INT, INTEGER, BIGINT, SMALLINT, SERIAL |
int |
FLOAT, REAL, DOUBLE, DECIMAL, NUMERIC, MONEY |
float |
VARCHAR, TEXT, CHAR, STRING, UUID, JSON, JSONB |
text |
DATE |
date |
TIMESTAMP, DATETIME (any variant) |
date |
BOOLEAN, BOOL |
boolean |
Foreign key detection
Three sources of FK relationships are supported:
1. Inline REFERENCES clause:
2. Standalone FOREIGN KEY constraint:
3. Naming convention inference (infer_fks=True, default):
Columns ending in _id that don't have an explicit REFERENCES clause are automatically inferred as foreign keys to the table named by the prefix:
Disable with infer_fks=False if you don't want this behavior.
Configuration
schema = misata.from_ddl(
ddl,
infer_fks=True, # infer FKs from _id naming convention (default True)
default_rows=1000, # row count per table (default 1000)
)
Adjust per-table row counts after parsing:
schema = misata.from_ddl(ddl)
# Customize row counts
for table in schema.tables:
if table.name == "users":
table.row_count = 5000
elif table.name == "orders":
table.row_count = 20000
tables = misata.generate_from_schema(schema)
Full example: PostgreSQL schema
import misata
schema = misata.from_ddl("""
CREATE TABLE public.customers (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
country VARCHAR(2),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
CREATE TABLE public.products (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL,
price NUMERIC(10, 2) NOT NULL,
category VARCHAR(100),
created_at TIMESTAMP WITH TIME ZONE
);
CREATE TABLE public.orders (
id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL REFERENCES public.customers(id),
status VARCHAR(50) DEFAULT 'pending',
total NUMERIC(12, 2),
placed_at TIMESTAMP WITH TIME ZONE
);
CREATE TABLE public.order_items (
id BIGSERIAL PRIMARY KEY,
order_id BIGINT NOT NULL REFERENCES public.orders(id),
product_id BIGINT NOT NULL REFERENCES public.products(id),
quantity INT NOT NULL DEFAULT 1,
unit_price NUMERIC(10, 2) NOT NULL
);
""", default_rows=500)
# Scale appropriately
for t in schema.tables:
if t.name == "order_items":
t.row_count = 2000
tables = misata.generate_from_schema(schema, min_quality_score=80)
for name, df in tables.items():
print(f"{name}: {len(df)} rows")
Export to your database
Generate and seed directly:
import misata
schema = misata.from_ddl(open("schema.sql").read())
tables = misata.generate_from_schema(schema)
# Seed PostgreSQL
misata.seed_database(tables, "postgresql://user:pass@localhost/mydb")
Limitations
- Computed / generated columns —
GENERATED ALWAYS ASexpressions are not evaluated; the column is treated as its base type CHECKconstraints — parsed but not enforced during generation; usemisata.yamlfor custom constraint rules- Enum types —
CREATE TYPE ... AS ENUMis not parsed; usemisata.yamlto define categorical choices - Complex DEFAULT expressions —
DEFAULT NOW(),DEFAULT uuid_generate_v4()are ignored; Misata generates values from type-appropriate distributions
Related
- YAML Schema — full control over distributions and constraints
- Database Seeding — seed your generated data into a real database
- Validate — verify generated data quality