Skip to content

Generate Synthetic Data from SQL DDL

misata.from_ddl() parses SQL CREATE TABLE statements and converts them into a SchemaConfig ready for data generation. Paste your real schema DDL and get synthetic data that matches your database structure — column types, foreign key relationships, and NOT NULL constraints all preserved.

Quick start

import misata

schema = misata.from_ddl("""
    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        email VARCHAR(255) NOT NULL,
        created_at TIMESTAMP
    );

    CREATE TABLE orders (
        id SERIAL PRIMARY KEY,
        user_id INT REFERENCES users(id),
        amount DECIMAL(10, 2),
        status VARCHAR(50),
        placed_at TIMESTAMP
    );
""")

tables = misata.generate_from_schema(schema)
print(tables["users"].head())
print(tables["orders"].head())

# FK integrity guaranteed
assert tables["orders"]["user_id"].isin(tables["users"]["id"]).all()

Supported SQL dialects

  • PostgreSQLSERIAL, BIGSERIAL, TIMESTAMP WITH TIME ZONE, schema-qualified names, "quoted" identifiers
  • MySQLINT AUTO_INCREMENT, DATETIME, TINYINT
  • SQLiteINTEGER PRIMARY KEY, TEXT, REAL
  • Generic ANSI SQL — standard CREATE TABLE with REFERENCES clauses

SQL type mapping

SQL types Misata type
INT, INTEGER, BIGINT, SMALLINT, SERIAL int
FLOAT, REAL, DOUBLE, DECIMAL, NUMERIC, MONEY float
VARCHAR, TEXT, CHAR, STRING, UUID, JSON, JSONB text
DATE date
TIMESTAMP, DATETIME (any variant) date
BOOLEAN, BOOL boolean

Foreign key detection

Three sources of FK relationships are supported:

1. Inline REFERENCES clause:

user_id INT REFERENCES users(id)

2. Standalone FOREIGN KEY constraint:

FOREIGN KEY (user_id) REFERENCES users(id)

3. Naming convention inference (infer_fks=True, default):

Columns ending in _id that don't have an explicit REFERENCES clause are automatically inferred as foreign keys to the table named by the prefix:

-- order_id → orders(id) inferred automatically
order_id INT

Disable with infer_fks=False if you don't want this behavior.

Configuration

schema = misata.from_ddl(
    ddl,
    infer_fks=True,       # infer FKs from _id naming convention (default True)
    default_rows=1000,    # row count per table (default 1000)
)

Adjust per-table row counts after parsing:

schema = misata.from_ddl(ddl)
# Customize row counts
for table in schema.tables:
    if table.name == "users":
        table.row_count = 5000
    elif table.name == "orders":
        table.row_count = 20000

tables = misata.generate_from_schema(schema)

Full example: PostgreSQL schema

import misata

schema = misata.from_ddl("""
    CREATE TABLE public.customers (
        id          BIGSERIAL PRIMARY KEY,
        email       VARCHAR(255) NOT NULL,
        country     VARCHAR(2),
        created_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW()
    );

    CREATE TABLE public.products (
        id          BIGSERIAL PRIMARY KEY,
        name        TEXT NOT NULL,
        price       NUMERIC(10, 2) NOT NULL,
        category    VARCHAR(100),
        created_at  TIMESTAMP WITH TIME ZONE
    );

    CREATE TABLE public.orders (
        id          BIGSERIAL PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES public.customers(id),
        status      VARCHAR(50) DEFAULT 'pending',
        total       NUMERIC(12, 2),
        placed_at   TIMESTAMP WITH TIME ZONE
    );

    CREATE TABLE public.order_items (
        id          BIGSERIAL PRIMARY KEY,
        order_id    BIGINT NOT NULL REFERENCES public.orders(id),
        product_id  BIGINT NOT NULL REFERENCES public.products(id),
        quantity    INT NOT NULL DEFAULT 1,
        unit_price  NUMERIC(10, 2) NOT NULL
    );
""", default_rows=500)

# Scale appropriately
for t in schema.tables:
    if t.name == "order_items":
        t.row_count = 2000

tables = misata.generate_from_schema(schema, min_quality_score=80)
for name, df in tables.items():
    print(f"{name}: {len(df)} rows")

Export to your database

Generate and seed directly:

import misata

schema = misata.from_ddl(open("schema.sql").read())
tables = misata.generate_from_schema(schema)

# Seed PostgreSQL
misata.seed_database(tables, "postgresql://user:pass@localhost/mydb")

Limitations

  • Computed / generated columnsGENERATED ALWAYS AS expressions are not evaluated; the column is treated as its base type
  • CHECK constraints — parsed but not enforced during generation; use misata.yaml for custom constraint rules
  • Enum typesCREATE TYPE ... AS ENUM is not parsed; use misata.yaml to define categorical choices
  • Complex DEFAULT expressionsDEFAULT NOW(), DEFAULT uuid_generate_v4() are ignored; Misata generates values from type-appropriate distributions
  • YAML Schema — full control over distributions and constraints
  • Database Seeding — seed your generated data into a real database
  • Validate — verify generated data quality