Generate HR and Workforce Synthetic Data in Python
HR data is deeply sensitive: salaries, dates of birth, and other compensation details are among the most protected data in any organisation. But HR analytics tools, payroll system integrations, workforce planning models, and people analytics dashboards all need realistic employee data to develop against. Misata generates a coherent synthetic HR dataset: departments with realistic headcounts, employees whose salaries fit their age and seniority, and payroll records where net_pay is mathematically consistent with gross_pay and tax_withheld.
The temporal coherence rules are the key differentiator: every employee's hire_date is at least 18 years after their date_of_birth and never in the future, and tenure_years is derived directly from hire_date rather than from a separate random distribution that could produce impossible values such as negative tenure.
import misata
tables = misata.generate("A tech company with 1000 employees and 4 departments", rows=1000, seed=42)
print(list(tables.keys())) # ['departments', 'employees', 'payroll']
print(tables["employees"][["role", "seniority", "salary", "tenure_years"]].describe())
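The temporal coherence rules described above can be illustrated without Misata. The following is a minimal sketch using only the standard library; the helper names `add_years`, `sample_hire_date`, and `tenure_years` are hypothetical and not part of Misata's API, and the code shows the rule itself rather than how Misata implements it.

```python
import random
from datetime import date, timedelta

def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def sample_hire_date(dob: date, today: date, rng: random.Random) -> date:
    """Pick a hire date between the 18th birthday and today (inclusive)."""
    earliest = add_years(dob, 18)
    if earliest > today:
        raise ValueError("person is not yet 18")
    span_days = (today - earliest).days
    return earliest + timedelta(days=rng.randint(0, span_days))

def tenure_years(hire: date, today: date) -> float:
    """Derive tenure from hire_date instead of sampling it independently."""
    return round((today - hire).days / 365.25, 2)

rng = random.Random(42)
today = date(2024, 6, 30)   # fixed reference date so the example is repeatable
dob = date(1990, 2, 14)
hire = sample_hire_date(dob, today, rng)
tenure = tenure_years(hire, today)
```

Because tenure is computed from the hire date, it cannot be negative as long as the hire date is not in the future.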
What Misata generates
Three tables: departments → employees → payroll. Every employee belongs to a department; every payroll record belongs to an employee. Salary, tenure, and seniority are logically consistent across the entire dataset.
Tables and columns
| Table | Key columns |
|---|---|
| departments | department_id, name, head_count, budget, location |
| employees | employee_id, department_id, name, email, role, seniority, hire_date, date_of_birth, salary, tenure_years |
| payroll | payroll_id, employee_id, period_start, gross_pay, tax_withheld, net_pay, pay_type |
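The parent-child relationships between these tables can be checked with a small pandas helper. This is a sketch only: `orphan_rows` is a hypothetical helper, not a Misata function, and the toy frames stand in for `tables["departments"]`, `tables["employees"]`, and `tables["payroll"]`.

```python
import pandas as pd

def orphan_rows(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return child rows whose foreign key has no match in the parent table."""
    return child[~child[key].isin(parent[key])]

# Toy stand-ins for the generated tables (only the key columns are shown).
departments = pd.DataFrame({"department_id": [1, 2]})
employees = pd.DataFrame({"employee_id": [10, 11, 12], "department_id": [1, 2, 2]})
payroll = pd.DataFrame({"payroll_id": [100, 101], "employee_id": [10, 12]})

# Every employee should reference a department, every payroll row an employee.
assert orphan_rows(employees, departments, "department_id").empty
assert orphan_rows(payroll, employees, "employee_id").empty
```

Running the same two assertions against the generated tables is a quick way to confirm referential integrity before loading the data elsewhere.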
Realistic distributions
- Salary by seniority: junior ~$65k, mid ~$95k, senior ~$140k, lead ~$180k — lognormal within each band for realistic spread
- Tax rate: Beta(3, 7) clipped to 18–40% — not uniform, not a fixed percentage
- `tenure_years` is derived from `hire_date`, not random: no employee has negative or impossible tenure
- Age coherence: hire_date is always ≥18 years after date_of_birth, and never in the future
- `net_pay` = `gross_pay × (1 − tax_withheld)`: mathematically enforced on every row
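As a rough illustration, distributions like the ones listed above can be sampled with Python's standard library. The band values come from the list; treating them as medians, the spread `SIGMA = 0.15`, and the helper names are all assumptions made for this sketch, not Misata's actual parameters or API. `tax_withheld` is treated as a rate, as the net_pay formula implies.

```python
import math
import random

# Typical salary per seniority band, taken from the list above (used as medians).
BAND_MEDIAN = {"junior": 65_000, "mid": 95_000, "senior": 140_000, "lead": 180_000}
SIGMA = 0.15  # assumed spread within a band; not a documented Misata value

def sample_salary(seniority: str, rng: random.Random) -> float:
    """Lognormal draw whose median equals the band's typical salary."""
    mu = math.log(BAND_MEDIAN[seniority])
    return round(rng.lognormvariate(mu, SIGMA), 2)

def sample_tax_rate(rng: random.Random) -> float:
    """Beta(3, 7) draw (mean 0.30), clipped to the 18-40% range."""
    return min(max(rng.betavariate(3, 7), 0.18), 0.40)

def payroll_row(gross_pay: float, rng: random.Random) -> dict:
    """net_pay is computed from the other two columns, never sampled."""
    rate = sample_tax_rate(rng)
    return {
        "gross_pay": gross_pay,
        "tax_withheld": rate,
        "net_pay": round(gross_pay * (1 - rate), 2),
    }

rng = random.Random(42)
salary = sample_salary("senior", rng)
row = payroll_row(round(salary / 12, 2), rng)  # one monthly payroll record
```

Deriving `net_pay` inside `payroll_row` is what keeps the formula true on every row; sampling it separately would break the relationship.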
Quick start
import misata
tables = misata.generate(
"A tech company with 1000 employees, monthly payroll, engineering and sales departments",
rows=1000,
seed=42,
)
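# Verify age coherence: no employee is hired before their 18th birthday
import pandas as pd
employees = tables["employees"].copy()
employees["hire_date"] = pd.to_datetime(employees["hire_date"])
employees["date_of_birth"] = pd.to_datetime(employees["date_of_birth"])
# Compare against the exact 18th birthday; dividing days by 365 ignores leap days
eighteenth_birthday = employees["date_of_birth"] + pd.DateOffset(years=18)
assert (employees["hire_date"] >= eighteenth_birthday).all()
# hire_date must not be in the future either
assert (employees["hire_date"] <= pd.Timestamp.today()).all()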
# Salary distribution by seniority
print(employees.groupby("seniority")["salary"].describe())
Common use cases
- People analytics platform development — build attrition dashboards, salary band analyses, and diversity reports on realistic employee data before your HRIS is connected
- Payroll system integration testing — validate your payroll calculation engine against thousands of employees with varied tax rates and pay types
- GDPR-safe HR reporting — replace real employee exports with synthetic equivalents for vendor demos and external audits
- Workforce planning model training — generate historical headcount and attrition data across departments to train staffing prediction models
- Compensation benchmarking tools — build salary comparison features against synthetic market data without licensing real salary surveys
- Applicant tracking system (ATS) load testing — generate realistic employee databases with department hierarchies for performance testing
Advanced: headcount growth curves
Model headcount evolution over time — hiring sprees, layoffs, and department restructuring:
tables = misata.generate(
"Tech company with 1000 employees — engineering headcount doubled in 2022, "
"layoffs in Q1 2023, rehiring from Q3 2023",
rows=1000,
seed=42,
)
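One way to see whether a generated dataset reflects a narrative like this is to count hires per quarter from hire_date. The following is a minimal pandas sketch; `hires_per_quarter` is a hypothetical helper, the toy frame stands in for `tables["employees"]`, and the snippet makes no assumption about how closely Misata follows each described event.

```python
import pandas as pd

def hires_per_quarter(employees: pd.DataFrame) -> pd.Series:
    """Count hires in each calendar quarter, based on hire_date."""
    hire_dates = pd.to_datetime(employees["hire_date"])
    return hire_dates.dt.to_period("Q").value_counts().sort_index()

# Toy stand-in for tables["employees"].
employees = pd.DataFrame(
    {"hire_date": ["2022-01-15", "2022-02-03", "2022-08-20", "2023-07-01"]}
)
quarterly = hires_per_quarter(employees)
cumulative = quarterly.cumsum()  # running total of hires over time
```

The employees table listed earlier has no termination date column, so this view shows hiring activity only; layoffs would not be visible from hire_date alone.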
Advanced: locale-aware generation
# Indian IT company — Indian names, INR salaries, PAN/Aadhaar references
tables = misata.generate("Indian IT services company with 500 employees", rows=500)
# German company — German names, EUR salaries, German tax brackets
tables = misata.generate("German manufacturing company with 800 employees, EUR payroll", rows=800)
# UK workforce — GBP salaries, PAYE tax structure
tables = misata.generate("UK technology company with 300 employees, GBP payroll", rows=300)
Advanced: quality-guaranteed generation
tables = misata.generate(
"Tech company with 1000 employees",
min_quality_score=85,
smart_correlations=True, # auto-adds tenure↔salary, experience↔compensation
rows=1000,
seed=42,
)
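A simple spot check for the tenure and salary relationship mentioned in the comment above is a Pearson correlation between the two columns. This sketch uses a toy frame in place of `tables["employees"]`; the values are made up for illustration and say nothing about the correlation strength Misata actually produces.

```python
import pandas as pd

# Toy stand-in for tables["employees"]; values are illustrative only.
employees = pd.DataFrame({
    "tenure_years": [0.5, 2.0, 4.5, 7.0, 10.0],
    "salary": [62_000, 70_000, 98_000, 135_000, 176_000],
})

# Pearson correlation (pandas default); a positive value means salary
# tends to rise with tenure in this sample.
tenure_salary_corr = employees["tenure_years"].corr(employees["salary"])
assert tenure_salary_corr > 0
```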
Formula consistency
Payroll records use formula columns: net_pay is not independently sampled but derived from gross_pay and tax_withheld, which the formula treats as a rate rather than an amount:
# Every row satisfies: net_pay = gross_pay * (1 - tax_withheld)
payroll = tables["payroll"]
calculated = payroll["gross_pay"] * (1 - payroll["tax_withheld"])
assert (abs(payroll["net_pay"] - calculated) < 0.01).all()