Generating Test Data
Data Studio uses AI to generate realistic test data based on your schema. This guide covers the generation workflow, prompting strategies, locale support, and version management.
How Generation Works
When you generate data, Data Studio:
- Reads your schema and field descriptions
- Sends a prompt to the AI with context about the data shape
- Generates the requested number of rows
- Validates each row against the schema
- Stores the result as a new immutable version
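The steps above can be sketched in a few lines of Python. This is an illustration only, not Data Studio's implementation: the `Schema` shape, `validate_row`, `generate_version`, and the `fake_row` stand-in for the AI call are all hypothetical names, and raising an error on an invalid row is an assumption (the guide does not say how validation failures are handled).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical schema shape: field name -> expected Python type.
Schema = dict[str, type]
Row = dict[str, object]


@dataclass(frozen=True)  # frozen mirrors the "immutable version" idea
class Version:
    number: int
    rows: tuple[Row, ...]


def validate_row(row: Row, schema: Schema) -> bool:
    """Return True when the row has exactly the schema's fields with matching types."""
    return row.keys() == schema.keys() and all(
        isinstance(row[name], expected) for name, expected in schema.items()
    )


def generate_version(
    schema: Schema,
    row_count: int,
    history: list[Version],
    produce_row: Callable[[Schema, int], Row],
) -> Version:
    """Produce rows, validate each one, and append a new version to the history."""
    rows = [produce_row(schema, i) for i in range(row_count)]
    invalid = [i for i, row in enumerate(rows) if not validate_row(row, schema)]
    if invalid:
        # Assumption: reject the whole generation if any row fails validation.
        raise ValueError(f"rows failed schema validation: {invalid}")
    version = Version(number=len(history) + 1, rows=tuple(rows))
    history.append(version)  # earlier versions are never modified
    return version


def fake_row(schema: Schema, index: int) -> Row:
    """Stand-in for the AI call: one placeholder value per field."""
    row: Row = {}
    for name, expected in schema.items():
        # Strings get a readable placeholder; other types use their zero value.
        row[name] = f"{name}-{index}" if expected is str else expected()
    return row


history: list[Version] = []
schema: Schema = {"name": str, "age": int}
v1 = generate_version(schema, 3, history, fake_row)
v2 = generate_version(schema, 5, history, fake_row)
```

After the two calls, `history` holds both versions, and generating `v2` leaves `v1` untouched, which is the behavior the Versions tab relies on.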
Each generation creates a new version. Previous versions remain accessible in the Versions tab.
Generating Data
- Open a dataset and click the Generate button in the toolbar
- Configure the generation in the dialog:
  - Row count: number of records to generate (1 to your tier max)
  - Prompt (optional): natural-language context for the AI (e.g., "Generate B2B SaaS customers with enterprise-tier pricing")
  - Locale (optional): language/region for internationalized data (defaults to en_US)
- Click Generate
Small Datasets (Under 500 Rows)
Generation completes immediately. The Data tab refreshes to show the new records, and a new entry appears in the Versions tab.
Large Datasets (500+ Rows)
A progress bar appears in the toolbar showing generation progress. You can navigate away — the generation continues in the background. When complete:
- A notification appears confirming the generation finished
- The Data tab updates with the new records
- The Versions tab shows the new version with row count, duration, and tokens used
Writing Effective Prompts
Prompts give the AI semantic context that field descriptions alone do not provide. Compare the output with and without a prompt:
| Prompt | AI Output |
|---|---|
| (empty) | Generic customers with random attributes |
| "Generate Fortune 500 enterprise customers with multi-year contracts and annual revenue over $1M" | Enterprise-focused customers with realistic high-value attributes |
Prompt Tips
- Be specific about the domain — "B2B SaaS" vs "B2C retail" produces very different data
- Describe relationships — "customers who recently churned" adds temporal context
- Specify value ranges — "revenue between $50K and $500K" constrains numeric output
- Include temporal context — "signed up in the last 30 days" produces recent timestamps
- Name the persona — "healthcare providers in California" narrows the output
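If you reuse the same kinds of context often, it can help to assemble prompts from named parts so that none of the tips is forgotten. The helper below is a hypothetical sketch: `build_prompt` and its parameters are not part of Data Studio, and the prompt you paste into the dialog is still plain text.

```python
def build_prompt(
    entity: str,
    *,
    domain: str = "",
    relationship: str = "",
    value_range: str = "",
    time_window: str = "",
) -> str:
    """Assemble a generation prompt from optional pieces of context."""
    # The domain goes before the entity, e.g. "B2B SaaS customers".
    subject = f"{domain} {entity}".strip()
    # Remaining details are appended in a fixed order; empty ones are skipped.
    details = [d for d in (relationship, value_range, time_window) if d]
    if not details:
        return f"Generate {subject}"
    return f"Generate {subject} " + ", ".join(details)


churn_prompt = build_prompt(
    "customers",
    domain="B2B SaaS",
    relationship="who recently churned",
    value_range="with revenue between $50K and $500K",
)
persona_prompt = build_prompt("healthcare providers in California")
```

Here `entity` carries the persona, and each keyword argument corresponds to one of the tips above.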
Example Prompts by Use Case
| Use Case | Prompt |
|---|---|
| Integration testing | "Generate active users with verified emails and complete profiles" |
| Load testing | "Generate diverse customer demographics across all US regions" |
| Error handling | "Generate users with missing fields, invalid emails, and expired subscriptions" |
| Demo data | "Generate polished, realistic Fortune 100 companies with recognizable industry names" |
| Edge cases | "Generate records with Unicode names, maximum-length strings, and boundary numeric values" |
Locale Support
Select a locale from the Locale dropdown in the generation dialog to produce internationalized data.
Supported locales:
| Locale | Region |
|---|---|
| en_US | United States English (default) |
| en_GB | British English |
| fr_FR | French (France) |
| de_DE | German (Germany) |
| es_ES | Spanish (Spain) |
| ja_JP | Japanese (Japan) |
Locales affect name generation (culture-appropriate names), address formats, phone number formats, and date/time conventions.
Combine locale with a prompt for the best results. Select fr_FR and add "Generate French customers with Paris addresses" to get fully localized output.
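As a rough model of the dialog settings, the sketch below pairs a prompt with a locale and rejects locales that are not in the table above. `GenerationSettings` and `SUPPORTED_LOCALES` are hypothetical names used for illustration; Data Studio exposes these options through the generation dialog, and this guide does not describe a programmatic interface.

```python
from dataclasses import dataclass

# Locale codes and regions copied from the supported-locales table.
SUPPORTED_LOCALES = {
    "en_US": "United States English",
    "en_GB": "British English",
    "fr_FR": "French (France)",
    "de_DE": "German (Germany)",
    "es_ES": "Spanish (Spain)",
    "ja_JP": "Japanese (Japan)",
}


@dataclass(frozen=True)
class GenerationSettings:
    row_count: int
    prompt: str = ""        # optional in the dialog
    locale: str = "en_US"   # the documented default

    def __post_init__(self) -> None:
        if self.row_count < 1:
            raise ValueError("row_count must be at least 1")
        if self.locale not in SUPPORTED_LOCALES:
            raise ValueError(f"unsupported locale: {self.locale}")


# Locale and prompt combined, as recommended above.
french = GenerationSettings(
    row_count=50,
    prompt="Generate French customers with Paris addresses",
    locale="fr_FR",
)
defaults = GenerationSettings(row_count=10)
```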
Version History
Every generation creates an immutable version. Open the Versions tab to see the full history.
Each version row shows:
- Version number and timestamp
- Row count — records in this version
- Tokens used — AI tokens consumed
- Duration — how long generation took
- Prompt — the prompt used (if any)
Comparing Versions
- Open the Versions tab
- Click a version row to preview its data
- Use version history to compare data across generations or roll back to an earlier dataset
Switching the Active Version
By default, the Data tab shows the latest version. To view an older version:
- Open the Versions tab
- Click the version you want to inspect
- The Data tab updates to show that version's records
Token Usage
Each generation consumes AI tokens. Token usage is displayed:
- In the generation completion notification
- On each version row in the Versions tab
- In the organization-level usage dashboard under Settings > Usage
Factors affecting token usage:
- Number of rows generated
- Schema complexity (field count, nested objects)
- Prompt length
- Locale variants
Optimizing usage:
- Generate larger batches instead of many small generations
- Use concise, targeted field descriptions
- Reuse existing versions when the data hasn't changed
Rate Limits
| Tier | Generations per Hour | Max Rows per Generation |
|---|---|---|
| Free | 10 | 1,000 |
| Pro | 100 | 10,000 |
| Enterprise | Unlimited | Unlimited |
When you reach your tier's hourly generation limit, the Generate button is disabled and a banner shows when the limit resets. Limits reset hourly.
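The table can be read as a simple lookup. The sketch below encodes it and checks whether a planned generation fits a tier. The names `TIER_LIMITS` and `check_generation` are hypothetical, and the caller supplies the count of generations already used in the current hour, because this guide does not specify whether the hourly window is fixed or rolling.

```python
from typing import Optional

# Limits copied from the table above; None stands for "Unlimited".
TIER_LIMITS = {
    "Free": {"generations_per_hour": 10, "max_rows": 1_000},
    "Pro": {"generations_per_hour": 100, "max_rows": 10_000},
    "Enterprise": {"generations_per_hour": None, "max_rows": None},
}


def check_generation(
    tier: str, row_count: int, generations_this_hour: int
) -> Optional[str]:
    """Return None if the generation fits the tier, otherwise a short reason."""
    limits = TIER_LIMITS[tier]
    if row_count < 1:
        return "row count must be at least 1"
    max_rows = limits["max_rows"]
    if max_rows is not None and row_count > max_rows:
        return f"row count exceeds the {tier} maximum of {max_rows}"
    per_hour = limits["generations_per_hour"]
    if per_hour is not None and generations_this_hour >= per_hour:
        return "hourly generation limit reached"
    return None
```

For example, `check_generation("Free", 1_000, 9)` returns `None`, while a tenth prior generation in the same hour or a request for 1,001 rows returns a reason string.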
Troubleshooting
Generated data is too generic
Add more specific prompts and field descriptions. A field named status with the description "Order fulfillment status: pending, processing, shipped, delivered, cancelled" produces far better results than status alone.
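One way to catch thin descriptions before generating is a quick check over the schema. The sketch below assumes a hypothetical schema shape (a list of field dictionaries with `name`, `type`, and an optional `description`); Data Studio's actual schema format may differ, and the three-word threshold is an arbitrary choice for illustration.

```python
def fields_needing_description(schema: list[dict], min_words: int = 3) -> list[str]:
    """Return names of fields whose description is missing or shorter than min_words."""
    return [
        field["name"]
        for field in schema
        if len(field.get("description", "").split()) < min_words
    ]


# The same field with and without the description from the example above.
bare_schema = [{"name": "status", "type": "string"}]
described_schema = [
    {
        "name": "status",
        "type": "string",
        "description": (
            "Order fulfillment status: pending, processing, "
            "shipped, delivered, cancelled"
        ),
    }
]

flagged = fields_needing_description(bare_schema)
clean = fields_needing_description(described_schema)
```

Fields that show up in the result are good candidates for a more specific description.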
Data doesn't match the schema
Check for type mismatches in your schema. Ensure field descriptions are consistent with the selected types.
Large generation seems stuck
Generations of 10,000+ rows can take several minutes. The progress bar updates periodically. You can navigate away and return — the generation continues in the background.