
Generating Test Data

Data Studio uses AI to generate realistic test data based on your schema. This guide covers the generation workflow, prompting strategies, locale support, and version management.

How Generation Works

When you generate data, Data Studio:

  1. Reads your schema and field descriptions
  2. Sends a prompt to the AI with context about the data shape
  3. Generates the requested number of rows
  4. Validates each row against the schema
  5. Stores the result as a new immutable version

Because versions are immutable, a new generation never overwrites an earlier one. Previous versions remain accessible in the Versions tab.
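The five steps above can be sketched in a few lines of Python. This is an illustrative model only, not Data Studio's implementation: the schema shape, the `Version` class, the `model` callable, and the choice to drop invalid rows are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical schema shape: field name -> (Python type, description).
Schema = dict
Row = dict


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Version:
    number: int
    rows: tuple
    prompt: str


def build_prompt(schema: Schema, row_count: int, user_prompt: str = "") -> str:
    """Steps 1-2: describe the data shape and add the optional user context."""
    fields = "\n".join(
        f"- {name} ({typ.__name__}): {desc}" for name, (typ, desc) in schema.items()
    )
    return f"Generate {row_count} rows.\nFields:\n{fields}\nContext: {user_prompt}"


def is_valid(row: Row, schema: Schema) -> bool:
    """Step 4: the row has exactly the schema's fields, each with the declared type."""
    return set(row) == set(schema) and all(
        isinstance(row[name], typ) for name, (typ, _) in schema.items()
    )


def generate(schema: Schema, row_count: int, model: Callable, history: list,
             user_prompt: str = "") -> Version:
    prompt = build_prompt(schema, row_count, user_prompt)
    rows = model(prompt)                                   # step 3
    valid = [r for r in rows if is_valid(r, schema)]       # step 4 (assumption: drop invalid rows)
    version = Version(len(history) + 1, tuple(valid), user_prompt)
    history.append(version)                                # step 5: append, never overwrite
    return version


# Usage with a stand-in model that returns one valid and one invalid row.
schema = {"name": (str, "Full name"), "age": (int, "Age in years")}
fake_model = lambda prompt: [{"name": "Ada", "age": 36}, {"name": "Bob", "age": "n/a"}]
history = []
first = generate(schema, 2, fake_model, history)
print(first.number, len(first.rows))
```

Appending to `history` instead of replacing an entry mirrors the guarantee that earlier versions stay available.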

Generating Data

  1. Open a dataset and click the Generate button in the toolbar
  2. Configure the generation in the dialog:
    • Row count — number of records to generate (1 to your tier max)
    • Prompt (optional) — natural-language context for the AI (e.g., "Generate B2B SaaS customers with enterprise-tier pricing")
    • Locale (optional) — language/region for internationalized data (defaults to en_US)
  3. Click Generate

Small Datasets (Under 500 Rows)

Generation completes immediately. The Data tab refreshes to show the new records, and a new entry appears in the Versions tab.

Large Datasets (500+ Rows)

A progress bar appears in the toolbar showing generation progress. You can navigate away — the generation continues in the background. When complete:

  • A notification appears confirming the generation finished
  • The Data tab updates with the new records
  • The Versions tab shows the new version with row count, duration, and tokens used
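The dialog options and the 500-row threshold can be modeled as a small request object. This is a sketch under stated assumptions: `GenerationRequest`, `validate`, and `runs_in_background` are hypothetical names, and only the default locale, the row-count range, and the 500-row threshold come from this guide.

```python
from dataclasses import dataclass
from typing import Optional

BACKGROUND_THRESHOLD = 500  # from this guide: 500+ rows generate in the background
SUPPORTED_LOCALES = {"en_US", "en_GB", "fr_FR", "de_DE", "es_ES", "ja_JP"}


@dataclass
class GenerationRequest:  # hypothetical name, for illustration only
    row_count: int
    prompt: str = ""        # optional
    locale: str = "en_US"   # documented default

    def validate(self, tier_max: Optional[int]) -> None:
        """Raise ValueError on bad input. tier_max=None models an unlimited tier."""
        if self.row_count < 1:
            raise ValueError("row_count must be at least 1")
        if tier_max is not None and self.row_count > tier_max:
            raise ValueError(f"row_count exceeds the tier max of {tier_max}")
        if self.locale not in SUPPORTED_LOCALES:
            raise ValueError(f"unsupported locale: {self.locale}")

    def runs_in_background(self) -> bool:
        return self.row_count >= BACKGROUND_THRESHOLD


request = GenerationRequest(row_count=2_000, prompt="B2B SaaS customers")
request.validate(tier_max=10_000)
print(request.runs_in_background())
```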

Writing Effective Prompts

Prompts provide semantic context beyond field descriptions. Compare the output with and without a prompt:

Prompt | AI Output
(empty) | Generic customers with random attributes
"Generate Fortune 500 enterprise customers with multi-year contracts and annual revenue over $1M" | Enterprise-focused customers with realistic high-value attributes

Prompt Tips

  • Be specific about the domain — "B2B SaaS" vs "B2C retail" produces very different data
  • Describe relationships — "customers who recently churned" ties each record to an event in its history
  • Specify value ranges — "revenue between $50K and $500K" constrains numeric output
  • Include temporal context — "signed up in the last 30 days" produces recent timestamps
  • Name the persona — "healthcare providers in California" narrows the output
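If you build prompts programmatically (for example, to keep them consistent across a team), the tips above can be combined with a small helper. `compose_prompt` is a hypothetical function written for this example; Data Studio only needs the final string.

```python
from typing import Optional


def compose_prompt(
    subject: str,
    *,
    domain: Optional[str] = None,
    relationship: Optional[str] = None,
    value_range: Optional[str] = None,
    timeframe: Optional[str] = None,
) -> str:
    """Join the optional clauses into one natural-language prompt."""
    clauses = [f"Generate {domain} {subject}" if domain else f"Generate {subject}"]
    if relationship:
        clauses.append(f"who {relationship}")
    if value_range:
        clauses.append(f"with {value_range}")
    if timeframe:
        clauses.append(f"that {timeframe}")
    return " ".join(clauses)


print(compose_prompt(
    "customers",
    domain="B2B SaaS",
    value_range="revenue between $50K and $500K",
    timeframe="signed up in the last 30 days",
))
```

Keeping each clause optional lets you start with a domain only and add constraints as the output proves too generic.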

Example Prompts by Use Case

Use Case | Prompt
Integration testing | "Generate active users with verified emails and complete profiles"
Load testing | "Generate diverse customer demographics across all US regions"
Error handling | "Generate users with missing fields, invalid emails, and expired subscriptions"
Demo data | "Generate polished, realistic Fortune 100 companies with recognizable industry names"
Edge cases | "Generate records with Unicode names, maximum-length strings, and boundary numeric values"

Locale Support

Select a locale from the Locale dropdown in the generation dialog to produce internationalized data.

Supported locales:

Locale | Region
en_US | United States English (default)
en_GB | British English
fr_FR | French (France)
de_DE | German (Germany)
es_ES | Spanish (Spain)
ja_JP | Japanese (Japan)

Locales affect name generation (culture-appropriate names), address formats, phone number formats, and date/time conventions.

Tip: Combine locale with a prompt for the best results. Select fr_FR and add "Generate French customers with Paris addresses" to get fully localized output.
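To see why locale matters for downstream tests, consider how the same date is conventionally written in each supported locale. The patterns below are common numeric conventions for these regions, not confirmed Data Studio output formats, and `format_date` is a hypothetical helper.

```python
from datetime import date

# Common numeric date conventions per locale (assumption: the generated
# data may use other, equally valid forms such as long month names).
DATE_PATTERNS = {
    "en_US": "%m/%d/%Y",
    "en_GB": "%d/%m/%Y",
    "fr_FR": "%d/%m/%Y",
    "de_DE": "%d.%m.%Y",
    "es_ES": "%d/%m/%Y",
    "ja_JP": "%Y/%m/%d",
}


def format_date(value: date, locale: str = "en_US") -> str:
    """Format a date using the locale's conventional numeric pattern."""
    return value.strftime(DATE_PATTERNS[locale])


sample = date(2024, 3, 5)
for locale in DATE_PATTERNS:
    print(locale, format_date(sample, locale))
```

A parser that assumes month-first dates will misread day-first data, which is the kind of bug localized test data is meant to surface.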

Version History

Every generation creates an immutable version. Open the Versions tab to see the full history.

Each version row shows:

  • Version number and timestamp
  • Row count — records in this version
  • Tokens used — AI tokens consumed
  • Duration — how long generation took
  • Prompt — the prompt used (if any)

Comparing Versions

  1. Open the Versions tab
  2. Click a version row to preview its data
  3. Repeat with another version to compare data across generations or to find an earlier dataset to roll back to
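Because each generation produces fresh records, comparing versions row by row is rarely useful; comparing distributions usually tells you more. The sketch below assumes you have the rows of two versions available as lists of dictionaries (how you obtain them is outside this guide), and `compare_field` is a hypothetical helper.

```python
from collections import Counter


def compare_field(old_rows: list, new_rows: list, field: str) -> dict:
    """Return, per distinct value of `field`, the change in count from old to new."""
    old = Counter(row.get(field) for row in old_rows)
    new = Counter(row.get(field) for row in new_rows)
    values = sorted(set(old) | set(new), key=str)
    return {value: new[value] - old[value] for value in values}


version_1 = [{"plan": "free"}, {"plan": "free"}, {"plan": "pro"}]
version_2 = [{"plan": "free"}, {"plan": "pro"}, {"plan": "enterprise"}]
print(compare_field(version_1, version_2, "plan"))
```

A positive number means the value is more common in the newer version, a negative number means it is less common.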

Switching the Active Version

By default, the Data tab shows the latest version. To view an older version:

  1. Open the Versions tab
  2. Click the version you want to inspect
  3. The Data tab updates to show that version's records

Token Usage

Each generation consumes AI tokens. Token usage is displayed:

  • In the generation completion notification
  • On each version row in the Versions tab
  • In the organization-level usage dashboard under Settings > Usage

Factors affecting token usage:

  • Number of rows generated
  • Schema complexity (field count, nested objects)
  • Prompt length
  • Locale variants

Optimizing usage:

  • Generate larger batches instead of many small generations
  • Use concise, targeted field descriptions
  • Reuse existing versions when the data hasn't changed
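The advice to prefer larger batches follows from simple arithmetic if each generation pays a fixed cost for the schema and prompt. The model and the numbers below are purely hypothetical and are not Data Studio's actual token accounting; they only illustrate the shape of the trade-off.

```python
def estimated_tokens(
    generations: int,
    rows_per_generation: int,
    overhead_tokens: int = 800,   # hypothetical: schema + prompt sent once per generation
    tokens_per_row: int = 40,     # hypothetical: average cost of one generated row
) -> int:
    """Toy linear cost model: fixed overhead per generation plus a per-row cost."""
    return generations * (overhead_tokens + rows_per_generation * tokens_per_row)


many_small = estimated_tokens(generations=10, rows_per_generation=100)
one_large = estimated_tokens(generations=1, rows_per_generation=1_000)
print(many_small, one_large, many_small - one_large)
```

Under this model both runs produce 1,000 rows, but the ten small generations pay the fixed overhead ten times instead of once.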

Rate Limits

Tier | Generations per Hour | Max Rows per Generation
Free | 10 | 1,000
Pro | 100 | 10,000
Enterprise | Unlimited | Unlimited

When you reach your hourly generation limit, the Generate button is disabled and a banner shows when the limit resets. Limits reset hourly.
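If you script around these limits (for example, to plan a batch of generations), the table can be encoded directly. The numbers come from the table above; `TierLimits`, `TIERS`, and `can_generate` are hypothetical names, and the real enforcement happens inside Data Studio.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierLimits:
    generations_per_hour: Optional[int]  # None means unlimited
    max_rows: Optional[int]              # None means unlimited


TIERS = {
    "free": TierLimits(10, 1_000),
    "pro": TierLimits(100, 10_000),
    "enterprise": TierLimits(None, None),
}


def can_generate(tier: str, used_this_hour: int, row_count: int) -> bool:
    """Check a planned generation against the tier's hourly and per-generation limits."""
    limits = TIERS[tier]
    if limits.generations_per_hour is not None and used_this_hour >= limits.generations_per_hour:
        return False
    if limits.max_rows is not None and row_count > limits.max_rows:
        return False
    return True


print(can_generate("free", used_this_hour=9, row_count=1_000))
print(can_generate("free", used_this_hour=10, row_count=100))
```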

Troubleshooting

Generated data is too generic

Add more specific prompts and field descriptions. A field named status with the description "Order fulfillment status: pending, processing, shipped, delivered, cancelled" produces far better results than status alone.
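A quick way to find fields that need attention is to list those without a description. The field representation below (dictionaries with `name`, `type`, and `description` keys) is an assumption made for this example and may not match how Data Studio stores schemas.

```python
def fields_missing_descriptions(fields: list) -> list:
    """Return the names of fields whose description is absent or blank."""
    return [f["name"] for f in fields if not f.get("description", "").strip()]


schema_fields = [
    {"name": "order_id", "type": "string", "description": "Unique order identifier"},
    # Name only: the AI has to guess what "status" means.
    {"name": "status", "type": "string"},
]
print(fields_missing_descriptions(schema_fields))

# The same field with the description from the example above.
schema_fields[1]["description"] = (
    "Order fulfillment status: pending, processing, shipped, delivered, cancelled"
)
print(fields_missing_descriptions(schema_fields))
```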

Data doesn't match the schema

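Check for type mismatches in your schema, and make sure each field description is consistent with the selected type. For example, a numeric field whose description lists text values gives the AI conflicting instructions.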

Large generation seems stuck

Generations of 10,000+ rows can take several minutes. The progress bar updates periodically. You can navigate away and return — the generation continues in the background.

Next Steps

  • Scenarios — create test case variants with different prompts
  • Sharing — share generated datasets via public links
  • Datasets — manage dataset metadata and versions