Data Studio Service Architecture

The Data Studio Service is a NestJS microservice that provides AI-powered dataset management, schema versioning, test scenario generation, and data binding for simulations.

Module Structure

Key Responsibilities

  • Dataset Management: Create, read, update, delete datasets with metadata
  • Schema Definitions: Define field types, validation rules, and constraints
  • Versioning: Track schema changes over time with diff/comparison
  • AI Generation: Generate realistic test data using AI with context awareness
  • Scenario Templates: Predefined test scenarios (signup, checkout, error states)
  • Simulation Bindings: Bind datasets to simulation routes for dynamic responses
  • Analytics: Track dataset usage, field distributions, generation patterns
  • Chat: AI-powered guidance via SSE streaming for data generation strategies
  • Sharing: Team and public sharing with expiration and access controls
  • Locale Support: Generate locale-specific data (addresses, phone numbers, dates)
  • Import/Export: Support for CSV, JSON, YAML formats

Guards & Middleware

| Guard | Purpose |
| --- | --- |
| JwtAuthGuard | Validates JWT signature and expiry (applied globally) |
| TenantGuard | Extracts and validates tenant context (applied globally) |
| NoopAuthGuard (sharing) | Public share endpoints bypass authentication |

AI Integration

Generation Module

The Generation Module uses the AI Service to create realistic test data:

POST /data-studio/generation/generate
{
  "schemaId": "schema_abc123",
  "count": 100,
  "context": "E-commerce users with purchase history",
  "locale": "en-US"
}

Flow:

  1. Validate schema exists and is tenant-accessible
  2. Send schema + context to AI Service
  3. AI Service generates data matching field types and constraints
  4. Validate generated data against schema
  5. Return batch of generated records
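Step 4 of the flow can be sketched as a pure validation pass. This is a minimal sketch, assuming a simplified field definition with only a name, a primitive type, and a required flag; the real Schema entity presumably carries richer constraints.

```typescript
// Hypothetical simplified field definition (assumed shape, not the real entity).
type FieldDef = { name: string; type: "string" | "number" | "boolean"; required?: boolean };

// Validate one generated record against the schema's field definitions,
// returning a list of human-readable violations (empty means valid).
function validateRecord(fields: FieldDef[], record: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const f of fields) {
    const value = record[f.name];
    if (value === undefined || value === null) {
      if (f.required) errors.push(`${f.name}: missing required field`);
      continue;
    }
    if (typeof value !== f.type) {
      errors.push(`${f.name}: expected ${f.type}, got ${typeof value}`);
    }
  }
  return errors;
}
```

Records that fail this check would be dropped or regenerated before the batch is returned.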

Chat Module (SSE Streaming)

The Chat Module provides real-time AI guidance via Server-Sent Events:

GET /data-studio/chat/stream?message=How+to+generate+realistic+addresses

Flow:

  1. Establish SSE connection
  2. Stream AI responses as they generate
  3. Send event: message with partial content
  4. Send event: done on completion
  5. Close stream
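The events in steps 3–4 follow the standard SSE wire format. A minimal sketch of the framing, assuming the event names `message` and `done` from the flow above (the actual NestJS controller would emit these via an `@Sse()` handler):

```typescript
// Build one SSE frame: an event name line, a data line, and a blank-line terminator.
function sseFrame(event: "message" | "done", data: string): string {
  return `event: ${event}\ndata: ${data}\n\n`;
}

// Turn a sequence of partial AI chunks into the full stream body:
// one `message` frame per chunk, then a terminating `done` frame.
function streamToFrames(chunks: string[]): string {
  return chunks.map((c) => sseFrame("message", c)).join("") + sseFrame("done", "");
}
```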

Rate Limiting

AI features are rate limited by tier:

| Tier | Generation Requests | Chat Messages | Window |
| --- | --- | --- | --- |
| Free | 10 | 20 | 1 hour |
| Pro | 100 | 100 | 1 hour |
| Enterprise | Unlimited | Unlimited | - |

Rate limit enforcement uses Redis with sliding window counters:

key: `ratelimit:generation:${tenantId}`
ttl: 3600 (1 hour)
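The counting logic behind the Redis keys can be sketched in memory. This is a simplified sliding-window-log sketch, not the production Redis implementation; it keeps per-tenant hit timestamps and discards those older than the window.

```typescript
// In-memory sliding-window limiter sketch; the service stores the same state
// in Redis under `ratelimit:generation:${tenantId}` with a 1-hour TTL.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  allow(tenantId: string, now: number = Date.now()): boolean {
    const windowStart = now - this.windowMs;
    // Keep only hits inside the current window.
    const recent = (this.hits.get(tenantId) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(tenantId, recent);
      return false; // over the tier's limit for this window
    }
    recent.push(now);
    this.hits.set(tenantId, recent);
    return true;
  }
}
```

Unlike a fixed-window counter, old hits age out continuously, so a burst at the end of one hour cannot be immediately followed by another full burst.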

Schema Versioning

Version Creation

When a schema is updated, a version record is created:

POST /data-studio/versions
{
  "schemaId": "schema_abc123",
  "changes": [
    {"type": "field_added", "field": "phone_verified", "fieldType": "boolean"}
  ],
  "description": "Added phone verification field"
}

Storage:

  • Version number (auto-incremented)
  • Snapshot of full schema at that version
  • Change list (diff from previous version)
  • Created timestamp and user

Version Comparison

Compare two schema versions:

GET /data-studio/versions/compare?from=v1&to=v3

Response:

{
  "added": ["phone_verified", "last_login"],
  "removed": ["legacy_id"],
  "modified": [
    {"field": "email", "change": "required: false → true"}
  ]
}
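Since each version stores a full schema snapshot, the comparison can be computed as a set diff over field names. A sketch, assuming each snapshot is a map of field name to a definition object:

```typescript
// Assumed snapshot shape: field name → definition (not the real entity).
type Fields = Record<string, { type: string; required: boolean }>;

// Diff two snapshots into the added/removed/modified shape shown above.
function compareVersions(from: Fields, to: Fields) {
  const added = Object.keys(to).filter((k) => !(k in from));
  const removed = Object.keys(from).filter((k) => !(k in to));
  const modified = Object.keys(to)
    .filter((k) => k in from && JSON.stringify(from[k]) !== JSON.stringify(to[k]))
    .map((k) => ({ field: k, change: `${JSON.stringify(from[k])} → ${JSON.stringify(to[k])}` }));
  return { added, removed, modified };
}
```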

Dataset Bindings

Binding to Simulations

Bind a dataset to a simulation route for dynamic mock responses:

POST /data-studio/bindings
{
  "datasetId": "ds_xyz789",
  "simulationId": "sim_abc123",
  "routePath": "/api/users",
  "responseTemplate": "{{dataset.users | sample}}"
}

Flow:

  1. Validate dataset and simulation exist
  2. Check for existing binding on that route
  3. Create binding record
  4. Notify Mock Engine Service of new binding
  5. Mock Engine resolves {{dataset.users | sample}} on request

Template Functions:

  • sample - Random record from dataset
  • sample(n) - N random records
  • first - First record
  • last - Last record
  • where(field, value) - Filter records
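The template functions above can be sketched as plain operations over an in-memory record array. This is a behavioural sketch only; the real resolution happens inside the Mock Engine's template evaluator.

```typescript
type Rec = Record<string, unknown>;

// Sketch of the listed template functions over a dataset's records.
const templateFns = {
  // sample / sample(n): draw n distinct random records (n defaults to 1).
  sample: (recs: Rec[], n = 1): Rec[] => {
    const pool = [...recs];
    const out: Rec[] = [];
    while (out.length < n && pool.length > 0) {
      out.push(pool.splice(Math.floor(Math.random() * pool.length), 1)[0]);
    }
    return out;
  },
  first: (recs: Rec[]): Rec | undefined => recs[0],
  last: (recs: Rec[]): Rec | undefined => recs[recs.length - 1],
  where: (recs: Rec[], field: string, value: unknown): Rec[] =>
    recs.filter((r) => r[field] === value),
};
```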

Sharing

Public Sharing

Share datasets without authentication:

POST /data-studio/sharing
{
  "datasetId": "ds_xyz789",
  "visibility": "public",
  "expiresAt": "2026-12-31T23:59:59Z"
}

Response:

{
  "shareToken": "sh_abc123def456",
  "publicUrl": "https://api.surestage.com/data-studio/sharing/public/sh_abc123def456"
}

Public Endpoint:

GET /data-studio/sharing/public/:shareToken

This endpoint uses NoopAuthGuard to bypass authentication. Access is controlled by:

  • Valid share token
  • Expiration timestamp
  • Revocation status
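Those three checks can be sketched as a single guard function, assuming a share record shaped like the fields implied above (token, optional expiry, revocation flag):

```typescript
// Assumed share record shape for this sketch.
interface ShareRecord {
  token: string;
  expiresAt: Date | null; // null = never expires
  revoked: boolean;
}

// Returns true only when the token resolved to a live, unexpired, unrevoked share.
function canAccessShare(share: ShareRecord | undefined, now: Date = new Date()): boolean {
  if (!share) return false;                                    // unknown token
  if (share.revoked) return false;                             // explicitly revoked
  if (share.expiresAt && share.expiresAt <= now) return false; // past expiry
  return true;
}
```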

Team Sharing

Share within tenant with role-based visibility:

POST /data-studio/sharing
{
  "datasetId": "ds_xyz789",
  "visibility": "team",
  "roles": ["developer", "qa"]
}

Import/Export

Export Formats

GET /data-studio/import-export/export/:datasetId?format=json

Supported formats:

  • json - JSON array
  • csv - CSV with headers
  • yaml - YAML document
  • ndjson - Newline-delimited JSON (for large datasets)

Import Flow

POST /data-studio/import-export/import
{
  "name": "Imported Dataset",
  "format": "csv",
  "data": "base64_encoded_file"
}

Flow:

  1. Decode uploaded file
  2. Parse format (CSV, JSON, YAML)
  3. Infer schema from data
  4. Validate records
  5. Create dataset and schema
  6. Insert records
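Step 3 (schema inference) can be sketched with the simplest plausible rule: derive each field's type from the first record that carries it, and mark it required only if every record has it. The real inference is presumably more sophisticated (e.g. detecting dates or enums).

```typescript
type Inferred = { name: string; type: string; required: boolean };

// Naive schema inference over parsed import records (sketch only).
function inferSchema(records: Record<string, unknown>[]): Inferred[] {
  const names = new Set<string>();
  records.forEach((r) => Object.keys(r).forEach((k) => names.add(k)));
  return [...names].map((name) => {
    // Type comes from the first record that actually has a value for this field.
    const sample = records.find((r) => r[name] !== undefined);
    return {
      name,
      type: typeof (sample ? sample[name] : undefined),
      required: records.every((r) => r[name] !== undefined),
    };
  });
}
```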

Analytics Module

Dataset Metrics

GET /data-studio/analytics/dataset/:datasetId

Metrics:

  • Total records
  • Schema versions
  • Binding count (how many simulations use it)
  • Generation count (how many times used for AI generation)
  • Last accessed timestamp
  • Field distribution (value frequency analysis)

Field Distribution

For enum-like fields, track value distributions:

{
  "field": "subscription_tier",
  "distribution": {
    "free": 0.45,
    "pro": 0.35,
    "enterprise": 0.20
  }
}
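The frequency analysis behind that payload is a straightforward count-then-normalize pass, sketched here over in-memory records:

```typescript
// Compute the relative frequency of each value of `field` across the records.
function fieldDistribution(
  records: Record<string, unknown>[],
  field: string
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    const key = String(r[field]);
    counts[key] = (counts[key] ?? 0) + 1;
  }
  const total = records.length;
  const dist: Record<string, number> = {};
  for (const [k, c] of Object.entries(counts)) dist[k] = c / total;
  return dist;
}
```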

Data Access Module

The Data Access Module provides a shared database layer for all modules:

import { Injectable } from '@nestjs/common';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository } from 'typeorm';

@Injectable()
export class DataAccessService {
  constructor(
    @InjectRepository(Dataset) private readonly datasetRepo: Repository<Dataset>,
    @InjectRepository(Schema) private readonly schemaRepo: Repository<Schema>,
    @InjectRepository(Version) private readonly versionRepo: Repository<Version>,
    @InjectRepository(Binding) private readonly bindingRepo: Repository<Binding>
  ) {}
}

Benefits:

  • Centralized query logic
  • Tenant isolation enforcement
  • Shared transaction management
  • Consistent error handling

Security Considerations

Tenant Isolation

All queries include tenant filter:

WHERE dataset.tenant_id = :tenantId

Public Share Tokens

Share tokens are:

  • Cryptographically random (32 bytes)
  • Indexed for fast lookup
  • Revocable
  • Expirable
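Token generation matching those properties is a one-liner over Node's crypto module. The `sh_` prefix here is an assumption taken from the example response earlier:

```typescript
import { randomBytes } from "node:crypto";

// 32 cryptographically random bytes, hex-encoded, with a recognisable prefix.
function generateShareToken(): string {
  return `sh_${randomBytes(32).toString("hex")}`;
}
```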

AI Context

User-provided context in AI generation requests is sanitized:

  • Strip SQL injection patterns
  • Remove script tags
  • Limit length to 1000 characters
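A sketch of those three steps, assuming simple pattern stripping; a production implementation would lean on a vetted sanitizer library rather than ad-hoc regexes like these:

```typescript
// Crude sanitization sketch: strip script tags, strip obvious SQL verbs,
// then enforce the 1000-character cap.
function sanitizeContext(input: string): string {
  return input
    .replace(/<script[\s\S]*?<\/script>/gi, "") // remove script tags
    .replace(/\b(DROP|DELETE|INSERT|UPDATE)\b\s+(TABLE|FROM|INTO)\b/gi, "") // crude SQL pattern strip
    .slice(0, 1000); // enforce length limit
}
```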

Performance

Caching

  • Dataset metadata: 5 minutes (Redis)
  • Schema definitions: 10 minutes (Redis)
  • Public share lookups: 1 hour (Redis)

Database Indexes

CREATE INDEX idx_datasets_tenant_id ON datasets(tenant_id);
CREATE INDEX idx_bindings_instance_id ON bindings(instance_id);
CREATE INDEX idx_sharing_token ON sharing(token);
CREATE INDEX idx_versions_schema_id ON versions(schema_id);

Pagination

Large datasets use cursor-based pagination:

GET /data-studio/datasets?cursor=abc123&limit=50
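A sketch of how such a cursor can work, assuming the cursor is simply the last-seen record id, base64url-encoded so it stays opaque to clients:

```typescript
// Encode/decode an opaque cursor from the last-seen id (assumed scheme).
function encodeCursor(lastId: string): string {
  return Buffer.from(lastId, "utf8").toString("base64url");
}

function decodeCursor(cursor: string): string {
  return Buffer.from(cursor, "base64url").toString("utf8");
}

// Return the page of rows after the cursor, plus the cursor for the next page
// (null when the result set is exhausted).
function pageAfter<T extends { id: string }>(rows: T[], cursor: string | null, limit: number) {
  const start = cursor ? rows.findIndex((r) => r.id === decodeCursor(cursor)) + 1 : 0;
  const items = rows.slice(start, start + limit);
  const nextCursor = items.length === limit ? encodeCursor(items[items.length - 1].id) : null;
  return { items, nextCursor };
}
```

In the real service the "rows after the cursor" step would be a `WHERE id > :lastId ORDER BY id LIMIT :limit` query rather than an array scan, which is what makes cursor pagination stable under concurrent inserts, unlike offset pagination.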