Data Studio Service Architecture
The Data Studio Service is a NestJS microservice that provides AI-powered dataset management, schema versioning, test scenario generation, and data binding for simulations.
Module Structure
Key Responsibilities
- Dataset Management: Create, read, update, delete datasets with metadata
- Schema Definitions: Define field types, validation rules, and constraints
- Versioning: Track schema changes over time with diff/comparison
- AI Generation: Generate realistic test data using AI with context awareness
- Scenario Templates: Predefined test scenarios (signup, checkout, error states)
- Simulation Bindings: Bind datasets to simulation routes for dynamic responses
- Analytics: Track dataset usage, field distributions, generation patterns
- Chat: AI-powered guidance via SSE streaming for data generation strategies
- Sharing: Team and public sharing with expiration and access controls
- Locale Support: Generate locale-specific data (addresses, phone numbers, dates)
- Import/Export: Support for CSV, JSON, and YAML formats, plus NDJSON export
Guards & Middleware
| Guard | Purpose |
|---|---|
| JwtAuthGuard | Validates JWT signature and expiry (applied globally) |
| TenantGuard | Extracts and validates tenant context (applied globally) |
| NoopAuthGuard (sharing) | Public share endpoints bypass authentication |
AI Integration
Generation Module
The Generation Module uses the AI Service to create realistic test data:
POST /data-studio/generation/generate
{
"schemaId": "schema_abc123",
"count": 100,
"context": "E-commerce users with purchase history",
"locale": "en-US"
}
Flow:
- Validate schema exists and is tenant-accessible
- Send schema + context to AI Service
- AI Service generates data matching field types and constraints
- Validate generated data against schema
- Return batch of generated records
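Step 4 of the flow above can be sketched as a simple schema check. The `Field` and `Row` shapes here are illustrative, not the service's actual types:

```typescript
// Validate AI-generated records against a schema's field definitions.
interface Field {
  name: string;
  type: 'string' | 'number' | 'boolean';
  required: boolean;
}

type Row = { [key: string]: unknown };

function validateRecords(fields: Field[], records: Row[]): string[] {
  const errors: string[] = [];
  records.forEach((rec, i) => {
    for (const f of fields) {
      const value = rec[f.name];
      if (value === undefined || value === null) {
        // Missing values only fail validation for required fields.
        if (f.required) errors.push(`record ${i}: missing required field "${f.name}"`);
        continue;
      }
      if (typeof value !== f.type) {
        errors.push(`record ${i}: field "${f.name}" expected ${f.type}, got ${typeof value}`);
      }
    }
  });
  return errors; // empty array means the batch is valid
}
```

Records that fail this check would be rejected or regenerated before the batch is returned.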
Chat Module (SSE Streaming)
The Chat Module provides real-time AI guidance via Server-Sent Events:
GET /data-studio/chat/stream?message=How+to+generate+realistic+addresses
Flow:
- Establish SSE connection
- Stream AI responses as they generate
- Send `event: message` with partial content
- Send `event: done` on completion
- Close stream
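The wire format for the flow above follows the standard SSE framing. A minimal sketch (the helper names are illustrative):

```typescript
// An SSE frame is "event: <name>" plus a "data:" line, terminated by a blank line.
function sseFrame(event: 'message' | 'done', data: string): string {
  return `event: ${event}\ndata: ${data}\n\n`;
}

// Emit each AI chunk as a message frame, then a final done frame.
function streamChunks(chunks: string[]): string {
  const frames = chunks.map((c) => sseFrame('message', c));
  frames.push(sseFrame('done', ''));
  return frames.join('');
}
```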
Rate Limiting
AI features are rate limited by tier:
| Tier | Generation Requests | Chat Messages | Window |
|---|---|---|---|
| Free | 10 | 20 | 1 hour |
| Pro | 100 | 100 | 1 hour |
| Enterprise | Unlimited | Unlimited | - |
Rate limit enforcement uses Redis with sliding window counters:
key: `ratelimit:generation:${tenantId}`
ttl: 3600 (1 hour)
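The sliding-window scheme can be illustrated with an in-memory store standing in for Redis. The key format and 3600-second window follow the description above; the class itself is a sketch, not the service's actual implementation:

```typescript
// Sliding-window rate limiter: keep per-key request timestamps,
// drop those older than the window, and deny once the limit is hit.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>(); // key -> request timestamps (ms)

  constructor(private limit: number, private windowMs: number) {}

  allow(tenantId: string, now: number = Date.now()): boolean {
    const key = `ratelimit:generation:${tenantId}`;
    const windowStart = now - this.windowMs;
    // Keep only timestamps still inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

In Redis the same effect is typically achieved with a sorted set of timestamps trimmed by `ZREMRANGEBYSCORE`, with the key expiring after the window.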
Schema Versioning
Version Creation
When a schema is updated, a version record is created:
POST /data-studio/versions
{
"schemaId": "schema_abc123",
"changes": [
{"type": "field_added", "field": "phone_verified", "fieldType": "boolean"}
],
"description": "Added phone verification field"
}
Storage:
- Version number (auto-incremented)
- Snapshot of full schema at that version
- Change list (diff from previous version)
- Created timestamp and user
Version Comparison
Compare two schema versions:
GET /data-studio/versions/compare?from=v1&to=v3
Response:
{
"added": ["phone_verified", "last_login"],
"removed": ["legacy_id"],
"modified": [
{"field": "email", "change": "required: false → true"}
]
}
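The comparison above reduces to a diff over two schema snapshots. A sketch, assuming each snapshot is a flat map of field name to definition (the service's real snapshot shape may differ):

```typescript
type FieldDef = { type: string; required: boolean };
type Snapshot = Record<string, FieldDef>;

// Diff two schema versions into added / removed / modified field lists.
function compareVersions(from: Snapshot, to: Snapshot) {
  const added = Object.keys(to).filter((f) => !(f in from));
  const removed = Object.keys(from).filter((f) => !(f in to));
  const modified = Object.keys(from)
    .filter((f) => f in to && JSON.stringify(from[f]) !== JSON.stringify(to[f]))
    .map((f) => ({ field: f, before: from[f], after: to[f] }));
  return { added, removed, modified };
}
```

The API response would then render each modified entry as a human-readable change string such as `required: false → true`.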
Dataset Bindings
Binding to Simulations
Bind a dataset to a simulation route for dynamic mock responses:
POST /data-studio/bindings
{
"datasetId": "ds_xyz789",
"simulationId": "sim_abc123",
"routePath": "/api/users",
"responseTemplate": "{{dataset.users | sample}}"
}
Flow:
- Validate dataset and simulation exist
- Check for existing binding on that route
- Create binding record
- Notify Mock Engine Service of new binding
- Mock Engine resolves `{{dataset.users | sample}}` on request
Template Functions:
- `sample` - Random record from dataset
- `sample(n)` - N random records
- `first` - First record
- `last` - Last record
- `where(field, value)` - Filter records
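The template functions can be sketched as operations over an in-memory record array. Template parsing and the Mock Engine integration are omitted; the `templateFns` name is illustrative:

```typescript
type Row = { [key: string]: unknown };

const templateFns = {
  // sample / sample(n): pick n distinct random records.
  sample: (rows: Row[], n = 1): Row[] => {
    const pool = [...rows];
    const picked: Row[] = [];
    while (picked.length < n && pool.length > 0) {
      picked.push(pool.splice(Math.floor(Math.random() * pool.length), 1)[0]);
    }
    return picked;
  },
  first: (rows: Row[]): Row[] => rows.slice(0, 1),
  last: (rows: Row[]): Row[] => rows.slice(-1),
  // where(field, value): keep records whose field strictly equals value.
  where: (rows: Row[], field: string, value: unknown): Row[] =>
    rows.filter((r) => r[field] === value),
};
```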
Sharing
Public Sharing
Share datasets without authentication:
POST /data-studio/sharing
{
"datasetId": "ds_xyz789",
"visibility": "public",
"expiresAt": "2026-12-31T23:59:59Z"
}
Response:
{
"shareToken": "sh_abc123def456",
"publicUrl": "https://api.surestage.com/data-studio/sharing/public/sh_abc123def456"
}
Public Endpoint:
GET /data-studio/sharing/public/:shareToken
This endpoint uses NoopAuthGuard to bypass authentication. Access is controlled by:
- Valid share token
- Expiration timestamp
- Revocation status
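The three access controls above combine into a single check on each public request. A sketch with an illustrative `Share` shape:

```typescript
interface Share {
  token: string;
  expiresAt: string | null; // ISO timestamp; null = never expires
  revoked: boolean;
}

// A share is served only if the token matches, it has not been revoked,
// and its expiration timestamp (if any) is still in the future.
function canAccess(share: Share | undefined, token: string, now: Date = new Date()): boolean {
  if (!share || share.token !== token) return false;
  if (share.revoked) return false;
  if (share.expiresAt && new Date(share.expiresAt) <= now) return false;
  return true;
}
```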
Team Sharing
Share within tenant with role-based visibility:
POST /data-studio/sharing
{
"datasetId": "ds_xyz789",
"visibility": "team",
"roles": ["developer", "qa"]
}
Import/Export
Export Formats
GET /data-studio/import-export/export/:datasetId?format=json
Supported formats:
- `json` - JSON array
- `csv` - CSV with headers
- `yaml` - YAML document
- `ndjson` - Newline-delimited JSON (for large datasets)
Import Flow
POST /data-studio/import-export/import
{
"name": "Imported Dataset",
"format": "csv",
"data": "base64_encoded_file"
}
Flow:
- Decode uploaded file
- Parse format (CSV, JSON, YAML)
- Infer schema from data
- Validate records
- Create dataset and schema
- Insert records
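Step 3 of the flow, schema inference, can be sketched over parsed records. The "required if present in every record" heuristic and the field shape are assumptions for illustration:

```typescript
type Row = { [key: string]: unknown };
interface InferredField { name: string; type: string; required: boolean }

// Infer a flat field schema from a batch of parsed records.
function inferSchema(rows: Row[]): InferredField[] {
  const names = new Set<string>();
  rows.forEach((r) => Object.keys(r).forEach((k) => names.add(k)));
  return [...names].map((name) => {
    const values = rows.map((r) => r[name]).filter((v) => v !== undefined && v !== null);
    const types = new Set(values.map((v) => typeof v));
    return {
      name,
      // Mixed-type columns fall back to "string"; a single type is used as-is.
      type: types.size === 1 ? [...types][0] : 'string',
      // Treat a field as required when every record provides a value.
      required: values.length === rows.length,
    };
  });
}
```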
Analytics Module
Dataset Metrics
GET /data-studio/analytics/dataset/:datasetId
Metrics:
- Total records
- Schema versions
- Binding count (how many simulations use it)
- Generation count (how many times used for AI generation)
- Last accessed timestamp
- Field distribution (value frequency analysis)
Field Distribution
For enum-like fields, track value distributions:
{
"field": "subscription_tier",
"distribution": {
"free": 0.45,
"pro": 0.35,
"enterprise": 0.20
}
}
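The distribution metric is a value-frequency count normalized by the record total. A minimal sketch (the function name is illustrative):

```typescript
// Compute value frequencies for an enum-like field as fractions of all records.
function fieldDistribution(
  rows: { [key: string]: unknown }[],
  field: string,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of rows) {
    const v = String(r[field]);
    counts[v] = (counts[v] ?? 0) + 1;
  }
  const dist: Record<string, number> = {};
  for (const [v, c] of Object.entries(counts)) dist[v] = c / rows.length;
  return dist;
}
```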
Data Access Module
The Data Access Module provides a shared database layer for all modules:
import { Injectable } from '@nestjs/common';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository } from 'typeorm';

@Injectable()
export class DataAccessService {
  constructor(
    @InjectRepository(Dataset) private readonly datasetRepo: Repository<Dataset>,
    @InjectRepository(Schema) private readonly schemaRepo: Repository<Schema>,
    @InjectRepository(Version) private readonly versionRepo: Repository<Version>,
    @InjectRepository(Binding) private readonly bindingRepo: Repository<Binding>,
  ) {}
}
Benefits:
- Centralized query logic
- Tenant isolation enforcement
- Shared transaction management
- Consistent error handling
Security Considerations
Tenant Isolation
All queries include tenant filter:
WHERE dataset.tenant_id = :tenantId
Public Share Tokens
Share tokens are:
- Cryptographically random (32 bytes)
- Indexed for fast lookup
- Revocable
- Expirable
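Generating a token with the properties above reduces to a few lines. The `sh_` prefix and hex encoding are assumptions based on the example tokens earlier in this document:

```typescript
import { randomBytes } from 'crypto';

// 32 bytes of cryptographic randomness, hex-encoded with a recognizable prefix.
function newShareToken(): string {
  return `sh_${randomBytes(32).toString('hex')}`;
}
```

Revocation and expiration are then enforced at lookup time, not encoded in the token itself.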
AI Context
User-provided context in AI generation requests is sanitized:
- Strip SQL injection patterns
- Remove script tags
- Limit length to 1000 characters
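The three rules above can be illustrated as a small sanitizer. Real-world input sanitization needs more care than this; the patterns here only mirror the listed steps:

```typescript
// Sanitize user-provided AI context: strip script tags, strip crude
// SQL statement patterns, and enforce the 1000-character cap.
function sanitizeContext(input: string): string {
  return input
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/\b(DROP|DELETE|INSERT|UPDATE)\s+(TABLE|FROM|INTO)\b/gi, '')
    .slice(0, 1000);
}
```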
Performance
Caching
- Dataset metadata: 5 minutes (Redis)
- Schema definitions: 10 minutes (Redis)
- Public share lookups: 1 hour (Redis)
Database Indexes
CREATE INDEX idx_datasets_tenant_id ON datasets(tenant_id);
CREATE INDEX idx_bindings_simulation_id ON bindings(simulation_id);
CREATE INDEX idx_sharing_token ON sharing(token);
CREATE INDEX idx_versions_schema_id ON versions(schema_id);
Pagination
Large datasets use cursor-based pagination:
GET /data-studio/datasets?cursor=abc123&limit=50
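The cursor in the request above is opaque to clients. One common scheme, sketched here under the assumption that the cursor encodes the last-seen record id:

```typescript
// Encode the last-seen id as a URL-safe opaque cursor.
function encodeCursor(lastId: string): string {
  return Buffer.from(lastId, 'utf8').toString('base64url');
}

// Decode it server-side to resume the query, e.g.
// WHERE id > :lastId ORDER BY id LIMIT :limit
function decodeCursor(cursor: string): string {
  return Buffer.from(cursor, 'base64url').toString('utf8');
}
```

Unlike offset pagination, this stays stable when records are inserted or deleted between pages.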
Related
- Data Studio Service API - API reference
- Data Studio Overview - User guide
- AI Service Architecture - AI integration details
- Mock Engine Service - Dataset binding integration