# Performance Optimization

Guide to optimizing ShedBoxAI performance for data processing pipelines.
## Quick Performance Checklist

### Immediate Optimizations
- Filter early: Apply contextual filtering before other operations
- Use smaller datasets: Start with sample data for development
- Enable AI parallel processing: Set `parallel: true` for AI prompts when processing multiple records
- Monitor with verbose output: Use `--verbose` to identify bottlenecks
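Taken together, a minimal configuration applying this checklist might look like the following sketch. The source, dataset, and prompt names (`users`, `active_users`, `profile`) are placeholders, and each setting is covered in detail in the sections below.

```yaml
processing:
  # 1. Filter early: shrink the dataset before anything else touches it
  contextual_filtering:
    users:
      - field: status
        condition: "== 'active'"
        new_name: "active_users"

ai_interface:
  prompts:
    profile:
      user_template: "{{name}}: {{total_spent}}"
      for_each: active_users   # process each record individually
      parallel: true           # 2. parallel AI processing
      max_tokens: 150          # keep responses short
```

Run it with `shedboxai run config.yaml --verbose` to confirm where time is being spent.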
## Performance Profiling

### Built-in Profiling Tools
```bash
# Basic verbose output with timing information
shedboxai run config.yaml --verbose

# Introspection for understanding data size
shedboxai introspect config.yaml --include-samples
```
The `--verbose` flag provides detailed execution information, including:
- Data source connection times
- Processing operation durations
- Record counts at each stage
- Error messages and warnings
## Data Loading Optimization

### File Format Considerations

Performance ranking (fastest to slowest):
1. CSV - simple, widely supported, fast parsing
2. JSON - structured, but slower to parse
3. YAML - human-readable, but slowest to parse
```yaml
# CSV optimization
data_sources:
  optimized:
    type: csv
    path: data.csv
    options:
      encoding: utf-8
      delimiter: ","
```
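For comparison, JSON and YAML sources follow the same declaration pattern. The `type: json` and `type: yaml` values below are assumptions extrapolated from the CSV example above; verify them against the data sources documentation for your version.

```yaml
data_sources:
  structured:
    type: json    # assumed type value, mirroring the CSV declaration
    path: data.json
  readable:
    type: yaml    # assumed type value; slowest option per the ranking above
    path: data.yaml
```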
### Large Dataset Strategies

For large datasets, use early filtering to reduce processing load:
```yaml
processing:
  # Filter early to reduce dataset size
  contextual_filtering:
    recent_data:
      source: large_dataset
      conditions:
        - field: date
          condition: ">= '2024-01-01'"
      new_name: "filtered_data"
```
## Processing Optimization

### Operation Ordering

Optimal processing order:
1. Contextual filtering - reduce dataset size early
2. Relationship highlighting - join only necessary data
3. Format conversion - transform the final dataset
4. AI processing - process minimal, clean data
```yaml
processing:
  # ✅ Optimal: filter first (reduces dataset size)
  contextual_filtering:
    users:
      - field: status
        condition: "== 'active'"
        new_name: "active_users"

  # Then process the smaller dataset
  format_conversion:
    user_analysis:
      source: active_users   # smaller dataset
      extract_fields:
        - user_id
        - name
        - total_spent
```
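Putting all four stages together, a full pipeline in this order could look like the sketch below. The `relationship_highlighting` block is illustrative only: `link_fields`, `source`, and `to` are hypothetical key names, not the verified schema; consult the Operations Reference for the actual shape.

```yaml
processing:
  # 1. Filter early
  contextual_filtering:
    orders:
      - field: status
        condition: "== 'active'"
        new_name: "active_orders"

  # 2. Join only what is needed (keys here are hypothetical, not the verified schema)
  relationship_highlighting:
    order_customers:
      source: active_orders
      link_fields:
        - source: customer_id
          to: customers.id

  # 3. Shape the final dataset
  format_conversion:
    report_data:
      source: order_customers
      extract_fields:
        - customer_id
        - total

# 4. AI runs last, on minimal, clean data
ai_interface:
  prompts:
    summary:
      user_template: "{{report_data | first(10)}}"
      max_tokens: 150
```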
### Parallel AI Processing

For AI operations that process multiple records individually:
```yaml
ai_interface:
  prompts:
    individual_analysis:
      user_template: |
        Customer: {{name}}
        Spending: ${{total_spent}}
        Generate a brief customer profile.
      for_each: user_analysis   # Process each record individually
      parallel: true            # Use parallel processing for speed
      max_tokens: 150           # Limit response length
```
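As a rough illustration of the payoff (the numbers here are assumptions, not benchmarks): 200 records at ~1.5 seconds per API call take about five minutes sequentially, while ten concurrent requests bring the same batch down to roughly 30 seconds, subject to network latency and your provider's rate limits.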
## Memory Optimization

### Memory-Efficient Configuration
```yaml
processing:
  # Filter early to reduce memory usage
  contextual_filtering:
    small_dataset:
      source: large_source
      conditions:
        - field: relevant_field
          condition: "== 'target_value'"
      new_name: "filtered_data"

# Limit AI processing to essential data only
ai_interface:
  prompts:
    efficient_analysis:
      system: "Be concise. Provide brief analysis."
      user_template: "{{filtered_data | first(10)}}"  # Limit data size
      max_tokens: 100  # Limit response length
```
### Large Dataset Strategies

- Filter aggressively: Reduce data size as early as possible
- Use `extract_fields`: Only extract the fields you need (see the sketch after the sampling example below)
- Limit AI processing: Process only essential records
- Sample for development: Use small datasets for testing
```yaml
# Development configuration with sampling
data_sources:
  development:
    type: csv
    path: large_dataset.csv
    options:
      nrows: 1000  # Limit to first 1000 rows for testing
```
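Sampling pairs naturally with column pruning: a `format_conversion` step (shown earlier) can read from the sampled source and keep only the columns under test. The field names here are placeholders.

```yaml
processing:
  format_conversion:
    dev_slice:
      source: development   # the 1000-row sample above
      extract_fields:       # keep only the columns under analysis
        - user_id
        - total_spent
```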
## AI Processing Optimization

### Prompt Optimization
```yaml
ai_interface:
  prompts:
    optimized_analysis:
      # Minimize token usage
      system: "Analyze data efficiently. Be concise."
      user_template: |
        Data: {{customer_data}}
        Provide a 3-bullet analysis.
      # Response controls
      max_tokens: 150   # Limit response length
      temperature: 0.1  # Reduce creativity for consistency
```
### Learning vs. Production Mode

Use `store_only` mode for development to avoid API costs:
```yaml
ai_interface:
  prompt_storage:
    enabled: true
    store_only: true  # Learning mode - no API costs
    directory: "prompts"
  prompts:
    analysis:
      system: "Analyze customer data."
      user_template: "{{customer_data}}"
```
Switch to production mode when ready:
```yaml
ai_interface:
  model:
    type: rest
    url: https://api.openai.com/v1/chat/completions
    headers:
      Authorization: Bearer ${OPENAI_API_KEY}
      Content-Type: application/json
    options:
      model: gpt-3.5-turbo
  prompt_storage:
    store_only: false  # Production mode
```
## Output Optimization

### Output Format Selection
```yaml
output:
  type: file
  path: results.json
  format: json  # Most compatible format
```
## Performance Testing

### Development Best Practices
- Start small: Test with sample data (100-1000 records)
- Use verbose output: Monitor performance with `--verbose`
- Test incrementally: Add complexity gradually
- Profile regularly: Check performance after changes
```bash
# Test with sample data
shedboxai run config.yaml --verbose

# Test introspection performance
shedboxai introspect config.yaml --include-samples --verbose
```
## Performance Troubleshooting

### Common Performance Issues

#### Slow Data Loading
```bash
# Diagnose loading issues with verbose output
shedboxai run config.yaml --verbose

# Common causes:
# 1. Large file sizes
# 2. Network latency (API sources)
# 3. Complex data structures
```
Solutions:

- Filter data early to reduce processing load
- Use simpler data formats (CSV instead of JSON)
- Check network connectivity for API sources
#### High Memory Usage
```bash
# Monitor with verbose output
shedboxai run config.yaml --verbose

# Signs of high memory usage:
# - Slow processing
# - System becoming unresponsive
# - Out-of-memory errors
```
Solutions:

- Filter data early with contextual filtering
- Use `extract_fields` to select only needed columns
- Process smaller datasets during development
- Limit AI processing to essential records
#### Slow AI Processing

Solutions:

- Use `parallel: true` for multiple-record processing
- Limit `max_tokens` to reduce response time
- Use shorter, more focused prompts
- Consider using `store_only: true` for development
## Best Practices Summary

### Development Phase
- Start small: Test with 100-1000 records
- Use verbose output: Monitor performance with `--verbose`
- Filter early: Apply contextual filtering first
- Limit AI processing: Use `store_only: true` for development
### Production Phase

- Filter aggressively: Reduce data size as much as possible
- Use parallel AI processing: Enable `parallel: true` for multi-record AI operations
- Optimize prompts: Use concise prompts with token limits
- Monitor performance: Use `--verbose` to track execution times
### Scaling Guidelines
- Under 1K records: The default configuration should work well
- 1K-10K records: Apply early filtering and use parallel AI processing
- Over 10K records: Aggressive filtering is essential; consider processing in batches (see the sketch below)
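One way to sketch batching is to window a large CSV with row options. `nrows` appears earlier in this guide; `skiprows` is an assumed, pandas-style option that may not exist in your version, so verify it against the data sources documentation before relying on this pattern.

```yaml
# Hypothetical batch windows over one large file.
# nrows is shown earlier in this guide; skiprows is an assumed,
# pandas-style option - verify it before relying on it.
data_sources:
  batch_1:
    type: csv
    path: large_dataset.csv
    options:
      nrows: 5000      # rows 1-5000
  batch_2:
    type: csv
    path: large_dataset.csv
    options:
      skiprows: 5000   # assumed option: skip the first batch
      nrows: 5000      # rows 5001-10000
```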
## Related Documentation
- Common Issues - Troubleshooting errors and issues
- Debugging Guide - Advanced debugging techniques
- Operations Reference - Understanding all available operations