Introspection Overview
ShedBoxAI's introspection feature is designed to make writing configurations easier with AI assistants like Claude Code. It analyzes your data sources and generates LLM-optimized documentation that gives an AI assistant the context it needs to write ShedBoxAI configurations for you.
What is Introspection?
Introspection is an LLM-powered configuration assistant that:
- Deeply analyzes your data sources with statistical profiling and schema detection
- Generates AI-optimized documentation specifically formatted for LLMs like Claude Code
- Provides intelligent recommendations for processing operations based on your data patterns
- Makes config writing effortless - just paste the introspection report into Claude Code and ask for a configuration!
Key Features
Deep Data Analysis
- Advanced Schema Detection: Full column type inference with pattern recognition (emails, IDs, UUIDs, phone numbers)
- Statistical Profiling: Comprehensive stats including quartiles, distributions, and categorical analysis
- Intelligent Sampling: Stratified sampling for representative data examples
- Relationship Discovery: Automatic detection of primary/foreign key relationships
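As a rough illustration of what pattern recognition on a column can look like, here is a minimal Python sketch. It is not ShedBoxAI's implementation: the function names, the regular expressions, and the 90% match threshold are all assumptions chosen for the example.

```python
import re
import uuid

# Illustrative patterns only (assumptions); real validation is usually stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")


def is_uuid(value: str) -> bool:
    """Return True if the string parses as a UUID."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def infer_pattern(values: list[str], threshold: float = 0.9) -> str:
    """Label a column with the first pattern matched by at least
    `threshold` of its non-empty values (hypothetical helper)."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "empty"
    # Order matters: a column of plain digits is reported as "integer"
    # before the looser phone pattern gets a chance to match.
    checks = [
        ("integer", lambda v: re.fullmatch(r"[+-]?\d+", v) is not None),
        ("uuid", is_uuid),
        ("email", lambda v: EMAIL_RE.match(v) is not None),
        ("phone", lambda v: PHONE_RE.match(v) is not None),
    ]
    for label, check in checks:
        hits = sum(1 for v in non_empty if check(v))
        if hits / len(non_empty) >= threshold:
            return label
    return "string"
```

Calling `infer_pattern` on a list of email addresses returns "email", and a column of free-form names falls through to "string". The threshold leaves room for a few malformed values, which is one way a report could arrive at a figure like "98% valid".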
LLM Optimization
- Token Estimation: Calculates dataset size impact on LLM context windows
- Processing Recommendations: AI suggests optimal operations based on data characteristics
- Configuration Hints: Intelligent suggestions for filtering, grouping, and aggregation
- Error Prevention: Warns about potential issues before they occur
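To make token estimation concrete, the sketch below approximates a dataset's token count from its serialized length. It is an assumption-laden illustration, not how ShedBoxAI computes its estimates: the 4-characters-per-token ratio is a common rule of thumb for English text, and `context_limit` and the size bands are hypothetical.

```python
import json
import math

CHARS_PER_TOKEN = 4.0  # assumption: rule-of-thumb ratio, not a real tokenizer


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate the token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def context_window_note(records: list[dict], context_limit: int = 100_000) -> str:
    """Classify a dataset against an assumed context limit (hypothetical bands)."""
    tokens = estimate_tokens(json.dumps(records))
    share = tokens / context_limit
    if share < 0.25:
        return f"~{tokens} tokens: small, direct processing suitable"
    if share < 1.0:
        return f"~{tokens} tokens: large, consider filtering or sampling first"
    return f"~{tokens} tokens: exceeds the assumed context limit, reduce before sending"
```

For exact counts you would need the tokenizer of the specific model; a character-based estimate is only meant to flag datasets that are clearly too large.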
Supported Data Sources
- CSV Analysis: Advanced statistical profiling, categorical detection, identifier recognition
- JSON Analysis: Nested structure mapping, schema consistency validation, array analysis
- REST API Analysis: Response structure analysis, authentication testing, format validation
- YAML Analysis: Configuration validation, structure parsing, key enumeration
- Text Analysis: Content structure detection, encoding analysis, pattern recognition
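Nested structure mapping and schema consistency validation for JSON can be pictured with a short Python sketch. The path notation (`$`, `.key`, `[]`) and both function names are assumptions made for this example, not ShedBoxAI's internal format.

```python
import json
from collections import defaultdict


def map_structure(value, path="$", out=None):
    """Collect the set of Python type names seen at every path of a parsed
    JSON value. All elements of an array share one path ending in "[]"."""
    if out is None:
        out = defaultdict(set)
    out[path].add(type(value).__name__)
    if isinstance(value, dict):
        for key, child in value.items():
            map_structure(child, f"{path}.{key}", out)
    elif isinstance(value, list):
        for child in value:
            map_structure(child, f"{path}[]", out)
    return out


def inconsistent_paths(structure):
    """Paths where more than one type was observed (possible schema drift)."""
    return sorted(p for p, types in structure.items() if len(types) > 1)


doc = json.loads(
    '{"orders": [{"id": "A1", "amount": 10.5}, {"id": "A2", "amount": "n/a"}]}'
)
structure = map_structure(doc)
drift = inconsistent_paths(structure)  # paths whose values disagree on type
```

In this sample, `amount` is a number in one record and a string in the other, so its path is the one reported as inconsistent. Note that this simple version also treats `int` and `float` as different types.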
How It Works
1. Define Your Data Sources
You only need to define your data sources in a basic ShedBoxAI configuration - no processing operations required:
# config.yaml - Just define your data sources
data_sources:
  users:
    type: csv
    path: data/users.csv
  orders:
    type: rest
    url: https://api.example.com/orders
    headers:
      Authorization: Bearer ${API_TOKEN}
2. Run Introspection
Execute the introspection command to analyze your data sources:
# Generate LLM-optimized analysis report
shedboxai introspect config.yaml --include-samples -o analysis.md
3. Get LLM-Ready Documentation
Introspection produces a comprehensive markdown report that includes:
- Data Source Validation: Connection status and error detection
- Schema Analysis: Detailed structure and statistical profiling
- LLM Processing Notes: Token estimates and context window warnings
- Operation Recommendations: AI-suggested pipeline configurations
- Sample Data: Representative examples for AI context understanding
4. Error Detection and Validation
Introspection identifies common data source problems:
- File Issues: Missing files, permission problems, encoding errors
- API Problems: Authentication failures, network connectivity, rate limiting
- Configuration Errors: Missing environment variables, invalid formats
- Data Quality Issues: Schema inconsistencies, large dataset warnings
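Two of the checks listed above, missing files and missing environment variables, are easy to picture as a pre-flight step. The sketch below is a hypothetical illustration and not ShedBoxAI code; it assumes placeholders use the `${NAME}` form and that file paths are resolved relative to a base directory.

```python
import os
import re
from pathlib import Path

# Assumption: placeholders look like ${NAME}.
PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(config_text, env=None):
    """Return the sorted names of ${VAR} placeholders that have no value in `env`."""
    env = os.environ if env is None else env
    names = PLACEHOLDER_RE.findall(config_text)
    return sorted({name for name in names if name not in env})


def missing_files(paths, base_dir="."):
    """Return the paths that do not point at an existing regular file."""
    return [p for p in paths if not (Path(base_dir) / p).is_file()]


config_text = "Authorization: Bearer ${API_TOKEN}\nurl: ${BASE_URL}/orders"
unresolved = missing_env_vars(config_text, env={"BASE_URL": "https://api.example.com"})
```

Here `unresolved` contains only `API_TOKEN`, because `BASE_URL` (a name invented for the example) is present in the supplied environment mapping.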
Benefits
Data Source Validation
- Connection Testing: Verify all data sources are accessible before building pipelines
- Error Detection: Identify file, API, authentication, and configuration problems early
- Environment Variable Validation: Check for missing API keys and credentials
- Permission Verification: Ensure file access rights and API permissions are correct
LLM-Powered Configuration
- AI-Ready Reports: Generate documentation specifically formatted for Claude Code and other LLMs
- Effortless Config Writing: Simply paste the introspection report into an AI assistant for instant pipeline generation
- Intelligent Recommendations: Get AI-suggested operations based on your actual data patterns
- Token Optimization: Understand dataset size impact on LLM context windows
Pipeline Planning
- Schema Understanding: Deep analysis of data structure, types, and relationships
- Statistical Profiling: Comprehensive stats including distributions and data quality metrics
- Relationship Discovery: Automatic detection of foreign keys and joinable fields
- Performance Insights: Size warnings and optimization recommendations
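One common heuristic for discovering joinable fields is value containment: a column is a foreign key candidate if nearly all of its values appear in a unique column of another table. The Python sketch below shows that heuristic under stated assumptions; the function names, the input layout, and the 95% overlap threshold are hypothetical, and ShedBoxAI may use a different method.

```python
def is_unique(values):
    """True if no value repeats (a minimal primary key test)."""
    return len(values) == len(set(values))


def foreign_key_candidates(tables, min_overlap=0.95):
    """Find (child_column, parent_column, overlap) triples.

    `tables` maps table name -> {column name -> list of hashable values}.
    A parent column must be unique; a child column qualifies when at least
    `min_overlap` of its values occur in the parent column.
    """
    results = []
    for parent_table, parent_cols in tables.items():
        for parent_col, parent_vals in parent_cols.items():
            if not parent_vals or not is_unique(parent_vals):
                continue
            parent_set = set(parent_vals)
            for child_table, child_cols in tables.items():
                if child_table == parent_table:
                    continue
                for child_col, child_vals in child_cols.items():
                    if not child_vals:
                        continue
                    hits = sum(1 for v in child_vals if v in parent_set)
                    overlap = hits / len(child_vals)
                    if overlap >= min_overlap:
                        results.append(
                            (f"{child_table}.{child_col}",
                             f"{parent_table}.{parent_col}",
                             overlap)
                        )
    return results


tables = {
    "customers": {"customer_id": [1001, 1002, 1003], "name": ["A", "B", "C"]},
    "orders": {
        "order_id": ["O1", "O2", "O3", "O4"],
        "customer_id": [1001, 1001, 1003, 1002],
    },
}
candidates = foreign_key_candidates(tables)
```

With this sample data the only candidate is `orders.customer_id` pointing at `customers.customer_id`. Containment alone can produce false positives on small integer columns, so name similarity is often used as a second signal.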
LLM-Optimized Output
Introspection generates detailed reports specifically designed for AI consumption:
# Data Source Introspection for LLM
Generated: 2024-01-15T10:30:00
Sources Analyzed: 2/2
## LLM Processing Notes
- Context Window: All datasets are small-medium, direct processing suitable
- Key columns detected: customer_id, order_id - suitable for relationship operations
- Numeric columns available: amount, quantity - suitable for statistical operations
- Categorical columns available: status, category - suitable for grouping operations
## Data Sources
### customers.csv
- **Records**: 1,250 customers
- **Schema**: customer_id (integer, unique), name (string), email (string, 98% valid), status (categorical: active 75%, premium 15%, inactive 10%)
- **Sample Data**: [{"customer_id": 1001, "name": "John Smith", "email": "john@email.com", "status": "active"}]
- **LLM Recommendations**:
  - Use contextual_filtering to segment by status
  - Email column suitable for pattern validation
  - customer_id is primary key for relationships
### orders (REST API)
- **Endpoint**: https://api.example.com/orders
- **Schema**: order_id (string, structured_id pattern), customer_id (integer, foreign key), amount (float, range: $10-$5000)
- **Relationships**: orders.customer_id ↔ customers.customer_id (foreign key)
## Recommended ShedBoxAI Configuration
[Ready-to-use YAML structure with operation suggestions]
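The categorical breakdown shown for `status` and the numeric range shown for `amount` in the sample report are the kind of figures a simple profiler can produce. The following Python sketch uses only the standard library; the function names are hypothetical and it is not meant to reproduce ShedBoxAI's exact statistics.

```python
import statistics
from collections import Counter


def categorical_profile(values):
    """Percentage share of each category, most frequent first."""
    if not values:
        return {}
    total = len(values)
    return {
        label: round(100 * count / total, 1)
        for label, count in Counter(values).most_common()
    }


def numeric_profile(values):
    """Minimum, quartiles, and maximum of a numeric column (needs 2+ values)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


statuses = ["active"] * 15 + ["premium"] * 3 + ["inactive"] * 2
status_profile = categorical_profile(statuses)
# → {"active": 75.0, "premium": 15.0, "inactive": 10.0}

amount_profile = numeric_profile([10, 20, 30, 40, 50])
# → {"min": 10, "q1": 20.0, "median": 30.0, "q3": 40.0, "max": 50}
```

The "inclusive" quantile method treats the data as the full population; the default "exclusive" method would give slightly different quartiles on small samples.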
Use Cases
Data Source Validation
Quickly verify your configuration:
- Check if data sources are accessible
- Validate connection settings
- Preview data structure and format
- Identify basic connection issues
Pipeline Planning
Plan your processing operations:
- Understand basic data structure
- Get simple processing recommendations
- Identify data formats and types
- Plan filtering and transformation steps
Documentation
Create basic data documentation:
- Generate simple data source overviews
- Share configuration details with your team
- Document data source accessibility
- Maintain basic data inventories
Getting Started
Basic Usage
Analyze your ShedBoxAI configuration:
# Analyze data sources in your config
shedboxai introspect config.yaml
Include Sample Data
Add sample data to the report:
shedboxai introspect config.yaml --include-samples
Error Handling and Troubleshooting
Introspection provides comprehensive error detection with helpful recovery guidance:
# Skip errors and continue with partial results
shedboxai introspect config.yaml --skip-errors
# Retry specific failed sources after fixing issues
shedboxai introspect config.yaml --retry source1,source2
# Force overwrite existing output
shedboxai introspect config.yaml --force
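If you run introspection from a script or CI job, the commands above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of ShedBoxAI; it uses only the flags shown in this document, and whether all of them can be combined in one invocation is an assumption.

```python
import shlex


def build_introspect_command(
    config_path,
    output=None,
    include_samples=False,
    skip_errors=False,
    retry=None,
    force=False,
):
    """Build the argument list for `shedboxai introspect` (hypothetical helper)."""
    cmd = ["shedboxai", "introspect", config_path]
    if include_samples:
        cmd.append("--include-samples")
    if output:
        cmd += ["-o", output]
    if skip_errors:
        cmd.append("--skip-errors")
    if retry:
        # --retry takes a comma-separated list of source names
        cmd += ["--retry", ",".join(retry)]
    if force:
        cmd.append("--force")
    return cmd


cmd = build_introspect_command("config.yaml", retry=["source1", "source2"])
printable = shlex.join(cmd)  # shell-safe string for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.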
Common Issues Detected:
- Missing Files:
  File not found: /path/to/data.csv 💡 Check file path and permissions
- Authentication Failures:
  Invalid credentials 💡 Check API key in environment variables
- Missing Environment Variables:
  💡 Create a .env file with required values
- Large Datasets:
  💡 Use contextual_filtering to reduce data size for LLM processing
Integration with Development
Development Workflow
Introspection fits into your workflow:
- Setup Phase: Run introspection on new configurations
- Planning Phase: Use the report to plan processing operations
- Development Phase: Reference structure information while building
- Testing Phase: Re-run introspection to validate connections
- Documentation Phase: Include introspection results in project docs
AI Configuration Writing
Introspection enables effortless configuration creation:
- Copy-paste to Claude Code: "Create a ShedBoxAI config from this introspection report"
- Use the AI Assistant Guide: Download our AI Assistant Guide for comprehensive LLM instructions and configuration examples
- Intelligent operation suggestions: Based on detected data patterns and relationships
- Error prevention: Warns about common pitfalls before they happen
- Optimization recommendations: Suggests performance improvements and best practices
Best Practices
When to Use
- Run introspection when setting up new configurations
- Use before building complex processing pipelines
- Include in development workflow for validation
- Generate documentation for team sharing
Documentation Management
- Save introspection results with your project
- Update when data sources change
- Share reports with team members
- Include in project documentation
Security Considerations
- Use appropriate credentials for data access
- Be careful with sample data in sensitive environments
- Review generated reports for sensitive information
- Follow your organization's data policies
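When a report includes sample data, one way to act on these points is to mask obvious personal data before sharing it. The Python sketch below masks email addresses in report text; it is a hypothetical post-processing step, not a ShedBoxAI feature, and the regular expression is a simplification that will not catch every address format.

```python
import re

# Simplified email pattern (assumption): captures the first character of the
# local part and the full domain so the masked value stays recognizable.
EMAIL_RE = re.compile(
    r"\b([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b"
)


def mask_emails(text: str) -> str:
    """Replace the local part of each email address with its first character and ***."""
    return EMAIL_RE.sub(lambda m: f"{m.group(1)}***@{m.group(2)}", text)


line = '{"customer_id": 1001, "name": "John Smith", "email": "john@email.com"}'
masked = mask_emails(line)
# → '{"customer_id": 1001, "name": "John Smith", "email": "j***@email.com"}'
```

Masking like this only covers one field type; names, tokens, and identifiers still need a manual review before a report leaves a sensitive environment.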
Next Steps
- Command Reference - Complete command options
- Analyzers - Understanding different analyzer types
- Documentation Generation - Customizing output
- Getting Started - Building your first pipeline