Skip to main content

Data Merging with ShedBox Agent

Combine data from multiple sources using natural language—no SQL required.

Describe What to Merge

You: "Merge my customer list with Stripe payment data"

ShedBox Agent:
✓ Loaded customers.csv (2,456 rows)
✓ Connected to Stripe API (3,892 transactions)

Detected matching field: email
- Matched: 2,312 customers (94%)
- Unmatched customers: 144
- Unmatched transactions: 1,580

Merge preview:
| customer_id | name | email | stripe_customer_id | total_payments |
|-------------|------|-------|-------------------|----------------|
| C001 | Alice Smith | alice@... | cus_abc123 | $2,450 |
| C002 | Bob Jones | bob@... | cus_def456 | $890 |

Proceed with merge?

Merge Types

Inner Join

Only records that match in both sources:

"Merge customers with orders, only include customers who have orders"

Left Join

All records from first source, matching from second:

"Merge all customers with their orders, include customers without orders"

Full Outer Join

All records from both sources:

"Combine both lists and show all records, matched or not"

Union

Stack datasets vertically:

"Combine these three CSV files into one"

Matching Strategies

Exact Match

"Join on customer_id"
"Match where email addresses are the same"

Fuzzy Match

"Match company names even if spelled slightly differently"
"Join on name, allowing for small variations"

Multiple Keys

"Match on both email AND phone number"
"Join where either email or customer_id matches"

Example Workflow

You: "I need to combine data from three sources to get a complete customer view"

ShedBox Agent:
What sources do you want to combine?

You: "Salesforce contacts, Stripe customers, and our PostgreSQL users table"

ShedBox Agent:
✓ Connected to Salesforce (5,234 contacts)
✓ Connected to Stripe (4,892 customers)
✓ Connected to PostgreSQL users (6,123 records)

Schema Analysis:

| Source | Key Fields | Unique Records |
|--------|------------|----------------|
| Salesforce | Email, Id | 5,234 |
| Stripe | email, id | 4,892 |
| PostgreSQL | email, user_id | 6,123 |

Recommended merge strategy:
1. Primary key: email (exists in all sources)
2. Join type: Full outer (capture all records)
3. Conflict resolution: Latest timestamp wins

Preview merge?

You: "Yes"

ShedBox Agent:
✓ Merging datasets...

Merge Results:
- Total unique records: 7,456
- Matched across all 3: 3,892 (52%)
- Matched in 2 sources: 1,234 (17%)
- Unique to Salesforce: 1,108
- Unique to Stripe: 421
- Unique to PostgreSQL: 801

Sample merged record:
{
email: "alice@company.com",
salesforce_id: "003abc123",
stripe_id: "cus_xyz789",
user_id: 12345,
name: "Alice Smith", // from Salesforce
plan: "Enterprise", // from Stripe
signup_date: "2023-06-15", // from PostgreSQL
lifetime_value: 24500, // from Stripe
last_activity: "2024-01-10" // from PostgreSQL
}

You: "Great, save this as our master customer list"

ShedBox Agent:
✓ Saved to master_customers.json
✓ Created reusable pipeline: customer_360_merge.yaml

Generated Pipeline

Merge logic becomes reusable:

data_sources:
salesforce:
type: rest_api
url: https://your-instance.salesforce.com/services/data/v57.0/query
auth:
type: oauth2
params:
q: "SELECT Id, Email, Name, Company FROM Contact"

stripe:
type: rest_api
url: https://api.stripe.com/v1/customers
auth:
type: bearer
token_env: STRIPE_KEY
pagination:
type: cursor
param: starting_after

users:
type: postgresql
connection_env: DATABASE_URL
query: "SELECT user_id, email, signup_date, last_activity FROM users"

processing:
join:
- sources: [salesforce, stripe]
left_key: Email
right_key: email
type: full_outer
prefix: [sf_, stripe_]

- sources: [_result, users]
left_key: COALESCE(sf_Email, stripe_email)
right_key: email
type: full_outer
prefix: [_, db_]

transform:
- operation: coalesce
fields: [sf_Email, stripe_email, db_email]
output: email

- operation: add_field
name: source_count
expression: |
(CASE WHEN sf_Id IS NOT NULL THEN 1 ELSE 0 END) +
(CASE WHEN stripe_id IS NOT NULL THEN 1 ELSE 0 END) +
(CASE WHEN db_user_id IS NOT NULL THEN 1 ELSE 0 END)

output:
type: file
path: master_customers.json

Handling Conflicts

When the same field exists in multiple sources:

"Use the most recent value when there's a conflict"
"Prefer Salesforce data over Stripe for contact info"
"Flag records where sources disagree"

Get Started

Combine your data sources with natural language.

Try ShedBox Agent →