How Storytell processes spreadsheets

Our processing engine converts raw spreadsheet data into SQL databases that can handle millions of records and support complex analytical queries.

Written By Mark Ku

Last updated 4 months ago

Whenever you reference one or more spreadsheet-type files in your prompt, Storytell automatically kicks off its spreadsheet processing pipeline, which transforms your raw data into AI-readable insights.

Learn more about how to work with your data: How to manage scope

Core capabilities

Spreadsheet processing

Process datasets of up to 1 million rows

Real-world examples:

  • E-commerce: customer transaction records

  • Healthcare: Clinical trial data with patient interactions

  • Manufacturing: Quality control logs

  • Finance: Daily trading data with millions of transactions per file

Cross-tab analysis

Seamlessly work with multi-sheet Excel files as if they were a unified database.

Example scenario:

A retail chain's inventory system with:

  • Products tab: SKUs, descriptions, categories, pricing

  • Suppliers tab: Vendor information, lead times, contact details

  • Orders tab: Purchase history, quantities, delivery dates

  • Stores tab: Location data, capacity, regional groupings

An example prompt you could run against this data:

Which suppliers had the highest delivery delays for electronics products in Q3, and how did this impact store inventory levels in the Northwest region?

Storytell automatically generates complex JOINs across all tabs to provide comprehensive analysis.
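To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module (the tables, columns, and figures are invented for illustration, and this is not Storytell's actual engine): each tab becomes a table, and a cross-tab question becomes a JOIN.

```python
import sqlite3

# Each worksheet tab is loaded as its own table in one database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (sku TEXT PRIMARY KEY, category TEXT, price REAL);
    CREATE TABLE orders   (sku TEXT, quantity INTEGER, supplier TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [("A1", "electronics", 99.0), ("B2", "apparel", 25.0)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("A1", 3, "Acme"), ("A1", 2, "Acme"), ("B2", 5, "Globex")])

# A cross-tab question ("revenue by category") becomes a JOIN across tabs.
rows = conn.execute("""
    SELECT p.category, SUM(o.quantity * p.price) AS revenue
    FROM orders o JOIN products p ON o.sku = p.sku
    GROUP BY p.category
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('electronics', 495.0), ('apparel', 125.0)]
```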

How it works

Stage 1: File analysis and classification

Storytell employs AI-powered analysis to understand your data's structure and optimize it for intelligent querying.

Intelligent structure detection

Our processing engine uses large language models to perform comprehensive analysis of your spreadsheet files:

Deep content analysis

AI examines the file structure, identifying header patterns, data types, and organizational logic across all rows and columns

Example: A marketing report with headers like "Q1 Revenue", "Q2 Revenue" gets recognized as quarterly financial data, while "Customer_ID", "First_Name", "Last_Name" is identified as customer demographic information.

Metadata extraction

The system automatically identifies and preserves contextual information such as:

  • Temporal data (date ranges, reporting periods)

  • Geographic identifiers (locations, regions)

  • Source information and processing notes

  • Technical parameters and methodology details

Example: From a sales report header showing "Sales Data - West Region - Q3 2024 - Preliminary Results", the system extracts:

  • Geographic: "West Region"

  • Temporal: "Q3 2024"

  • Source: "Sales Data"

  • Status: "Preliminary Results"

Column intelligence

Advanced inference determines the purpose and content of each data column, generating clean, descriptive names suitable for querying

Example: A column labeled "Cust ID" becomes "customer_id", "Rev $" becomes "revenue_amount", and "# Orders" becomes "order_count" - making them much more queryable.
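A crude, purely rule-based version of this renaming can be sketched as follows (the abbreviation table is invented for illustration; the actual pipeline infers names semantically, which is how "Rev $" can become "revenue_amount" rather than a literal token expansion):

```python
import re

# Invented abbreviation table for illustration only; the real pipeline
# infers expansions semantically instead of looking them up.
ABBREVIATIONS = {"cust": "customer", "rev": "revenue", "#": "number", "qty": "quantity"}

def clean_column_name(raw: str) -> str:
    """Turn a messy spreadsheet header into a queryable snake_case name."""
    tokens = re.findall(r"#|[a-z0-9]+", raw.lower())
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(clean_column_name("Cust ID"))   # customer_id
print(clean_column_name("# Orders"))  # number_orders
```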

Multi-header processing

Handles complex spreadsheets with multi-row headers by combining information from all relevant rows to create complete column descriptions

Example: A financial report with:

Row 1: "2024 Performance Metrics"
Row 2: "Revenue | Expenses | Profit"
Row 3: "USD | USD | USD"

gets processed as: "2024_performance_revenue_usd", "2024_performance_expenses_usd", "2024_performance_profit_usd"
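Mechanically, combining header rows is a join-and-slugify step, sketched below (illustrative only; the real system also condenses redundant tokens, e.g. dropping "metrics" from the combined name):

```python
import re

def combine_headers(header_rows, col):
    """Merge the header cells stacked above one column into a single
    snake_case name, skipping blank cells."""
    parts = []
    for row in header_rows:
        cell = row[col].strip().lower()
        if cell:
            parts.append(re.sub(r"\W+", "_", cell).strip("_"))
    return "_".join(parts)

headers = [
    ["2024 Performance Metrics"] * 3,   # row 1 spans all three columns
    ["Revenue", "Expenses", "Profit"],  # row 2
    ["USD", "USD", "USD"],              # row 3
]
print(combine_headers(headers, 0))  # 2024_performance_metrics_revenue_usd
```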

Excel multi-sheet analysis

For Excel files with multiple worksheets, Storytell's AI:

Cross-sheet relationship detection

Identifies logical connections and foreign key relationships between different sheets

Example: In an e-commerce workbook:

  • The "Customers" sheet has a "customer_id" column

  • The "Orders" sheet has a "customer_id" column

  • The system automatically links these as a foreign key relationship

Unified schema creation

Builds database relationships automatically across all sheets, enabling complex cross-sheet queries

Example: A retail inventory system with sheets for "Products", "Suppliers", "Orders", and "Stores" gets unified so you can ask: "Which suppliers had delivery delays that affected inventory levels in our Northwest stores?"

Smart data integration

Links related tables based on naming conventions and content analysis (e.g., "customers" and "customer_orders")

Example: Sheets named "employee_data" and "employee_performance" are automatically recognized as related, even if the linking column is "emp_id" in one sheet and "employee_number" in another.
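The simplest form of this detection, matching identically named columns across sheets, can be sketched as follows (name matching only; resolving "emp_id" against "employee_number" additionally requires the content analysis described above):

```python
from itertools import combinations

def detect_shared_keys(sheets):
    """Flag candidate join keys: identically named columns appearing in
    more than one sheet. A name-based heuristic only."""
    links = []
    for (sheet_a, cols_a), (sheet_b, cols_b) in combinations(sheets.items(), 2):
        for col in sorted(set(cols_a) & set(cols_b)):
            links.append((sheet_a, sheet_b, col))
    return links

sheets = {
    "customers": ["customer_id", "name", "region"],
    "orders": ["order_id", "customer_id", "amount"],
}
print(detect_shared_keys(sheets))  # [('customers', 'orders', 'customer_id')]
```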

Stage 2: Database creation and optimization

Each structured file becomes a dedicated, high-performance database instance optimized for analytical queries.

Database architecture:

  • Storage: Columnar format for faster aggregations

  • Memory management: Intelligent caching based on query patterns

  • Indexing: Automatic index creation on frequently queried columns

Example:

File: customer_transactions.csv (2.1M rows, 47 columns)
Processing time: ~45 seconds
Database creation:
✓ 15 automatic indexes on key columns
✓ Data type optimization (reduced storage by 40%)
✓ Constraint detection (identified 3 foreign key relationships)
✓ Query optimization statistics generated
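The indexing idea can be sketched with Python's built-in sqlite3 (Storytell's engine uses columnar storage, not SQLite, and the table and index names here are invented): an index on a frequently queried column lets the database seek directly to matching rows instead of scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_transactions (customer_id TEXT, amount REAL)")
conn.executemany("INSERT INTO customer_transactions VALUES (?, ?)",
                 [(f"C{i % 100}", float(i)) for i in range(1000)])

# An index on a frequently filtered column lets the engine seek directly
# to matching rows instead of scanning the whole table.
conn.execute("CREATE INDEX idx_customer ON customer_transactions (customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(amount) FROM customer_transactions WHERE customer_id = 'C7'"
).fetchall()
print(plan[0][-1])  # the plan mentions "USING INDEX idx_customer"
```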

Handling complex Excel files:

Example:

File: Financial_Model_2024.xlsx (12 tabs, 890K total rows)
Tab processing:
✓ "Income_Statement" → table with YoY calculations preserved
✓ "Balance_Sheet" → linked to Income_Statement via period_id
✓ "Cash_Flow" → integrated with both financial statements
✓ "Assumptions" → metadata table linked to all calculations
Cross-tab relationships: 23 automatic JOIN candidates identified

Stage 3: SQL generation engine

When you ask questions, Storytell's AI analyzes your intent and generates SQL queries tailored to your specific data structure.

Query intelligence capabilities:

  • Semantic understanding: Translates business terms to database columns

  • Context awareness: Uses file metadata and previous queries for better accuracy

  • Performance optimization: Automatically chooses efficient query patterns

  • Complex logic: Handles multi-step analytical reasoning

Real-world query examples:

Simple aggregation:

Question: "What's our average order value by region?"

Generated SQL:

SELECT
  store_location,
  ROUND(AVG(amount), 2) AS avg_order_value,
  COUNT(*) AS total_orders
FROM customer_transactions
GROUP BY store_location
ORDER BY avg_order_value DESC;

Complex multi-tab analysis:

Question: "Which product categories had declining margins in Q3, and what suppliers were involved?"

Generated SQL:

WITH quarterly_margins AS (
  SELECT
    p.category,
    s.supplier_name,
    DATE_TRUNC('quarter', o.order_date) AS quarter,
    ROUND(AVG(p.margin_percent), 2) AS avg_margin
  FROM products p
  JOIN orders o ON p.product_id = o.product_id
  JOIN suppliers s ON o.supplier_id = s.supplier_id
  GROUP BY p.category, s.supplier_name, quarter
)
SELECT
  q3.category,
  q3.supplier_name,
  q3.avg_margin,
  q2.avg_margin AS prev_quarter_margin
FROM quarterly_margins q3
JOIN quarterly_margins q2
  ON q3.category = q2.category
 AND q3.supplier_name = q2.supplier_name
WHERE q3.quarter = '2024-07-01'
  AND q2.quarter = '2024-04-01'
  AND q3.avg_margin < q2.avg_margin
ORDER BY (q2.avg_margin - q3.avg_margin) DESC;

Window functions and advanced analytics:

Question: "Show me the top 3 customers by revenue in each region, with their growth trends"

Generated SQL:

WITH quarterly_revenue AS (
  SELECT
    customer_id,
    store_location AS region,
    DATE_TRUNC('quarter', transaction_date) AS quarter,
    SUM(amount) AS total_revenue,
    COUNT(*) AS order_count
  FROM customer_transactions
  GROUP BY customer_id, store_location, quarter
),
customer_metrics AS (
  SELECT *,
    ROUND(
      (total_revenue - LAG(total_revenue) OVER (
         PARTITION BY customer_id, region ORDER BY quarter))
      / LAG(total_revenue) OVER (
         PARTITION BY customer_id, region ORDER BY quarter) * 100, 2
    ) AS growth_percent
  FROM quarterly_revenue
),
ranked_customers AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY region, quarter ORDER BY total_revenue DESC
    ) AS revenue_rank
  FROM customer_metrics
)
SELECT * FROM ranked_customers WHERE revenue_rank <= 3;
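As a sanity check, a trimmed version of this top-N-per-region ranking (synthetic data, no growth-trend column, run with Python's built-in sqlite3 rather than Storytell's engine) behaves as expected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_transactions "
             "(customer_id TEXT, store_location TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?, ?)",
    [("A", "West", 500.0), ("B", "West", 300.0), ("C", "West", 100.0),
     ("D", "West", 50.0), ("E", "East", 400.0), ("F", "East", 200.0)])

rows = conn.execute("""
    WITH customer_revenue AS (
        SELECT customer_id, store_location AS region,
               SUM(amount) AS total_revenue
        FROM customer_transactions
        GROUP BY customer_id, store_location
    ),
    ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY region ORDER BY total_revenue DESC) AS revenue_rank
        FROM customer_revenue
    )
    SELECT region, customer_id, total_revenue
    FROM ranked WHERE revenue_rank <= 3
""").fetchall()
print(rows)  # top 3 per region; "D" (4th in West) is excluded
```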

Stage 4: Results processing and presentation

Storytell converts SQL results into conversational insights while maintaining data precision and enabling follow-up analysis.

Results enhancement:

  • Contextual formatting: Automatically formats currencies, percentages, dates

  • Statistical insights: Adds trend analysis and comparative metrics

  • Visual preparation: Structures data optimally for chart generation

  • Citation tracking: Maintains links between insights and source data

Example output transformation:

Raw SQL result:

store_location | avg_order_value | total_orders
Northeast      | 847.32          | 15,432
Southeast      | 623.19          | 22,108
West Coast     | 934.67          | 18,944

Storytell enhancement:

"Your regional performance shows significant variation: **West Coast leads in average order value** at $934.67 per transaction across 18,944 orders, generating approximately $17.7M in total revenue. **Northeast shows premium positioning** with $847.32 average orders (15,432 transactions), suggesting a high-value customer base. **Southeast demonstrates volume strength** with 22,108 orders despite lower $623.19 average values, indicating broad market penetration. The 50% difference between highest and lowest average order values suggests opportunities for pricing optimization or targeted product positioning."

Technical capabilities and scale

Processing capabilities:

  • Maximum rows: 1 million

  • Maximum columns: unlimited

  • Processing time: ~5-10 minutes per million rows

Complex data structure handling

Multi-header row processing:

Example: Survey data with nested categories

Row 1: [blank] | "Demographics" | [blank] | "Preferences" | [blank]
Row 2: [blank] | "Age" | "Income" | "Product" | "Service"
Row 3: "ID" | "Years" | "USD" | "Rating" | "Rating"

Storytell creates:

  • Column: participant_id (from "ID")

  • Column: age_years (from "Demographics > Age > Years")

  • Column: income_usd (from "Demographics > Income > USD")

  • Column: product_rating (from "Preferences > Product > Rating")

  • Column: service_rating (from "Preferences > Service > Rating")
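One mechanical piece of this, recovering the span of merged category headers like "Demographics" and "Preferences", is a forward-fill over each header row (an illustrative sketch, not Storytell's implementation):

```python
def forward_fill(header_row):
    """Fill blank header cells with the nearest non-blank value to the left,
    recovering the span of merged category headers."""
    filled, last = [], ""
    for cell in header_row:
        if cell.strip():
            last = cell
        filled.append(last)
    return filled

row1 = ["", "Demographics", "", "Preferences", ""]
print(forward_fill(row1))
# ['', 'Demographics', 'Demographics', 'Preferences', 'Preferences']
```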

Embedded metadata preservation:

Example: Financial report with calculation context

Data includes: "Revenue figures exclude extraordinary items (see note 3)"
Storytell preserves: context available for the AI to reference in explanations
Query result: "Q3 revenue was $2.3M (excluding extraordinary items per note 3)"

Multi-tab Excel processing

Relationship detection algorithms:

  • Primary key identification: Detects unique identifiers across tabs

  • Foreign key mapping: Links related data between sheets automatically

  • Naming pattern analysis: Connects "customer" tab to "customer_orders" tab

  • Cell reference parsing: Follows Excel formulas to understand dependencies

Cross-tab query capabilities:

Scenario: HR database with 8 tabs

  • employees (2,400 rows): Basic employee information

  • salaries (14,000 rows): Historical compensation data

  • departments (45 rows): Organizational structure

  • projects (890 rows): Project assignments and timelines

  • reviews (7,200 rows): Performance evaluation history

  • training (12,300 rows): Skills development records

  • benefits (3,100 rows): Healthcare and retirement data

  • locations (23 rows): Office and remote work data

Complex query: "Which remote employees in the engineering department had salary increases above 8% last year while also completing AI-related training?"

Storytell automatically generates a 5-table JOIN across multiple tabs.

Pro-tips for working with Excel/CSV files

Optimizing files for best performance

For best results, structure your files like database tables:

✅ DO:

  • Use consistent headers

  • Keep each row as a complete record

  • Maintain consistent data types per column

  • Use descriptive column names (avoid "Column1", "Data", etc.)

  • Place lookup tables in separate tabs with clear relationships

  • Include date columns in YYYY-MM-DD format when possible
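If you want to sanity-check a file against this list before uploading, a rough lint pass is easy to script (an illustrative sketch, not Storytell's actual validator):

```python
import csv
import io

def lint_csv(text):
    """Cheap pre-upload checks: descriptive headers and complete rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, problems = rows[0], []
    for name in header:
        if name.strip().lower() == "data" or name.strip().lower().startswith("column"):
            problems.append(f"non-descriptive header: {name!r}")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"row {i}: incomplete record")
    return problems

sample = "customer_id,order_date,amount\nC1,2024-07-01,99.5\nC2,2024-07-02\n"
print(lint_csv(sample))  # ['row 3: incomplete record']
```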

Multi-tab Excel optimization:

  • Place lookup tables (categories, locations) in separate tabs

  • Use consistent column naming across related tabs

  • Include header rows in every tab

  • Avoid merged cells in data regions (headers are OK for simple cases)

Limitations and considerations

Data structure requirements

Files that work best with the database engine:

  • Clean tabular format: Clear headers followed by consistent data rows

  • Database-like structure: Each row represents a record, each column a field

  • Consistent columns: Same data types within each column

  • Multiple related tables: Excel files with related tabs that can be JOINed

Complex data handling

Files that may require alternative processing:

  • Survey research exports: Complex multi-question layouts with varying column structures

  • Financial reports: Heavily formatted reports with merged cells and embedded calculations

  • Pivot table exports: Grouped data with subtotals and multiple aggregation levels

  • Dashboard layouts: Mixed content with charts, summaries, and data tables

  • Research reports: Files containing "dozens of different datasets" in single sheets

Real example of complex structure:

MQ18B_Lr1: "You mentioned you started using a brand because it was popular..."

                        Total | Gen Z | Non Gen Z | Teen Z 13-17 | Young Adult Z 18-23
Life-stage               11%  |  8%   |   15%     |     6%       |        6%
Professional/academic     7%  |  8%   |    5%     |     8%       |        9%
Identity-based            4%  |  5%   |    3%     |     2%       |        6%

This type of complex survey report contains multiple questions with different column structures. For such files, Storytell has a fallback mechanism that attempts to extract meaningful information as text chunks, though this results in fidelity loss compared to structured processing and may miss relationships between data points.

File format constraints

  • Excel limitations: 30 tab maximum (Excel's practical limit for performance)

  • Encoding: Non-UTF-8 CSV files are converted to UTF-8 automatically

  • Formula preservation: Excel formulas become calculated values (not dynamic)

  • Conditional formatting: Visual formatting is lost, data values preserved
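The encoding conversion mentioned above can be approximated in a few lines (a simplified sketch; Latin-1 is just one common fallback, and real detection is more involved):

```python
def read_csv_text(raw: bytes) -> str:
    """Decode CSV bytes as UTF-8, falling back to Latin-1.
    A simplified stand-in for the automatic conversion step."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

latin1_bytes = "région,ventes\nOuest,100\n".encode("latin-1")
text = read_csv_text(latin1_bytes)
print(text.splitlines()[0])  # région,ventes
```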