How Storytell processes spreadsheets

Our processing engine converts raw spreadsheet data into SQL databases that can handle millions of records and support complex analytical queries.

Written By Mark Ku

Last updated 4 months ago

Whenever you reference one or more spreadsheet-type files in your prompt, Storytell automatically kicks off its spreadsheet processing pipeline, which transforms your raw data into AI-readable insights.

Learn more about how to work with your data: How to manage scope

Core capabilities

Spreadsheet processing

Process datasets of up to 1 million rows

Real-world examples:

  • E-commerce: customer transaction records

  • Healthcare: Clinical trial data with patient interactions

  • Manufacturing: Quality control logs

  • Finance: Daily trading data with millions of transactions per file

Cross-tab analysis

Seamlessly work with multi-sheet Excel files as if they were a unified database.

Example scenario:

A retail chain's inventory system with:

  • Products tab: SKUs, descriptions, categories, pricing

  • Suppliers tab: Vendor information, lead times, contact details

  • Orders tab: Purchase history, quantities, delivery dates

  • Stores tab: Location data, capacity, regional groupings

An example prompt you could run against this data:

Which suppliers had the highest delivery delays for electronics products in Q3, and how did this impact store inventory levels in the Northwest region?

Storytell automatically generates complex JOINs across all tabs to provide comprehensive analysis.
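To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module (the tables, columns, and figures are invented for illustration, and this is not Storytell's actual engine): each tab becomes a table, and a cross-tab question becomes a JOIN.

```python
import sqlite3

# Each worksheet tab is loaded as its own table in one database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (sku TEXT PRIMARY KEY, category TEXT, price REAL);
    CREATE TABLE orders   (sku TEXT, quantity INTEGER, supplier TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [("A1", "electronics", 99.0), ("B2", "apparel", 25.0)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("A1", 3, "Acme"), ("A1", 2, "Acme"), ("B2", 5, "Globex")])

# A cross-tab question ("revenue by category") becomes a JOIN across tabs.
rows = conn.execute("""
    SELECT p.category, SUM(o.quantity * p.price) AS revenue
    FROM orders o JOIN products p ON o.sku = p.sku
    GROUP BY p.category
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('electronics', 495.0), ('apparel', 125.0)]
```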

How it works

Stage 1: File analysis and classification

Storytell employs AI-powered analysis to understand your data's structure and optimize it for intelligent querying.

Intelligent structure detection

Our processing engine uses large language models to perform comprehensive analysis of your spreadsheet files:

Deep content analysis

AI examines the file structure, identifying header patterns, data types, and organizational logic across all rows and columns

Example: A marketing report with headers like "Q1 Revenue", "Q2 Revenue" gets recognized as quarterly financial data, while "Customer_ID", "First_Name", "Last_Name" is identified as customer demographic information.

Metadata extraction

The system automatically identifies and preserves contextual information such as:

  • Temporal data (date ranges, reporting periods)

  • Geographic identifiers (locations, regions)

  • Source information and processing notes

  • Technical parameters and methodology details

Example: From a sales report header showing "Sales Data - West Region - Q3 2024 - Preliminary Results", the system extracts:

  • Geographic: "West Region"

  • Temporal: "Q3 2024"

  • Source: "Sales Data"

  • Status: "Preliminary Results"

Column intelligence

Advanced inference determines the purpose and content of each data column, generating clean, descriptive names suitable for querying

Example: A column labeled "Cust ID" becomes "customer_id", "Rev $" becomes "revenue_amount", and "# Orders" becomes "order_count" - making them much more queryable.
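A crude, purely rule-based version of this renaming can be sketched as follows (the abbreviation table is invented for illustration; the actual pipeline infers names semantically, which is how "Rev $" can become "revenue_amount" rather than a literal token expansion):

```python
import re

# Invented abbreviation table for illustration only; the real pipeline
# infers expansions semantically instead of looking them up.
ABBREVIATIONS = {"cust": "customer", "rev": "revenue", "#": "number", "qty": "quantity"}

def clean_column_name(raw: str) -> str:
    """Turn a messy spreadsheet header into a queryable snake_case name."""
    tokens = re.findall(r"#|[a-z0-9]+", raw.lower())
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(clean_column_name("Cust ID"))   # customer_id
print(clean_column_name("# Orders"))  # number_orders
```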

Multi-header processing

Handles complex spreadsheets with multi-row headers by combining information from all relevant rows to create complete column descriptions

Example: A financial report with:

Row 1: "2024 Performance Metrics"
Row 2: "Revenue | Expenses | Profit"
Row 3: "USD | USD | USD"

gets processed as: "2024_performance_revenue_usd", "2024_performance_expenses_usd", "2024_performance_profit_usd"
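Mechanically, combining header rows is a join-and-slugify step, sketched below (illustrative only; the real system also condenses redundant tokens, e.g. dropping "metrics" from the combined name):

```python
import re

def combine_headers(header_rows, col):
    """Merge the header cells stacked above one column into a single
    snake_case name, skipping blank cells."""
    parts = []
    for row in header_rows:
        cell = row[col].strip().lower()
        if cell:
            parts.append(re.sub(r"\W+", "_", cell).strip("_"))
    return "_".join(parts)

headers = [
    ["2024 Performance Metrics"] * 3,   # row 1 spans all three columns
    ["Revenue", "Expenses", "Profit"],  # row 2
    ["USD", "USD", "USD"],              # row 3
]
print(combine_headers(headers, 0))  # 2024_performance_metrics_revenue_usd
```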

Excel multi-sheet analysis

For Excel files with multiple worksheets, Storytell's AI:

Cross-sheet relationship detection

Identifies logical connections and foreign key relationships between different sheets

Example: In an e-commerce workbook:

  • The "Customers" sheet has a "customer_id" column

  • The "Orders" sheet has a "customer_id" column

  • The system automatically links these as a foreign key relationship

Unified schema creation

Builds database relationships automatically across all sheets, enabling complex cross-sheet queries

Example: A retail inventory system with sheets for "Products", "Suppliers", "Orders", and "Stores" gets unified so you can ask: "Which suppliers had delivery delays that affected inventory levels in our Northwest stores?"

Smart data integration

Links related tables based on naming conventions and content analysis (e.g., "customers" and "customer_orders")

Example: Sheets named "employee_data" and "employee_performance" are automatically recognized as related, even if the linking column is "emp_id" in one sheet and "employee_number" in another.
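The simplest form of this detection, matching identically named columns across sheets, can be sketched as follows (name matching only; resolving "emp_id" against "employee_number" additionally requires the content analysis described above):

```python
from itertools import combinations

def detect_shared_keys(sheets):
    """Flag candidate join keys: identically named columns appearing in
    more than one sheet. A name-based heuristic only."""
    links = []
    for (sheet_a, cols_a), (sheet_b, cols_b) in combinations(sheets.items(), 2):
        for col in sorted(set(cols_a) & set(cols_b)):
            links.append((sheet_a, sheet_b, col))
    return links

sheets = {
    "customers": ["customer_id", "name", "region"],
    "orders": ["order_id", "customer_id", "amount"],
}
print(detect_shared_keys(sheets))  # [('customers', 'orders', 'customer_id')]
```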

Stage 2: Database creation and optimization

Each structured file becomes a dedicated, high-performance database instance optimized for analytical queries.

Database architecture:

  • Storage: Columnar format for faster aggregations

  • Memory management: Intelligent caching based on query patterns

  • Indexing: Automatic index creation on frequently queried columns

Example:

File: customer_transactions.csv (2.1M rows, 47 columns)
Processing time: ~45 seconds
Database creation:
✓ 15 automatic indexes on key columns
✓ Data type optimization (reduced storage by 40%)
✓ Constraint detection (identified 3 foreign key relationships)
✓ Query optimization statistics generated
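The indexing idea can be sketched with Python's built-in sqlite3 (Storytell's engine uses columnar storage, not SQLite, and the table and index names here are invented): an index on a frequently queried column lets the database seek directly to matching rows instead of scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_transactions (customer_id TEXT, amount REAL)")
conn.executemany("INSERT INTO customer_transactions VALUES (?, ?)",
                 [(f"C{i % 100}", float(i)) for i in range(1000)])

# An index on a frequently filtered column lets the engine seek directly
# to matching rows instead of scanning the whole table.
conn.execute("CREATE INDEX idx_customer ON customer_transactions (customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(amount) FROM customer_transactions WHERE customer_id = 'C7'"
).fetchall()
print(plan[0][-1])  # the plan mentions "USING INDEX idx_customer"
```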

Handling complex Excel files:

Example:

File: Financial_Model_2024.xlsx (12 tabs, 890K total rows)
Tab processing:
✓ "Income_Statement" → table with YoY calculations preserved
✓ "Balance_Sheet" → linked to Income_Statement via period_id
✓ "Cash_Flow" → integrated with both financial statements
✓ "Assumptions" → metadata table linked to all calculations
Cross-tab relationships: 23 automatic JOIN candidates identified

Stage 3: SQL generation engine

When you ask questions, Storytell's AI analyzes your intent and generates SQL queries tailored to your specific data structure.

Query intelligence capabilities:

  • Semantic understanding: Translates business terms to database columns

  • Context awareness: Uses file metadata and previous queries for better accuracy

  • Performance optimization: Automatically chooses efficient query patterns

  • Complex logic: Handles multi-step analytical reasoning

Real-world query examples:

Simple aggregation:

Question: "What's our average order value by region?"

Generated SQL:

SELECT
  store_location,
  ROUND(AVG(amount), 2) AS avg_order_value,
  COUNT(*) AS total_orders
FROM customer_transactions
GROUP BY store_location
ORDER BY avg_order_value DESC;

Complex multi-tab analysis:

Question: "Which product categories had declining margins in Q3, and what suppliers were involved?"

Generated SQL:

WITH quarterly_margins AS (
  SELECT
    p.category,
    s.supplier_name,
    DATE_TRUNC('quarter', o.order_date) AS quarter,
    ROUND(AVG(p.margin_percent), 2) AS avg_margin
  FROM products p
  JOIN orders o ON p.product_id = o.product_id
  JOIN suppliers s ON o.supplier_id = s.supplier_id
  GROUP BY p.category, s.supplier_name, quarter
)
SELECT
  q3.category,
  q3.supplier_name,
  q3.avg_margin,
  q2.avg_margin AS prev_quarter_margin
FROM quarterly_margins q3
JOIN quarterly_margins q2
  ON q3.category = q2.category
 AND q3.supplier_name = q2.supplier_name
WHERE q3.quarter = '2024-07-01'
  AND q2.quarter = '2024-04-01'
  AND q3.avg_margin < q2.avg_margin
ORDER BY (q2.avg_margin - q3.avg_margin) DESC;

Window functions and advanced analytics:

Question: "Show me the top 3 customers by revenue in each region, with their growth trends"

Generated SQL:

WITH quarterly_revenue AS (
  SELECT
    customer_id,
    store_location AS region,
    DATE_TRUNC('quarter', transaction_date) AS quarter,
    SUM(amount) AS total_revenue,
    COUNT(*) AS order_count
  FROM customer_transactions
  GROUP BY customer_id, store_location, quarter
),
customer_metrics AS (
  SELECT *,
    ROUND(
      (total_revenue - LAG(total_revenue) OVER (
         PARTITION BY customer_id, region ORDER BY quarter))
      / LAG(total_revenue) OVER (
         PARTITION BY customer_id, region ORDER BY quarter) * 100, 2
    ) AS growth_percent
  FROM quarterly_revenue
),
ranked_customers AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY region, quarter ORDER BY total_revenue DESC
    ) AS revenue_rank
  FROM customer_metrics
)
SELECT * FROM ranked_customers WHERE revenue_rank <= 3;
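As a sanity check, a trimmed version of this top-N-per-region ranking (synthetic data, no growth-trend column, run with Python's built-in sqlite3 rather than Storytell's engine) behaves as expected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_transactions "
             "(customer_id TEXT, store_location TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?, ?)",
    [("A", "West", 500.0), ("B", "West", 300.0), ("C", "West", 100.0),
     ("D", "West", 50.0), ("E", "East", 400.0), ("F", "East", 200.0)])

rows = conn.execute("""
    WITH customer_revenue AS (
        SELECT customer_id, store_location AS region,
               SUM(amount) AS total_revenue
        FROM customer_transactions
        GROUP BY customer_id, store_location
    ),
    ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY region ORDER BY total_revenue DESC) AS revenue_rank
        FROM customer_revenue
    )
    SELECT region, customer_id, total_revenue
    FROM ranked WHERE revenue_rank <= 3
""").fetchall()
print(rows)  # top 3 per region; "D" (4th in West) is excluded
```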

Stage 4: Results processing and presentation

Storytell converts SQL results into conversational insights while maintaining data precision and enabling follow-up analysis.

Results enhancement:

  • Contextual formatting: Automatically formats currencies, percentages, dates

  • Statistical insights: Adds trend analysis and comparative metrics

  • Visual preparation: Structures data optimally for chart generation

  • Citation tracking: Maintains links between insights and source data

Example output transformation:

Raw SQL result:

store_location | avg_order_value | total_orders
Northeast      | 847.32          | 15,432
Southeast      | 623.19          | 22,108
West Coast     | 934.67          | 18,944

Storytell enhancement:

"Your regional performance shows significant variation: **West Coast leads in average order value** at $934.67 per transaction across 18,944 orders, generating approximately $17.7M in total revenue. **Northeast shows premium positioning** with $847.32 average orders (15,432 transactions), suggesting a high-value customer base. **Southeast demonstrates volume strength** with 22,108 orders despite lower $623.19 average values, indicating broad market penetration. The 50% difference between highest and lowest average order values suggests opportunities for pricing optimization or targeted product positioning."

Technical capabilities and scale

Processing capabilities:

  • Maximum rows: 1 million

  • Maximum columns: unlimited

  • Processing time: ~5-10 minutes per million rows

Complex data structure handling

Multi-header row processing:

Example: Survey data with nested categories

Row 1: [blank] | "Demographics" | [blank] | "Preferences" | [blank]
Row 2: [blank] | "Age" | "Income" | "Product" | "Service"
Row 3: "ID" | "Years" | "USD" | "Rating" | "Rating"

Storytell creates:

  • Column: participant_id (from "ID")

  • Column: age_years (from "Demographics > Age > Years")

  • Column: income_usd (from "Demographics > Income > USD")

  • Column: product_rating (from "Preferences > Product > Rating")

  • Column: service_rating (from "Preferences > Service > Rating")
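One mechanical piece of this, recovering the span of merged category headers like "Demographics" and "Preferences", is a forward-fill over each header row (an illustrative sketch, not Storytell's implementation):

```python
def forward_fill(header_row):
    """Fill blank header cells with the nearest non-blank value to the left,
    recovering the span of merged category headers."""
    filled, last = [], ""
    for cell in header_row:
        if cell.strip():
            last = cell
        filled.append(last)
    return filled

row1 = ["", "Demographics", "", "Preferences", ""]
print(forward_fill(row1))
# ['', 'Demographics', 'Demographics', 'Preferences', 'Preferences']
```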

Embedded metadata preservation:

Example: Financial report with calculation context

Data includes: "Revenue figures exclude extraordinary items (see note 3)"
Storytell preserves: context available for the AI to reference in explanations
Query result: "Q3 revenue was $2.3M (excluding extraordinary items per note 3)"

Multi-tab Excel processing

Relationship detection algorithms:

  • Primary key identification: Detects unique identifiers across tabs

  • Foreign key mapping: Links related data between sheets automatically

  • Naming pattern analysis: Connects "customer" tab to "customer_orders" tab

  • Cell reference parsing: Follows Excel formulas to understand dependencies

Cross-tab query capabilities:

Scenario: HR database with 8 tabs

  • employees (2,400 rows): Basic employee information

  • salaries (14,000 rows): Historical compensation data

  • departments (45 rows): Organizational structure

  • projects (890 rows): Project assignments and timelines

  • reviews (7,200 rows): Performance evaluation history

  • training (12,300 rows): Skills development records

  • benefits (3,100 rows): Healthcare and retirement data

  • locations (23 rows): Office and remote work data

Complex query: "Which remote employees in the engineering department had salary increases above 8% last year while also completing AI-related training?"

Storytell automatically generates a 5-table JOIN across multiple tabs.

Pro-tips for working with Excel/CSV files

Optimizing files for best performance

For best results, structure your files like database tables:

✅ DO:

  • Use consistent headers

  • Keep each row as a complete record

  • Maintain consistent data types per column

  • Use descriptive column names (avoid "Column1", "Data", etc.)

  • Place lookup tables in separate tabs with clear relationships

  • Include date columns in YYYY-MM-DD format when possible
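If you want to sanity-check a file against this list before uploading, a rough lint pass is easy to script (an illustrative sketch, not Storytell's actual validator):

```python
import csv
import io

def lint_csv(text):
    """Cheap pre-upload checks: descriptive headers and complete rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, problems = rows[0], []
    for name in header:
        if name.strip().lower() == "data" or name.strip().lower().startswith("column"):
            problems.append(f"non-descriptive header: {name!r}")
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"row {i}: incomplete record")
    return problems

sample = "customer_id,order_date,amount\nC1,2024-07-01,99.5\nC2,2024-07-02\n"
print(lint_csv(sample))  # ['row 3: incomplete record']
```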

Multi-tab Excel optimization:

  • Place lookup tables (categories, locations) in separate tabs

  • Use consistent column naming across related tabs

  • Include header rows in every tab

  • Avoid merged cells in data regions (headers are OK for simple cases)

Limitations and considerations

Data structure requirements

Files that work best with the database engine:

  • Clean tabular format: Clear headers followed by consistent data rows

  • Database-like structure: Each row represents a record, each column a field

  • Consistent columns: Same data types within each column

  • Multiple related tables: Excel files with related tabs that can be JOINed

Complex data handling

Files that may require alternative processing:

  • Survey research exports: Complex multi-question layouts with varying column structures

  • Financial reports: Heavily formatted reports with merged cells and embedded calculations

  • Pivot table exports: Grouped data with subtotals and multiple aggregation levels

  • Dashboard layouts: Mixed content with charts, summaries, and data tables

  • Research reports: Files containing "dozens of different datasets" in single sheets

Real example of complex structure:

MQ18B_Lr1: "You mentioned you started using a brand because it was popular..."

                        Total | Gen Z | Non Gen Z | Teen Z 13-17 | Young Adult Z 18-23
Life-stage               11%  |  8%   |   15%     |     6%       |        6%
Professional/academic     7%  |  8%   |    5%     |     8%       |        9%
Identity-based            4%  |  5%   |    3%     |     2%       |        6%

This type of complex survey report contains multiple questions with different column structures. For such files, Storytell has a fallback mechanism that attempts to extract meaningful information as text chunks, though this results in fidelity loss compared to structured processing and may miss relationships between data points.

File format constraints

  • Excel limitations: 30 tab maximum (Excel's practical limit for performance)

  • Encoding: Non-UTF-8 CSV files are converted to UTF-8 automatically

  • Formula preservation: Excel formulas become calculated values (not dynamic)

  • Conditional formatting: Visual formatting is lost, data values preserved
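The encoding conversion mentioned above can be approximated in a few lines (a simplified sketch; Latin-1 is just one common fallback, and real detection is more involved):

```python
def read_csv_text(raw: bytes) -> str:
    """Decode CSV bytes as UTF-8, falling back to Latin-1.
    A simplified stand-in for the automatic conversion step."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

latin1_bytes = "région,ventes\nOuest,100\n".encode("latin-1")
text = read_csv_text(latin1_bytes)
print(text.splitlines()[0])  # région,ventes
```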