Studio Founder

Extracting Data From 5 Investor Decks Into Structured JSON

5 decks in 2.5 min vs 10 hrs | Design & Content | 5 min read

Key Takeaway

Feed 5 competitor pitch decks to an AI agent and get structured JSON with every slide's text, tables, chart data, and images: 30 seconds per deck instead of 2 hours.

The Problem

We were evaluating 5 companies for potential investment. Each sent a pitch deck. PPTX format. Between 15 and 30 slides each.

The old approach: one analyst opens each deck, reads through it, takes notes in a spreadsheet. Extraction is manual. Comparison is manual. Finding patterns across decks ("do all of them lead with the market size slide?") is manual.

Two hours per deck. Ten hours total. For what is fundamentally a data extraction task.

Here's the real pain: every deck structures information differently. Company A puts revenue on slide 4. Company B buries it in the appendix. Company C shows it as a chart with no data labels. You can't compare them until you normalize the data. And normalization by hand is tedious, error-prone work.

I don't hire humans for tedious, error-prone work. That's what agents are for.

The Solution

The PowerPoint/PPTX agent reads existing .pptx files and extracts everything (text, tables, chart data, images, speaker notes) into structured JSON. Five decks become five JSON files that can be queried, compared, and analyzed programmatically.

No manual reading. No spreadsheet. No "I think slide 7 had the ARR number."

The Process

Point the agent at a directory of PPTX files:

# deck-extraction-config.yaml
input:
  directory: "./decks/"
  files:
    - "company-a-pitch.pptx"
    - "company-b-pitch.pptx"
    - "company-c-pitch.pptx"
    - "company-d-pitch.pptx"
    - "company-e-pitch.pptx"

extraction:
  text: true
  tables: true
  charts: true          # Extract underlying chart data
  images: true          # Export embedded images
  speaker_notes: true
  slide_dimensions: true

output:
  format: json
  directory: "./analysis/"
  images_dir: "./analysis/images/"

analysis:
  compare: true
  detect_slide_types: true    # Classify: title, team, market, product, financials, ask
  extract_metrics: true        # Pull numbers: ARR, growth rate, market size, raise amount
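
The `input` and `output` sections of this config amount to a batch loop: find every deck in one directory, extract it, and write one JSON file per deck to another. Below is a minimal sketch of that loop using only the standard library. The function name `run_batch` and the `extract` parameter are illustrative assumptions, not the agent's actual interface; `extract` stands in for whatever function turns a deck path into a dictionary.

```python
import json
from pathlib import Path
from typing import Callable


def run_batch(input_dir: str, output_dir: str,
              extract: Callable[[str], dict]) -> list[Path]:
    """Run `extract` on every .pptx in input_dir; write one JSON file per deck.

    `extract` is a hypothetical stand-in for the real extraction function.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() keeps the processing order stable between runs
    for deck_path in sorted(Path(input_dir).glob("*.pptx")):
        deck = extract(str(deck_path))
        target = out / f"{deck_path.stem}.json"
        target.write_text(json.dumps(deck, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

With the extraction function from this article, a call would look like `run_batch("./decks/", "./analysis/", extract_deck)`.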

The extraction logic:

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

# classify_slide, extract_notes, get_position, get_dominant_font_size,
# extract_table and save_image are helper functions defined elsewhere
# in the pipeline; they are not shown here.

def extract_deck(filepath):
    """Extract text, tables, charts, images and notes from one .pptx file."""
    prs = Presentation(filepath)
    deck = {
        "filename": filepath,
        "slide_count": len(prs.slides),
        "dimensions": {
            "width": prs.slide_width.inches,
            "height": prs.slide_height.inches
        },
        "slides": []
    }

    for i, slide in enumerate(prs.slides):
        slide_data = {
            "index": i + 1,
            "layout": slide.slide_layout.name,
            "type": classify_slide(slide),  # ML-based classification
            "text_blocks": [],
            "tables": [],
            "charts": [],
            "images": [],
            "notes": extract_notes(slide)
        }

        # Only top-level shapes are visited; shapes nested inside groups
        # are not walked here.
        for shape in slide.shapes:
            if shape.has_text_frame:
                slide_data["text_blocks"].append({
                    "text": shape.text_frame.text,
                    "position": get_position(shape),
                    "font_size": get_dominant_font_size(shape)
                })

            # has_table is defined on graphic frames; getattr keeps other
            # shape types (pictures, autoshapes) from raising AttributeError.
            if getattr(shape, "has_table", False):
                slide_data["tables"].append(
                    extract_table(shape.table)
                )

            if shape.has_chart:
                slide_data["charts"].append(
                    extract_chart_data(shape.chart)
                )

            if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                img_path = save_image(shape.image, i, filepath)
                slide_data["images"].append(img_path)

        deck["slides"].append(slide_data)

    return deck

def extract_chart_data(chart):
    """Pull the actual data behind a chart, not just the visual."""
    # Categories belong to the plot, so read them once and reuse them.
    categories = [str(c) for c in chart.plots[0].categories]

    title = None
    if chart.has_title and chart.chart_title.has_text_frame:
        title = chart.chart_title.text_frame.text

    data = {
        # str() renders as e.g. "COLUMN_CLUSTERED (51)"; keep only the name.
        "chart_type": str(chart.chart_type).split(" ")[0],
        "title": title,
        "series": []
    }
    for series in chart.series:
        data["series"].append({
            "name": series.name,
            # Missing data points come back as None; pass them through.
            "values": [float(v) if v is not None else None
                       for v in series.values],
            "categories": list(categories)
        })
    return data
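
The extraction code calls several helpers that the article does not show. As a rough sketch of what three of them might look like, the versions below rely only on attributes python-pptx exposes (`table.rows`, `row.cells`, `cell.text`, `slide.has_notes_slide`, `slide.notes_slide.notes_text_frame`, and the `left`/`top`/`width`/`height` lengths in EMU). The function bodies are assumptions for illustration, not the agent's real implementation, and they are written against plain attribute access so they can be tried without a deck on hand.

```python
EMU_PER_INCH = 914400  # python-pptx reports lengths in English Metric Units


def extract_table(table):
    """Return a table as a list of rows, each a list of cell strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def extract_notes(slide):
    """Return the speaker-notes text, or None when the slide has no notes."""
    # Check has_notes_slide first: reading slide.notes_slide on a slide
    # without notes would create an empty notes slide as a side effect.
    if not slide.has_notes_slide:
        return None
    frame = slide.notes_slide.notes_text_frame
    return frame.text if frame is not None else None


def get_position(shape):
    """Return shape geometry in inches; None where the value is inherited."""
    def inches(emu):
        return round(emu / EMU_PER_INCH, 2) if emu is not None else None

    return {
        "left": inches(shape.left),
        "top": inches(shape.top),
        "width": inches(shape.width),
        "height": inches(shape.height),
    }
```

Returning tables as nested lists keeps them directly JSON-serializable; a header-aware version that returns a list of dictionaries would be a reasonable alternative when the first row is known to hold column names.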

The output JSON for a single deck:

{
  "filename": "company-a-pitch.pptx",
  "slide_count": 22,
  "slides": [
    {
      "index": 1,
      "type": "title",
      "text_blocks": [
        {"text": "Company A - Series A", "font_size": 36},
        {"text": "Reinventing supply chain visibility", "font_size": 18}
      ]
    },
    {
      "index": 5,
      "type": "financials",
      "text_blocks": [{"text": "Financial Performance", "font_size": 28}],
      "charts": [{
        "chart_type": "COLUMN_CLUSTERED",
        "title": "Monthly Revenue",
        "series": [{
          "name": "Revenue",
          "values": [45000, 52000, 61000, 78000, 95000, 112000],
          "categories": ["Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]
        }]
      }]
    }
  ],
  "extracted_metrics": {
    "arr": "$1.34M",
    "mrr_growth": "18% MoM",
    "market_size": "$12B",
    "raise_amount": "$5M",
    "team_size": 14,
    "founded": "2024"
  }
}
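
Once a deck is in this shape, pulling a specific piece of data is a few lines of dictionary traversal. The sketch below collects every chart series on slides of a given type. The function name `chart_series` is a hypothetical helper, and the embedded sample is a trimmed copy of the JSON shown above.

```python
import json


def chart_series(deck: dict, slide_type: str) -> list[dict]:
    """Collect every chart series found on slides of the given type."""
    found = []
    for slide in deck.get("slides", []):
        if slide.get("type") != slide_type:
            continue
        for chart in slide.get("charts", []):
            for series in chart.get("series", []):
                found.append({
                    "slide": slide["index"],
                    "chart": chart.get("title"),
                    "name": series.get("name"),
                    # Pair each category label with its value
                    "points": dict(zip(series["categories"], series["values"])),
                })
    return found


# Trimmed copy of the sample output above
deck = json.loads("""
{
  "filename": "company-a-pitch.pptx",
  "slides": [
    {"index": 1, "type": "title", "text_blocks": []},
    {"index": 5, "type": "financials",
     "charts": [{"chart_type": "COLUMN_CLUSTERED", "title": "Monthly Revenue",
                 "series": [{"name": "Revenue",
                             "values": [45000, 52000, 61000, 78000, 95000, 112000],
                             "categories": ["Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]}]}]}
  ]
}
""")

revenue = chart_series(deck, "financials")
```

Here `revenue[0]["points"]["Mar"]` holds the March value from the chart on slide 5.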

The comparison analysis across all 5 decks:

{
  "comparison": {
    "slide_structure": {
      "all_include": ["title", "problem", "solution", "market", "team", "ask"],
      "most_include": ["traction", "financials", "competitive"],
      "unique_to_one": {
        "company-c": ["regulatory_landscape"],
        "company-e": ["case_studies"]
      }
    },
    "metrics_comparison": {
      "arr_range": ["$800K", "$4.2M"],
      "growth_range": ["12% MoM", "22% MoM"],
      "raise_range": ["$3M", "$8M"],
      "average_slide_count": 21
    },
    "patterns": [
      "4/5 lead with a customer quote, not the problem statement",
      "3/5 put financials in the second half (after slide 12)",
      "Only 2/5 include a clear competitive matrix",
      "All 5 have the team slide within the last 4 slides"
    ]
  }
}
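
The `slide_structure` part of this comparison can be derived from nothing more than the list of slide types per deck. The sketch below is one way to do it with set counting. The function name, the three-deck sample, and the definition of "most" (more than half of the decks, but not all) are assumptions made for illustration; the article does not state the thresholds the agent uses.

```python
from collections import Counter


def compare_structure(deck_types: dict[str, list[str]]) -> dict:
    """Group slide types by how many decks contain them."""
    n = len(deck_types)
    # Count each slide type once per deck, however often it repeats
    counts = Counter(t for types in deck_types.values() for t in set(types))

    all_include = sorted(t for t, c in counts.items() if c == n)
    # Assumption: "most" means more than half of the decks, but not all
    most_include = sorted(t for t, c in counts.items() if n / 2 < c < n)

    unique_to_one = {}
    for deck, types in deck_types.items():
        only_here = sorted(t for t in set(types) if counts[t] == 1) if n > 1 else []
        if only_here:
            unique_to_one[deck] = only_here

    return {
        "all_include": all_include,
        "most_include": most_include,
        "unique_to_one": unique_to_one,
    }


# Hypothetical slide types for three decks
sample = {
    "company-a": ["title", "problem", "financials", "team"],
    "company-b": ["title", "problem", "financials", "team"],
    "company-c": ["title", "problem", "regulatory_landscape", "team"],
}
structure = compare_structure(sample)
```

For this sample, `title`, `problem` and `team` land in `all_include`, `financials` in `most_include`, and `regulatory_landscape` is reported as unique to `company-c`.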

The Results

| Metric | Manual Analysis | Agent Extraction |
| --- | --- | --- |
| Time per deck | ~2 hours | 30 seconds |
| Total for 5 decks | ~10 hours | 2.5 minutes |
| Chart data extracted | Eyeballed from visuals | Exact underlying values |
| Cross-deck comparison | Manual spreadsheet | Automated JSON diff |
| Patterns identified | Subjective impressions | Data-driven structural analysis |
| Repeatable | Start from scratch each time | Same pipeline, new decks |

The chart data extraction alone is worth it. When a deck shows a revenue chart, the agent pulls the actual numbers behind the chart, not what you squint at on the visual. That's the difference between "looks like they're growing fast" and "18% MoM, $112K last month."
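
To make that concrete, the figures can be recomputed from the extracted series. The sketch below uses the monthly revenue values from the sample JSON; the helper name `mom_growth` is illustrative, and it assumes no month has zero revenue.

```python
def mom_growth(values):
    """Month-over-month growth rate between each pair of consecutive values."""
    # Assumes every previous value is non-zero
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


revenue = [45000, 52000, 61000, 78000, 95000, 112000]  # Oct..Mar, from the sample

rates = mom_growth(revenue)
latest = rates[-1]            # (112000 - 95000) / 95000
run_rate = revenue[-1] * 12   # latest month annualized

print(f"latest MoM: {latest:.1%}, run rate: ${run_rate / 1e6:.2f}M")
# prints: latest MoM: 17.9%, run rate: $1.34M
```

The latest-month rate rounds to the 18% figure, and twelve times the latest month reproduces the $1.34M ARR in the sample metrics. Whether a given deck quotes the latest month or an average is something to check per deck.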

Try It Yourself

Drop your PPTX files in a directory and point the agent at them. You'll get structured JSON per deck plus a comparison file if you enable the analysis mode. Works with any PowerPoint: investor decks, sales decks, internal presentations.

The extracted metrics feature uses pattern matching for common business terms (ARR, MRR, TAM, team size, raise amount). It's not perfect: sometimes the raise amount is on the title slide, sometimes buried in notes. But it catches 80%+ of standard pitch deck metrics without configuration.
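
The article does not show the patterns the agent uses, so the following is only a sketch of what label-based pattern matching can look like. The regular expressions, the `PATTERNS` table and the `extract_metrics` function are all hypothetical, and they cover just one phrasing: a label followed shortly by a dollar amount with a K, M or B suffix.

```python
import re

# A dollar amount such as "$1.34M", "$ 12B" or "$800K"
MONEY = r"\$\s?(\d+(?:\.\d+)?)\s?([KMB])\b"

# Hypothetical patterns: a label, up to 20 filler characters, then an amount
PATTERNS = {
    "arr": re.compile(r"\bARR\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
    "market_size": re.compile(r"\b(?:TAM|market size)\b[^$\n]{0,20}" + MONEY,
                              re.IGNORECASE),
    "raise_amount": re.compile(r"\brais(?:e|ing)\b[^$\n]{0,20}" + MONEY,
                               re.IGNORECASE),
}


def extract_metrics(text: str) -> dict[str, str]:
    """Return the first match per metric, normalized to a form like '$1.34M'."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            amount, unit = match.group(1), match.group(2).upper()
            metrics[name] = f"${amount}{unit}"
    return metrics


slide_text = "ARR: $1.34M\nTAM of $12B\nRaising $5M Series A"
found = extract_metrics(slide_text)
```

Patterns like these miss amounts that come before the label ("$5M raise") or that are written out in words, which is one plausible source of the misses described above.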


Reading a pitch deck is 10% insight, 90% data extraction. Automate the 90%. Spend your time on the 10% that requires judgment.

Tags: pitch deck analysis, PowerPoint, data extraction, investment

Extracting Data From 5 Investor Decks Into Structured JSON - Mr.Chief