Studio Founder

Extracting Data From 5 Investor Decks Into Structured JSON

5 decks in 2.5 min vs 10 hrs | Design & Content | 5 min read

Key Takeaway

Feed 5 competitor pitch decks to an AI agent, get structured JSON with every slide's text, tables, chart data, and images — 30 seconds per deck instead of 2 hours.

The Problem

We were evaluating 5 companies for potential investment. Each sent a pitch deck. PPTX format. Between 15 and 30 slides each.

The old approach: one analyst opens each deck, reads through it, takes notes in a spreadsheet. Extraction is manual. Comparison is manual. Finding patterns across decks — "do all of them lead with the market size slide?" — is manual.

Two hours per deck. Ten hours total. For what is fundamentally a data extraction task.

Here's the real pain: every deck structures information differently. Company A puts revenue on slide 5. Company B buries it in the appendix. Company C shows it as a chart with no data labels. You can't compare them until you normalize the data. And normalization by hand is tedious, error-prone work.

I don't hire humans for tedious, error-prone work. That's what agents are for.

The Solution

The PowerPoint/PPTX agent reads existing .pptx files and extracts everything — text, tables, chart data, images, speaker notes — into structured JSON. Five decks become five JSON files that can be queried, compared, and analyzed programmatically.

No manual reading. No spreadsheet. No "I think slide 7 had the ARR number."

The Process

Point the agent at a directory of PPTX files:

# deck-extraction-config.yaml
input:
  directory: "./decks/"
  files:
    - "company-a-pitch.pptx"
    - "company-b-pitch.pptx"
    - "company-c-pitch.pptx"
    - "company-d-pitch.pptx"
    - "company-e-pitch.pptx"

extraction:
  text: true
  tables: true
  charts: true          # Extract underlying chart data
  images: true          # Export embedded images
  speaker_notes: true
  slide_dimensions: true

output:
  format: json
  directory: "./analysis/"
  images_dir: "./analysis/images/"

analysis:
  compare: true
  detect_slide_types: true    # Classify: title, team, market, product, financials, ask
  extract_metrics: true        # Pull numbers: ARR, growth rate, market size, raise amount
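
A config like this boils down to a loop over the listed files that writes one JSON file per deck. The sketch below is a minimal illustration of that loop, not the agent's actual implementation. `run_batch` and its `extract` parameter are hypothetical names; `extract` stands for any function that takes a file path and returns a JSON-serializable dict, such as the `extract_deck` function in this article.

```python
import json
from pathlib import Path


def run_batch(input_dir, filenames, output_dir, extract):
    """Run `extract` on each listed deck and write one JSON file per deck.

    Hypothetical helper: `extract` is any callable that takes a file path
    and returns a JSON-serializable dict.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name in filenames:
        source = Path(input_dir) / name
        if not source.exists():
            # Skip missing files instead of aborting the whole batch.
            print(f"skipping missing file: {source}")
            continue
        deck = extract(str(source))
        target = out / (source.stem + ".json")
        target.write_text(json.dumps(deck, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Passing the extractor in as an argument keeps the batch loop independent of any particular PPTX library, so it can be exercised with a stub before real decks are involved.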

The extraction logic:

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

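def extract_deck(filepath):
    """Extract one .pptx file into a JSON-serializable dict.

    Helpers such as classify_slide, extract_notes, get_position,
    get_dominant_font_size, extract_table, and save_image are not
    shown here and must be defined elsewhere.
    """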
    prs = Presentation(filepath)
    deck = {
        "filename": filepath,
        "slide_count": len(prs.slides),
        "dimensions": {
            "width": prs.slide_width.inches,
            "height": prs.slide_height.inches
        },
        "slides": []
    }

    for i, slide in enumerate(prs.slides):
        slide_data = {
            "index": i + 1,
            "layout": slide.slide_layout.name,
            "type": classify_slide(slide),  # ML-based classification
            "text_blocks": [],
            "tables": [],
            "charts": [],
            "images": [],
            "notes": extract_notes(slide)
        }

        for shape in slide.shapes:
            if shape.has_text_frame:
                slide_data["text_blocks"].append({
                    "text": shape.text_frame.text,
                    "position": get_position(shape),
                    "font_size": get_dominant_font_size(shape)
                })

            if shape.has_table:
                slide_data["tables"].append(
                    extract_table(shape.table)
                )

            if shape.has_chart:
                slide_data["charts"].append(
                    extract_chart_data(shape.chart)
                )

            if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                img_path = save_image(shape.image, i, filepath)
                slide_data["images"].append(img_path)

        deck["slides"].append(slide_data)

    return deck

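def extract_chart_data(chart):
    """Pull the actual data behind a chart, not just the visual."""
    # Only read the title when one is actually set; touching text_frame on an
    # empty title would create one as a side effect.
    has_title_text = chart.has_title and chart.chart_title.has_text_frame
    # Categories are shared by every series in the first plot, so read them once.
    categories = [str(c) for c in chart.plots[0].categories]
    data = {
        "chart_type": str(chart.chart_type),
        "title": chart.chart_title.text_frame.text if has_title_text else None,
        "series": []
    }
    for series in chart.series:
        data["series"].append({
            "name": series.name,
            # Blank cells can come back as None; keep them as None rather than crash.
            "values": [None if v is None else float(v) for v in series.values],
            "categories": categories
        })
    return data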
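
The extraction code calls several helpers it does not define. As a rough sketch, three of them (`extract_table`, `extract_notes`, `get_position`) could look like the following. These bodies are assumptions written for illustration, not the agent's code. They rely only on attribute names python-pptx exposes (`table.rows`, `row.cells`, `cell.text`, `slide.has_notes_slide`, `slide.notes_slide.notes_text_frame`, and `shape.left/top/width/height`), so they work on any object with that shape.

```python
EMU_PER_INCH = 914400  # PowerPoint stores lengths in English Metric Units


def extract_table(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def extract_notes(slide):
    """Return speaker notes text, or None when the slide has no notes."""
    if not slide.has_notes_slide:
        return None
    frame = slide.notes_slide.notes_text_frame
    return frame.text if frame is not None else None


def get_position(shape):
    """Return the shape's box in inches; None for values the shape inherits."""
    def to_inches(emu):
        return None if emu is None else round(emu / EMU_PER_INCH, 2)

    return {
        "left": to_inches(shape.left),
        "top": to_inches(shape.top),
        "width": to_inches(shape.width),
        "height": to_inches(shape.height),
    }
```

Checking `has_notes_slide` first matters because reading `notes_slide` on a slide without notes creates an empty notes page as a side effect.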

The output JSON for a single deck:

{
  "filename": "company-a-pitch.pptx",
  "slide_count": 22,
  "slides": [
    {
      "index": 1,
      "type": "title",
      "text_blocks": [
        {"text": "Company A — Series A", "font_size": 36},
        {"text": "Reinventing supply chain visibility", "font_size": 18}
      ]
    },
    {
      "index": 5,
      "type": "financials",
      "text_blocks": [{"text": "Financial Performance", "font_size": 28}],
      "charts": [{
        "chart_type": "COLUMN_CLUSTERED",
        "title": "Monthly Revenue",
        "series": [{
          "name": "Revenue",
          "values": [45000, 52000, 61000, 78000, 95000, 112000],
          "categories": ["Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]
        }]
      }]
    }
  ],
  "extracted_metrics": {
    "arr": "$1.34M",
    "mrr_growth": "18% MoM",
    "market_size": "$12B",
    "raise_amount": "$5M",
    "team_size": 14,
    "founded": "2024"
  }
}
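
Once a deck is in this shape, finding where each company put a given kind of slide is a short query. The sketch below assumes the field names from the sample output above (`slides`, `index`, `type`, `charts`, `series`); the function names `slides_of_type` and `chart_series` are hypothetical.

```python
def slides_of_type(deck, slide_type):
    """Return the 1-based indexes of slides classified as `slide_type`."""
    return [s["index"] for s in deck["slides"] if s.get("type") == slide_type]


def chart_series(deck, slide_type):
    """Yield (slide index, chart title, series dict) for charts on matching slides."""
    for slide in deck["slides"]:
        if slide.get("type") != slide_type:
            continue
        for chart in slide.get("charts", []):
            for series in chart["series"]:
                yield slide["index"], chart.get("title"), series


# Trimmed copy of the sample output, inlined so the example runs on its own.
deck = {
    "slides": [
        {"index": 1, "type": "title"},
        {"index": 5, "type": "financials", "charts": [{
            "title": "Monthly Revenue",
            "series": [{"name": "Revenue", "values": [45000, 52000, 61000]}],
        }]},
    ]
}

print(slides_of_type(deck, "financials"))  # [5]
for index, title, series in chart_series(deck, "financials"):
    print(index, title, series["name"], series["values"][-1])
```

Running the same two functions over every deck gives a per-company answer to "which slide has the revenue?" without opening a single file.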

The comparison analysis across all 5 decks:

{
  "comparison": {
    "slide_structure": {
      "all_include": ["title", "problem", "solution", "market", "team", "ask"],
      "most_include": ["traction", "financials", "competitive"],
      "unique_to_one": {
        "company-c": ["regulatory_landscape"],
        "company-e": ["case_studies"]
      }
    },
    "metrics_comparison": {
      "arr_range": ["$800K", "$4.2M"],
      "growth_range": ["12% MoM", "22% MoM"],
      "raise_range": ["$3M", "$8M"],
      "average_slide_count": 21
    },
    "patterns": [
      "4/5 lead with a customer quote, not the problem statement",
      "3/5 put financials in the second half (after slide 12)",
      "Only 2/5 include a clear competitive matrix",
      "All 5 have the team slide within the last 4 slides"
    ]
  }
}
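
The `slide_structure` part of this comparison can be derived from nothing more than the list of slide types per deck. The following is a minimal sketch under stated assumptions: `compare_structure` is a hypothetical name, and treating "most" as "more than half but not all" is a threshold chosen for this example, not one the agent documents.

```python
from collections import Counter


def compare_structure(decks):
    """Compare slide types across decks.

    `decks` maps a deck name to the list of slide types it contains.
    """
    type_sets = {name: set(types) for name, types in decks.items()}
    # Count in how many decks each slide type appears at least once.
    counts = Counter(t for types in type_sets.values() for t in types)
    total = len(type_sets)

    unique_to_one = {}
    for name, types in type_sets.items():
        only_here = sorted(t for t in types if counts[t] == 1)
        if only_here:
            unique_to_one[name] = only_here

    return {
        "all_include": sorted(t for t, n in counts.items() if n == total),
        # Assumption: "most" means more than half of the decks, but not all.
        "most_include": sorted(t for t, n in counts.items() if total / 2 < n < total),
        "unique_to_one": unique_to_one,
    }
```

Converting each deck to a set first means a deck with three "traction" slides counts once, which matches the question being asked: how many decks include the slide type at all.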

The Results

Metric	Manual Analysis	Agent Extraction
Time per deck	~2 hours	30 seconds
Total for 5 decks	~10 hours	2.5 minutes
Chart data extracted	Eyeballed from visuals	Exact underlying values
Cross-deck comparison	Manual spreadsheet	Automated JSON diff
Patterns identified	Subjective impressions	Data-driven structural analysis
Repeatable	Start from scratch each time	Same pipeline, new decks

The chart data extraction alone is worth it. When a deck shows a revenue chart, the agent pulls the actual numbers behind the chart — not what you squint at on the visual. That's the difference between "looks like they're growing fast" and "18% MoM, $112K last month."
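
With the underlying series in hand, those headline numbers are plain arithmetic. The sketch below shows one plausible derivation using the sample revenue series from the output above. It assumes the chart shows monthly recurring revenue and that ARR is quoted as a run rate (latest month times 12). The article does not say how the agent computes these, and both function names are hypothetical.

```python
def latest_mom_growth(values):
    """Growth of the last month over the one before it, as a fraction."""
    if len(values) < 2 or not values[-2]:
        return None
    return values[-1] / values[-2] - 1


def run_rate_arr(values):
    """Annualized run rate: latest monthly revenue times 12."""
    return values[-1] * 12 if values else None


revenue = [45000, 52000, 61000, 78000, 95000, 112000]
growth = latest_mom_growth(revenue)
arr = run_rate_arr(revenue)
print(f"{growth:.0%} MoM, ${revenue[-1] / 1000:.0f}K last month, ${arr / 1e6:.2f}M ARR")
# prints: 18% MoM, $112K last month, $1.34M ARR
```

Under these assumptions the results line up with the sample deck's extracted metrics: 112,000 / 95,000 is about 17.9% growth, and 112,000 × 12 is $1.344M.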

Try It Yourself

Drop your PPTX files in a directory and point the agent at them. You'll get structured JSON per deck plus a comparison file if you enable the analysis mode. Works with any PowerPoint — investor decks, sales decks, internal presentations.

The extracted metrics feature uses pattern matching for common business terms (ARR, MRR, TAM, team size, raise amount). It's not perfect — sometimes the raise amount is on the title slide, sometimes buried in notes. But it catches 80%+ of standard pitch deck metrics without configuration.
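
To give a feel for what pattern matching on business terms can look like, here is a small regex-based sketch. The patterns and the `extract_metrics` function are illustrative assumptions, not the agent's actual rules, and they cover only a few phrasings; a production extractor would need many more variants.

```python
import re

# Hypothetical patterns for illustration only.
# MONEY matches amounts like "$1.34M", "$ 12B", or "$800K".
MONEY = r"\$\s?(\d+(?:\.\d+)?)\s?([KMB])\b"
PATTERNS = {
    "arr": re.compile(r"\bARR\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
    "market_size": re.compile(r"\b(?:TAM|market size)\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
    "raise_amount": re.compile(r"\brais(?:e|ing)\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
    "team_size": re.compile(r"\b(?:team of|team size:?)\s*(\d+)\b", re.IGNORECASE),
}


def extract_metrics(text):
    """Return the first match for each known metric found in `text`."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if not match:
            continue
        if key == "team_size":
            found[key] = int(match.group(1))
        else:
            # Normalize to the "$<number><suffix>" form used in the JSON output.
            found[key] = f"${match.group(1)}{match.group(2).upper()}"
    return found


sample = "ARR: $1.34M\nTAM of $12B\nRaising $5M Series A\nTeam of 14"
print(extract_metrics(sample))
```

Running it over the text of every slide plus the speaker notes, rather than one fixed slide, is what lets this kind of matcher cope with a raise amount that sits on the title slide in one deck and in the notes in another.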

Reading a pitch deck is 10% insight, 90% data extraction. Automate the 90%. Spend your time on the 10% that requires judgment.
