Studio Founder
Extracting Data From 5 Investor Decks Into Structured JSON
Key Takeaway
Feed 5 competitor pitch decks to an AI agent and get structured JSON with every slide's text, tables, chart data, and images: 30 seconds per deck instead of 2 hours.
The Problem
We were evaluating 5 companies for potential investment. Each sent a pitch deck. PPTX format. Between 15 and 30 slides each.
The old approach: one analyst opens each deck, reads through it, takes notes in a spreadsheet. Extraction is manual. Comparison is manual. Finding patterns across decks ("do all of them lead with the market size slide?") is manual.
Two hours per deck. Ten hours total. For what is fundamentally a data extraction task.
Here's the real pain: every deck structures information differently. Company A puts revenue on slide 4. Company B buries it in the appendix. Company C shows it as a chart with no data labels. You can't compare them until you normalize the data. And normalization by hand is tedious, error-prone work.
I don't hire humans for tedious, error-prone work. That's what agents are for.
The Solution
The PowerPoint/PPTX agent reads existing .pptx files and extracts everything (text, tables, chart data, images, speaker notes) into structured JSON. Five decks become five JSON files that can be queried, compared, and analyzed programmatically.
No manual reading. No spreadsheet. No "I think slide 7 had the ARR number."
The Process
Point the agent at a directory of PPTX files:
```yaml
# deck-extraction-config.yaml
input:
  directory: "./decks/"
  files:
    - "company-a-pitch.pptx"
    - "company-b-pitch.pptx"
    - "company-c-pitch.pptx"
    - "company-d-pitch.pptx"
    - "company-e-pitch.pptx"
extraction:
  text: true
  tables: true
  charts: true              # Extract underlying chart data
  images: true              # Export embedded images
  speaker_notes: true
  slide_dimensions: true
output:
  format: json
  directory: "./analysis/"
  images_dir: "./analysis/images/"
analysis:
  compare: true
  detect_slide_types: true  # Classify: title, team, market, product, financials, ask
  extract_metrics: true     # Pull numbers: ARR, growth rate, market size, raise amount
```
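For illustration, here is roughly how a config like this could drive a batch run. Treat it as a sketch under assumptions, not the agent's actual implementation: `run_batch` and its parameters are hypothetical names, and `extract` stands in for whatever function turns one deck into a dictionary.

```python
import json
from pathlib import Path
from typing import Callable


def run_batch(input_dir: str, files: list[str], output_dir: str,
              extract: Callable[[str], dict]) -> list[Path]:
    """Run `extract` on each listed deck and write one JSON file per deck.

    Hypothetical helper: the arguments mirror the `input` and `output`
    sections of the config, but the mapping is an assumption.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name in files:
        source = Path(input_dir) / name
        deck = extract(str(source))
        # company-a-pitch.pptx -> company-a-pitch.json
        target = out / (source.stem + ".json")
        target.write_text(json.dumps(deck, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Passing the extractor in as an argument keeps the file handling separate from the PPTX parsing, so the loop can be exercised with a stub before any real deck is involved.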
The extraction logic:
```python
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

# classify_slide, extract_notes, get_position, get_dominant_font_size,
# extract_table, and save_image are helper functions defined elsewhere.


def extract_deck(filepath):
    prs = Presentation(filepath)
    deck = {
        "filename": filepath,
        "slide_count": len(prs.slides),
        "dimensions": {
            "width": prs.slide_width.inches,
            "height": prs.slide_height.inches
        },
        "slides": []
    }
    for i, slide in enumerate(prs.slides):
        slide_data = {
            "index": i + 1,
            "layout": slide.slide_layout.name,
            "type": classify_slide(slide),  # ML-based classification
            "text_blocks": [],
            "tables": [],
            "charts": [],
            "images": [],
            "notes": extract_notes(slide)
        }
        # Top-level shapes only; shapes nested inside groups are not visited
        for shape in slide.shapes:
            if shape.has_text_frame:
                slide_data["text_blocks"].append({
                    "text": shape.text_frame.text,
                    "position": get_position(shape),
                    "font_size": get_dominant_font_size(shape)
                })
            if shape.has_table:
                slide_data["tables"].append(
                    extract_table(shape.table)
                )
            if shape.has_chart:
                slide_data["charts"].append(
                    extract_chart_data(shape.chart)
                )
            if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                img_path = save_image(shape.image, i, filepath)
                slide_data["images"].append(img_path)
        deck["slides"].append(slide_data)
    return deck


def extract_chart_data(chart):
    """Pull the actual data behind a chart, not just the visual."""
    title = None
    if chart.has_title and chart.chart_title.has_text_frame:
        title = chart.chart_title.text_frame.text
    data = {
        "chart_type": str(chart.chart_type),
        "title": title,
        "series": []
    }
    for plot in chart.plots:
        # Categories belong to the plot, so read them once per plot
        categories = [str(c) for c in plot.categories]
        for series in plot.series:
            data["series"].append({
                "name": series.name,
                # Missing data points come back as None; keep them as None
                "values": [None if v is None else float(v) for v in series.values],
                "categories": categories
            })
    return data
```
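The listing calls several helpers without defining them. Here is a minimal sketch of what four of them might look like. These bodies are my assumptions, not the agent's real code; they only touch attributes that python-pptx shapes, tables, and slides expose (`left`/`top`/`width`/`height` in EMU, `run.font.size`, `row.cells`, `has_notes_slide`), so they are written against plain attribute access and need no import from `pptx`.

```python
from collections import Counter

EMU_PER_INCH = 914400  # python-pptx reports lengths in English Metric Units


def get_position(shape):
    """Return the shape's bounding box in inches (None where unset)."""
    def to_inches(emu):
        return None if emu is None else round(emu / EMU_PER_INCH, 2)
    return {
        "left": to_inches(shape.left),
        "top": to_inches(shape.top),
        "width": to_inches(shape.width),
        "height": to_inches(shape.height),
    }


def get_dominant_font_size(shape):
    """Return the explicit font size (points) covering the most characters."""
    chars_by_size = Counter()
    for paragraph in shape.text_frame.paragraphs:
        for run in paragraph.runs:
            if run.font.size is not None:  # None means the size is inherited
                chars_by_size[run.font.size.pt] += len(run.text)
    if not chars_by_size:
        return None  # every run inherits its size from the layout or theme
    return chars_by_size.most_common(1)[0][0]


def extract_table(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def extract_notes(slide):
    """Return the speaker notes text, or None when the slide has none."""
    if not slide.has_notes_slide:
        return None
    frame = slide.notes_slide.notes_text_frame
    return frame.text if frame is not None else None
```

One limitation worth knowing: `get_dominant_font_size` only sees sizes set directly on a run. Text that inherits its size from the slide layout reports `None`, so resolving inherited sizes would take extra work.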
The output JSON for a single deck:
```json
{
  "filename": "company-a-pitch.pptx",
  "slide_count": 22,
  "slides": [
    {
      "index": 1,
      "type": "title",
      "text_blocks": [
        {"text": "Company A - Series A", "font_size": 36},
        {"text": "Reinventing supply chain visibility", "font_size": 18}
      ]
    },
    {
      "index": 5,
      "type": "financials",
      "text_blocks": [{"text": "Financial Performance", "font_size": 28}],
      "charts": [{
        "chart_type": "COLUMN_CLUSTERED",
        "title": "Monthly Revenue",
        "series": [{
          "name": "Revenue",
          "values": [45000, 52000, 61000, 78000, 95000, 112000],
          "categories": ["Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]
        }]
      }]
    }
  ],
  "extracted_metrics": {
    "arr": "$1.34M",
    "mrr_growth": "18% MoM",
    "market_size": "$12B",
    "raise_amount": "$5M",
    "team_size": 14,
    "founded": "2024"
  }
}
```
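The ARR and growth figures in this sample line up with the chart series, which suggests they can be derived from it. The sketch below shows one way to do that. It assumes ARR means the latest month annualized (times 12) and that growth is measured on the last two months only; the agent may compute these differently, and `derive_revenue_metrics` is a hypothetical name.

```python
def derive_revenue_metrics(monthly_revenue):
    """Derive run-rate ARR and latest month-over-month growth.

    Assumptions: ARR = latest month * 12, and growth compares only the
    final two months in the series.
    """
    if len(monthly_revenue) < 2:
        raise ValueError("need at least two months of revenue")
    previous, latest = monthly_revenue[-2], monthly_revenue[-1]
    if previous == 0:
        raise ValueError("previous month is zero; growth is undefined")
    arr = latest * 12
    growth = (latest - previous) / previous
    return {
        "arr": f"${arr / 1_000_000:.2f}M",
        "mrr_growth": f"{growth:.0%} MoM",
    }


metrics = derive_revenue_metrics([45000, 52000, 61000, 78000, 95000, 112000])
print(metrics)
```

With the sample series, 112,000 × 12 = 1,344,000 and (112,000 − 95,000) / 95,000 ≈ 0.179, which matches the "$1.34M" and "18% MoM" values in the JSON.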
The comparison analysis across all 5 decks:
```json
{
  "comparison": {
    "slide_structure": {
      "all_include": ["title", "problem", "solution", "market", "team", "ask"],
      "most_include": ["traction", "financials", "competitive"],
      "unique_to_one": {
        "company-c": ["regulatory_landscape"],
        "company-e": ["case_studies"]
      }
    },
    "metrics_comparison": {
      "arr_range": ["$800K", "$4.2M"],
      "growth_range": ["12% MoM", "22% MoM"],
      "raise_range": ["$3M", "$8M"],
      "average_slide_count": 21
    },
    "patterns": [
      "4/5 lead with a customer quote, not the problem statement",
      "3/5 put financials in the second half (after slide 12)",
      "Only 2/5 include a clear competitive matrix",
      "All 5 have the team slide within the last 4 slides"
    ]
  }
}
```
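The `slide_structure` part of this comparison is essentially set arithmetic over the slide types found in each deck. A rough sketch follows, assuming "most" means more than half of the decks but not all of them. That threshold and the function name `compare_slide_structure` are my assumptions, not documented behavior.

```python
from collections import Counter


def compare_slide_structure(deck_types):
    """Group slide types by how many decks contain them.

    deck_types maps a deck name to the list of slide types found in it.
    """
    total = len(deck_types)
    type_sets = {name: set(types) for name, types in deck_types.items()}
    # Count decks per type (not slides per type), hence the sets above
    counts = Counter(t for types in type_sets.values() for t in types)

    all_include = sorted(t for t, n in counts.items() if n == total)
    # Assumed threshold: more than half of the decks, but not all of them
    most_include = sorted(t for t, n in counts.items() if total / 2 < n < total)
    unique_to_one = {}
    for name, types in type_sets.items():
        unique = sorted(t for t in types if counts[t] == 1 and total > 1)
        if unique:
            unique_to_one[name] = unique
    return {
        "all_include": all_include,
        "most_include": most_include,
        "unique_to_one": unique_to_one,
    }
```

Feeding it the `type` field of every slide in each extracted deck would produce the three buckets shown above. The positional patterns ("team slide within the last 4 slides") need the slide indices as well and are not covered here.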
The Results
| Metric | Manual Analysis | Agent Extraction |
|---|---|---|
| Time per deck | ~2 hours | 30 seconds |
| Total for 5 decks | ~10 hours | 2.5 minutes |
| Chart data extracted | Eyeballed from visuals | Exact underlying values |
| Cross-deck comparison | Manual spreadsheet | Automated JSON diff |
| Patterns identified | Subjective impressions | Data-driven structural analysis |
| Repeatable | Start from scratch each time | Same pipeline, new decks |
The chart data extraction alone is worth it. When a deck shows a revenue chart, the agent pulls the actual numbers behind the chart, not what you squint at on the visual. That's the difference between "looks like they're growing fast" and "18% MoM, $112K last month."
Try It Yourself
Drop your PPTX files in a directory and point the agent at them. You'll get structured JSON per deck plus a comparison file if you enable the analysis mode. Works with any PowerPoint β investor decks, sales decks, internal presentations.
The extracted metrics feature uses pattern matching for common business terms (ARR, MRR, TAM, team size, raise amount). It's not perfect: sometimes the raise amount is on the title slide, sometimes buried in notes. But it catches 80%+ of standard pitch deck metrics without configuration.
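To make "pattern matching" concrete, here is a small sketch of the idea using regular expressions. The patterns are illustrative guesses on my part, not the agent's actual rules: they only catch a label followed shortly by a dollar amount with a K, M, or B suffix.

```python
import re

# A dollar amount such as "$1.34M" or "$12 B" (hypothetical pattern)
MONEY = r"\$\s?(\d+(?:\.\d+)?)\s?([KMB])\b"

# Label, then up to 20 non-dollar characters on the same line, then the amount
PATTERNS = {
    "arr": re.compile(r"\bARR\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
    "market_size": re.compile(r"\b(?:TAM|market size)\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
    "raise_amount": re.compile(r"\b(?:raising|raise)\b[^$\n]{0,20}" + MONEY, re.IGNORECASE),
}


def extract_metrics(text):
    """Return the first match for each known metric as a normalized string."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            amount, unit = match.groups()
            found[key] = f"${amount}{unit.upper()}"
    return found


sample = "ARR: $1.34M\nTAM of $12B\nRaising $5M Series A"
print(extract_metrics(sample))
```

Amounts written as "$5 million", or placed before the label ("$5M raise"), slip past these patterns, which is the kind of gap behind the "not perfect" caveat. Running the same function over speaker notes as well as slide text would help with metrics buried in notes.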
Reading a pitch deck is 10% insight, 90% data extraction. Automate the 90%. Spend your time on the 10% that requires judgment.