Researcher
ML-Powered Biomarker Discovery: Design a Publication-Ready Study in 30 Minutes
Key Takeaway
The Non-Tumor ML Research Planner skill generates complete machine learning study designs for non-cancer diseases, from GEO dataset selection through diagnostic model building to immune infiltration analysis. One prompt produces a study framework that would take a research team weeks to plan, with 4 workload tiers ranging from a 2-week sprint to a 14-week publication project.
The Problem
You're a researcher who wants to use machine learning to discover diagnostic biomarkers for a non-cancer disease (say, rheumatoid arthritis, Alzheimer's, or diabetic nephropathy). The challenge:
- Which GEO datasets are relevant and high-quality?
- Which ML algorithms are appropriate for your sample size?
- How do you handle feature selection with thousands of genes?
- How do you validate your model properly (not just overfit)?
- What's the right way to integrate immune infiltration analysis?
- How do you design this for actual publication, not just a class project?
Designing a rigorous ML study takes 2-4 weeks of methodological planning. Junior researchers make predictable mistakes: they don't correct for batch effects, they overfit with aggressive feature selection, they use accuracy instead of AUC on imbalanced datasets, and they skip external validation.
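To see why accuracy misleads on imbalanced cohorts, consider a minimal sketch in plain Python. The 90/10 cohort and the constant "classifier" are hypothetical, chosen only to illustrate the point; AUC is computed here from its rank interpretation (the chance that a random case outscores a random control), not with any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties count as half a win. This is the O(n_pos * n_neg) textbook form,
    fine for a toy example but slow for large cohorts.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 90 controls (label 0) and 10 cases (label 1).
y_true = [0] * 90 + [1] * 10

# A useless model: same score for everyone, always predicts "control".
constant_scores = [0.1] * 100
constant_pred = [0] * 100

majority_accuracy = accuracy(y_true, constant_pred)  # 0.9
majority_auc = roc_auc(y_true, constant_scores)      # 0.5
```

The model that never detects a single case still reports 90% accuracy, while its AUC of 0.5 correctly flags it as no better than chance.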
The Solution
The Non-Tumor ML Research Planner generates complete study designs following 5 validated patterns, with explicit methodology choices, statistical thresholds, and failure contingencies at every step.
The Process
You: Design a machine learning study to identify diagnostic
biomarkers for early-stage Alzheimer's disease using
publicly available transcriptomic data.
The agent generates a structured study design:
## ML Biomarker Discovery: Early Alzheimer's Disease
### STUDY PATTERN: DEG-to-Diagnostic
(Differential Expression → Feature Selection → ML Model)
### SPECIFIC AIMS
Aim 1: Identify differentially expressed genes between
early AD patients and cognitively normal controls using
GEO transcriptomic datasets (DISCOVERY)
Aim 2: Build and validate a diagnostic classifier using
ensemble ML with cross-validation and external test set
(MODEL)
Aim 3: Characterize immune cell infiltration patterns
associated with early AD using computational deconvolution
(MECHANISM)
Aim 4: Identify hub genes through PPI network analysis and
validate clinical relevance through survival/progression
correlation (TRANSLATIONAL)
### DATASET SELECTION
Training datasets (merged):
- GSE63060: Blood transcriptome, 145 AD + 104 controls ✓
- GSE63061: Blood transcriptome, 139 AD + 135 controls ✓
- Combined n=523 (sufficient for ML)
External validation:
- GSE97760: Independent cohort, 36 AD + 36 controls
- GSE122063: Brain tissue (different tissue = stronger
validation)
Quality checks:
✓ Platform compatibility (all Illumina HumanHT-12)
✓ Batch effect correction (ComBat or limma::removeBatchEffect)
✓ Sample quality (boxplots, PCA for outliers)
✓ Clinical covariate balance (age, sex, APOE status)
### ANALYTICAL PIPELINE (14 Steps)
Step 1: Data preprocessing
Tool: R/Bioconductor (GEOquery, limma)
Normalization: quantile normalization
Batch correction: ComBat (sva package)
Threshold: Remove probes with detection p>0.05 in
>50% of samples
Step 2: Differential expression analysis
Tool: limma
Model: ~disease_status + age + sex + APOE
Thresholds: |logFC| > 0.5, adj.p < 0.05
Expected: 200-800 DEGs
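As an illustration of Step 2's thresholds (an editorial sketch, not part of the generated design), the filter below applies a Benjamini-Hochberg adjustment and then keeps genes with |logFC| > 0.5 and adjusted p < 0.05. In practice limma reports adjusted p-values itself; the pure-Python version, the function names, and the toy gene table are all assumptions made for clarity.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(results, lfc_cutoff=0.5, alpha=0.05):
    """results: list of (gene, logFC, raw_p) tuples.

    Keeps genes with |logFC| > lfc_cutoff and BH-adjusted p < alpha.
    """
    adj = benjamini_hochberg([p for _, _, p in results])
    return [gene for (gene, lfc, _), q in zip(results, adj)
            if abs(lfc) > lfc_cutoff and q < alpha]


# Hypothetical toy results, not real expression data.
toy_results = [
    ("GENE_A", 1.20, 0.001),
    ("GENE_B", -0.80, 0.010),
    ("GENE_C", 0.30, 0.002),  # significant, but effect size too small
    ("GENE_D", 0.90, 0.200),  # large effect, but not significant
]
degs = select_degs(toy_results)  # ["GENE_A", "GENE_B"]
```

Both conditions must hold: GENE_C fails on effect size and GENE_D fails on significance.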
Step 3: Feature selection (3-layer approach)
Layer 1: DEG filtering (Step 2 output)
Layer 2: LASSO regression (glmnet, lambda.min)
Layer 3: Random Forest variable importance (top 50)
Intersection: genes passing ≥2 of 3 methods
Expected: 15-40 candidate biomarkers
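Step 3's rule is a vote rather than a strict intersection: a gene survives if at least two of the three layers select it. A minimal sketch of that rule follows (editorial illustration only; the function name and gene labels are hypothetical).

```python
from collections import Counter


def consensus_features(selections, min_votes=2):
    """Return genes chosen by at least `min_votes` selection methods.

    selections: dict mapping method name -> iterable of gene names.
    """
    votes = Counter(g for genes in selections.values() for g in set(genes))
    return sorted(g for g, v in votes.items() if v >= min_votes)


# Hypothetical outputs of the three layers.
selections = {
    "deg": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "lasso": {"GENE_B", "GENE_C", "GENE_E"},
    "random_forest": {"GENE_C", "GENE_D", "GENE_F"},
}
candidates = consensus_features(selections)  # ["GENE_B", "GENE_C", "GENE_D"]
```

Setting `min_votes=3` would recover the strict intersection, which in this toy case leaves only GENE_C.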
Step 4: ML model building
Algorithms: Random Forest, SVM (radial kernel),
XGBoost, LASSO logistic regression
Validation: 10-fold cross-validation, repeated 100×
Metric: AUC-ROC (primary), sensitivity, specificity
Threshold: AUC > 0.80 for clinical utility
Step 5: Ensemble model
Method: Weighted average of top 2 performing algorithms
Weight: proportional to CV-AUC
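One way to read Step 5 is sketched below (editorial illustration, not the skill's actual implementation): pick the two algorithms with the highest cross-validated AUC, normalize their AUCs into weights, and average their predicted probabilities. The AUC values and probabilities are invented for the example.

```python
def auc_weighted_ensemble(probabilities, cv_auc, top_k=2):
    """Blend the top_k models, weighting each by its share of CV-AUC.

    probabilities: dict model name -> list of predicted P(disease) per sample.
    cv_auc: dict model name -> cross-validated AUC.
    Returns (weights, blended_probabilities).
    """
    top = sorted(cv_auc, key=cv_auc.get, reverse=True)[:top_k]
    total = sum(cv_auc[name] for name in top)
    weights = {name: cv_auc[name] / total for name in top}
    n_samples = len(probabilities[top[0]])
    blended = [sum(weights[name] * probabilities[name][i] for name in top)
               for i in range(n_samples)]
    return weights, blended


# Hypothetical CV results and per-sample probabilities.
cv_auc = {"random_forest": 0.90, "svm": 0.82, "xgboost": 0.88, "lasso": 0.80}
probabilities = {
    "random_forest": [0.9, 0.2, 0.6],
    "svm": [0.7, 0.4, 0.5],
    "xgboost": [0.8, 0.1, 0.7],
    "lasso": [0.6, 0.3, 0.4],
}
weights, blended = auc_weighted_ensemble(probabilities, cv_auc)
```

Because the top AUCs are usually close, the weights end up near 0.5 each; the ensemble's own performance still has to be estimated with the same cross-validation scheme, not assumed from its components.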
Step 6: External validation
Apply model to GSE97760 and GSE122063
Report: AUC, sensitivity, specificity, calibration plot
Decision: AUC > 0.75 on external set = proceed to paper
Step 7: CIBERSORT immune deconvolution
Tool: CIBERSORT (LM22 signature matrix)
Analysis: Compare immune cell fractions AD vs control
Correlation: hub genes vs immune cell proportions
[Steps 8-14: PPI network, GO/KEGG enrichment, hub gene
identification, clinical correlation, figure preparation,
manuscript outline]
### WORKLOAD TIERS
| Tier | Timeline | Figures | Target Journal |
|------|----------|---------|---------------|
| Lite | 2-4 weeks | 5-6 | Brief communication |
| Standard | 4-8 weeks | 8-10 | Specialty journal |
| Advanced | 8-14 weeks | 12-16 | High-impact specialty |
| Publication+ | 14+ weeks | 16-20 | High-impact general |
### COMMON PITFALLS (pre-empted)
⚠️ Overfitting: 10-fold CV + external validation prevents this
⚠️ Batch effects: ComBat + PCA visualization before analysis
⚠️ Imbalanced classes: Use AUC not accuracy, apply SMOTE if needed
⚠️ Multiple testing: Benjamini-Hochberg correction for all DEG
⚠️ Data leakage: Feature selection INSIDE CV folds, not before
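The data-leakage pitfall is the easiest to get wrong, so here is a minimal sketch of the difference in plain Python. It is an editorial illustration under loud assumptions: the data are pure random noise with labels unrelated to the features, the selector is a simple class-mean gap, and the classifier is nearest-centroid. None of this is the skill's own pipeline.

```python
import random
from statistics import mean


def top_features(X, y, rows, k):
    """Rank features by absolute class-mean gap using ONLY the given rows."""
    def gap(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_predict(X, y, train, test, features):
    """Nearest-centroid classifier restricted to the selected features."""
    def centroid(label):
        members = [i for i in train if y[i] == label]
        return [mean([X[i][j] for i in members]) for j in features]
    c0, c1 = centroid(0), centroid(1)

    def dist(i, c):
        return sum((X[i][j] - cj) ** 2 for j, cj in zip(features, c))
    return [1 if dist(i, c1) < dist(i, c0) else 0 for i in test]


def cv_accuracy(X, y, k_features, n_folds, select_inside):
    """k-fold CV accuracy, selecting features inside or outside the folds."""
    idx = list(range(len(X)))
    folds = [idx[f::n_folds] for f in range(n_folds)]
    # Leaky variant: selection sees every row, including future test rows.
    leaked = None if select_inside else top_features(X, y, idx, k_features)
    correct = 0
    for test in folds:
        train = [i for i in idx if i not in test]
        feats = top_features(X, y, train, k_features) if select_inside else leaked
        preds = centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)


# Hypothetical noise-only "expression matrix": 40 samples x 500 genes.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(500)] for _ in range(40)]
y = [i % 2 for i in range(40)]  # labels carry no real signal

acc_leaky = cv_accuracy(X, y, k_features=10, n_folds=5, select_inside=False)
acc_honest = cv_accuracy(X, y, k_features=10, n_folds=5, select_inside=True)
```

Since the labels are unrelated to the features, an honest estimate should hover around chance, while selecting genes on the full dataset first would be expected to produce an optimistic figure. Any gap between `acc_leaky` and `acc_honest` is an artifact of the evaluation procedure, not of biology.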
The Results
| Metric | Manual Study Design | AI Agent |
|---|---|---|
| Time to complete design | 2-4 weeks | 30 minutes |
| Methodological completeness | Variable (experience-dependent) | 14 steps, all documented |
| Pitfall prevention | Learned from reviewer rejection | Pre-empted in design |
| Dataset recommendations | Hours of GEO browsing | Curated in minutes |
| Workload estimation | Optimistic guesses | 4 calibrated tiers |
| Reviewer-ready | After 2-3 revisions | First draft quality |
Setup on MrChief
skills:
- non-tumor-ml-research-planner
- medical-research-toolkit
- pubmed