📄 Midterm: AI Workflow Tutorial
A step-by-step guide to building an automated system that generates JAMA Network Open research papers from any dataset.
What we built: Type /write-a-paper-using-the-data-in-the-folder and get a publication-ready JAMA paper in under 90 minutes — with zero human intervention. This tutorial teaches you how.
Table of Contents
- The Big Picture
- Prerequisites
- Understanding Claude Code's Building Blocks
- Step 1: Design the Pipeline
- Step 2: Write the Skills
- Step 3: Build the Workflow Tools
- Step 4: Write the Slash Command
- Step 5: Configure the Project
- Running the Workflow End-to-End
- Design Principles
- Bonus: Post-Paper Products
- Troubleshooting
Tutorial Video
Watch the full presentation walkthrough, auto-generated with AI narration (see Bonus: Post-Paper Products for how it was made).
1. The Big Picture
Our workflow has three layers that work together: a slash command that defines the pipeline, skills that encode domain knowledge, and tools that do the mechanical work.
Think of it like a restaurant: the slash command is the head chef's recipe, the skills are the chef's training, and the tools are the kitchen equipment.
The end-to-end pipeline runs from raw data files to a compiled PDF; Step 1 below breaks it into stages.
2. Prerequisites
| Tool | Why | Install |
|---|---|---|
| Claude Code | AI agent that runs the workflow | npm install -g @anthropic-ai/claude-code |
| Python 3.11+ | Data analysis & visualization | conda create -n research python=3.11 |
| LaTeX | Paper compilation | brew install --cask mactex |
| Python packages | Stats & plotting | pip install pandas matplotlib seaborn scipy statsmodels |
3. Understanding Claude Code's Building Blocks
3.1 Claude Code
Claude Code is Anthropic's AI coding agent. Unlike regular Claude, it runs in your terminal and can read/write files, execute scripts, and call external APIs — all orchestrated by natural language.
3.2 Skills: Teaching AI Domain Knowledge
Skills are markdown files in .claude/skills/ that give the AI domain expertise. Think of them as giving a new team member a set of reference documents before they start a job.
```
.claude/skills/
├── jama-writing.md    # "Here's how to write a JAMA paper"
├── data-analysis.md   # "Here's how to pick statistical methods"
└── latex-build.md     # "Here's how to compile LaTeX"
```
Why skills beat prompts: Skills are persistent (loaded every time), modular (update independently), and testable. This is far more reliable than cramming instructions into a single prompt.
3.3 Commands: Multi-Step Workflows
Commands in .claude/commands/ define entire pipelines. When you type /command-name, Claude reads the file and executes each step autonomously.
3.4 MCP: External API Access
Model Context Protocol lets Claude call external APIs directly. We use PubMed MCP to search papers and fetch citation metadata — ensuring zero hallucinated references.
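Claude Code discovers MCP servers from a project-level .mcp.json file. A minimal sketch of the registration (the package name in angle brackets is a placeholder; substitute whichever PubMed MCP server you use):

```json
{
  "mcpServers": {
    "pubmed": {
      "command": "npx",
      "args": ["-y", "<pubmed-mcp-server-package>"]
    }
  }
}
```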
4. Step 1: Design the Pipeline
Before writing code, plan your stages carefully:
| Stage | Input | Output |
|---|---|---|
| 0. Setup | — | Domain knowledge loaded |
| 1. Data Discovery | Raw data files | data_profile.json |
| 2. Research Question | Data profile | research_question.md |
| 3. Analysis | Cleaned data | analysis_report.json, figures/ |
| 4. Paper Writing | All outputs | paper.tex, references.bib |
| 5. Compilation | LaTeX source | paper.pdf |
| 6. Quality Check | Final PDF | Verified paper |
Key principle: Make each stage produce a concrete artifact. If something goes wrong at Stage 4, you can inspect Stage 3's JSON output without re-running everything.
5. Step 2: Write the Skills
Skills are the most important part. They make the AI produce consistent, high-quality output.
5.1 JAMA Writing Skill
```markdown
# JAMA Network Open Writing Guide

## Section Structure (exact order)
1. Title (descriptive, includes study design)
2. Abstract (structured: Importance, Objective, Design, Exposure,
   Main Outcomes, Results, Conclusions)
3. Key Points (Question, Findings, Meaning)
4. Introduction (3-5 paragraphs)
5. Methods
6. Results
7. Discussion
8. Limitations
9. Conclusions
10. References (Vancouver style, minimum 8 sources)

## Writing Rules
- Use "associated with" — NEVER "caused" or "led to"
- Report effect sizes with 95% CIs
- P-values as exact values: "P = .003" not "P < .05"
- Table 1 is ALWAYS descriptive statistics
```
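Rules this mechanical can be enforced in code rather than prose. A toy formatter for the P-value rule (our illustration, not part of the published skill file):

```python
def format_p(p: float) -> str:
    """Format a P value JAMA-style: exact value, no leading zero.

    Illustrative helper; JAMA reports P < .01 to 3 decimals,
    larger values to 2 decimals, and very small values as "P < .001".
    """
    if p < 0.001:
        return "P < .001"
    digits = 3 if p < 0.01 else 2
    # lstrip("0") drops the leading zero: 0.003 → .003
    return "P = " + f"{p:.{digits}f}".lstrip("0")
```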
5.2 Data Analysis Skill
```markdown
# Statistical Analysis Guide

## Method Selection
| Data Pattern          | Recommended Method       |
|-----------------------|--------------------------|
| Binary outcome        | Logistic regression      |
| Continuous outcome    | Linear regression        |
| Count data            | Poisson / Neg. binomial  |
| Policy evaluation     | Difference-in-differences|
| Time trends           | Interrupted time series  |

## Key Principle
Let the DATA drive method selection, not a predetermined script.
```
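That principle can be sketched in code. A toy dispatcher (our illustration, not part of the workflow) that maps an outcome column to a row of the method-selection table:

```python
import pandas as pd

def select_method(outcome: pd.Series) -> str:
    """Map an outcome column to a method from the table above (toy version)."""
    values = outcome.dropna()
    if values.nunique() == 2:
        return "logistic regression"          # binary outcome
    if pd.api.types.is_integer_dtype(values) and (values >= 0).all():
        return "poisson / negative binomial"  # non-negative count data
    if pd.api.types.is_numeric_dtype(values):
        return "linear regression"            # continuous outcome
    return "manual review"
```

In the actual workflow the AI makes this decision by reading the skill file and the data profile together; the point is that the mapping is explicit enough to be coded at all.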
5.3 LaTeX Build Skill
```markdown
# LaTeX Compilation Guide

## Build Sequence
1. pdflatex paper.tex (pass 1)
2. bibtex paper (resolve references)
3. pdflatex paper.tex (pass 2)
4. pdflatex paper.tex (pass 3)

## Common Fixes
- Unescaped _ → \_
- Missing $ → Wrap math in $...$
- File not found → Check figure paths
```
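The unescaped-character fixes lend themselves to automation. A small escaping helper (our illustration; a real build would run text fields through it before they reach the template):

```python
# Map of LaTeX special characters to their escaped forms; going character by
# character means the backslash replacement never gets double-escaped
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

def latex_escape(text: str) -> str:
    """Escape text destined for a LaTeX document, one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)
```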
How to Write Effective Skills
- Be specific: "binary outcome → logistic regression" not "use appropriate methods"
- Use tables and checklists: The AI follows structured formats more reliably
- Include examples: Show what good output looks like
- Anticipate errors: List common mistakes and fixes
- One skill per domain: Don't mix writing advice with stats guidance
6. Step 3: Build the Workflow Tools
6.1 Data Profiler
Automatically discovers and profiles any dataset in a directory:

```python
"""data_profiler.py — discovers and profiles datasets"""
import json
import os
import sys

import pandas as pd


def profile_directory(data_dir):
    profile = {"files": []}
    for root, _dirs, files in os.walk(data_dir):
        for fname in files:
            fpath = os.path.join(root, fname)
            ext = os.path.splitext(fname)[1].lower()
            if ext in ('.csv', '.xlsx'):
                df = pd.read_csv(fpath) if ext == '.csv' else pd.read_excel(fpath)
                profile["files"].append({
                    "file": fname, "shape": list(df.shape),
                    "columns": list(df.columns),
                    "dtypes": {c: str(df[c].dtype) for c in df.columns},
                    "missing": {c: int(df[c].isna().sum()) for c in df.columns},
                })
            elif ext in ('.txt', '.md'):
                with open(fpath) as f:
                    content = f.read()
                profile["files"].append({
                    "file": fname, "type": "text",
                    "preview": content[:5000],
                })
    return profile


if __name__ == "__main__":
    result = profile_directory(sys.argv[1])
    out = sys.argv[2] if len(sys.argv) > 2 else "data_profile.json"
    with open(out, 'w') as f:
        json.dump(result, f, indent=2, default=str)
    print(f"Profiled {len(result['files'])} files")
```
6.2 LaTeX Builder
Multi-pass compilation with error diagnostics:

```python
"""latex_builder.py — compiles .tex to .pdf"""
import os
import subprocess
import sys


def build_pdf(tex_path):
    tex_dir = os.path.dirname(os.path.abspath(tex_path))
    tex_name = os.path.splitext(os.path.basename(tex_path))[0]
    # pdflatex → bibtex → pdflatex × 2 so citations and cross-references resolve
    for cmd in [
        ["pdflatex", "-interaction=nonstopmode", tex_name],
        ["bibtex", tex_name],
        ["pdflatex", "-interaction=nonstopmode", tex_name],
        ["pdflatex", "-interaction=nonstopmode", tex_name],
    ]:
        subprocess.run(cmd, cwd=tex_dir, capture_output=True)
    pdf = os.path.join(tex_dir, f"{tex_name}.pdf")
    if os.path.exists(pdf):
        print("Success!")
    else:
        # Surface the tail of the .log file so the failure can be diagnosed
        log = os.path.join(tex_dir, f"{tex_name}.log")
        if os.path.exists(log):
            with open(log, errors="replace") as f:
                print("".join(f.readlines()[-30:]))
        print("Error: PDF not generated")


if __name__ == "__main__":
    build_pdf(sys.argv[1])
```
Why generic tools? Both scripts are dataset-agnostic. The same tools handle completely different research topics without modification.
7. Step 4: Write the Slash Command
The slash command is the heart of the workflow — a markdown file defining every step:
```markdown
# .claude/commands/write-a-paper-using-the-data-in-the-folder.md

## Stage 0: Setup
Read skill files: jama-writing.md, data-analysis.md, latex-build.md
Create: exam_paper/figures/, exam_paper/tables/

## Stage 1: Data Discovery
Run: python workflow/data_profiler.py <data_dir> exam_paper/data_profile.json

## Stage 2: Research Question
Identify exposure, outcome, population, time frame.
Formulate: "Is [exposure] associated with [outcome] among [population]?"
Save: exam_paper/research_question.md

## Stage 3: Analysis
Select method based on data structure (follow data-analysis skill).
Create Table 1, run primary analysis, generate ≥2 figures at 300 DPI.
Save: analysis_report.json, figures/, tables/

## Stage 4: Paper Writing
Copy template.tex → paper.tex. Fill every section (jama-writing skill).
Search PubMed for 8-12 real references. NEVER fabricate statistics.
Save: paper.tex, references.bib

## Stage 5: Compile
Run: python workflow/latex_builder.py exam_paper/paper.tex

## Stage 6: Quality Check
✓ PDF exists ✓ ≤10 pages ✓ No placeholders
✓ ≥2 figures ✓ Real references ✓ All CIs reported
```
Command Design Tips
- Be explicit about inputs and outputs at each stage
- Include fallback data paths (data might be in different locations)
- Specify file formats: "Save as PDF and PNG at 300 DPI"
- Add quality gates at the end to catch errors
- Reference skill files at the top so they load first
8. Step 5: Configure the Project
```
your-project/
├── .claude/
│   ├── commands/
│   │   └── write-a-paper-using-the-data-in-the-folder.md
│   └── skills/
│       ├── jama-writing.md
│       ├── data-analysis.md
│       └── latex-build.md
├── workflow/
│   ├── data_profiler.py
│   ├── latex_builder.py
│   └── templates/
├── exam_paper/
│   └── data/          ← drop your dataset here
├── CLAUDE.md
└── Readme.md
```
The CLAUDE.md file provides project-level instructions that Claude Code reads automatically:
```markdown
# CLAUDE.md

## Environment
- Python: conda activate new_deep_learning (Python 3.11)
- LaTeX: pdflatex and bibtex

## Key Rules
1. Never hardcode dataset-specific logic
2. All outputs go to exam_paper/
3. Figures: PDF + PNG at 300 DPI
4. Page limit: ≤10 pages
5. No fabrication: all numbers from actual analysis
```
9. Running the Workflow End-to-End
```shell
# 1. Navigate to your project
cd your-project/

# 2. Place your data
cp your-data/*.csv exam_paper/data/

# 3. Start Claude Code (with full permissions)
claude --dangerously-skip-permissions

# 4. Type the magic words
> Write a paper using the data in the folder
```
That's it. Here's what happens behind the scenes:
| Time | Stage | What Happens |
|---|---|---|
| 0:00 | Setup | Read all skill files into context |
| 0:30 | Profile | data_profiler.py → data_profile.json |
| 2:00 | Research Q | AI formulates PICO hypothesis |
| 5:00 | Analysis | Writes & runs analysis.py, generates figures |
| 25:00 | Writing | Fills JAMA template + PubMed citations |
| 40:00 | Compile | pdflatex → bibtex → pdflatex × 2 |
| 45:00 | QA | 9-point quality checklist → paper.pdf ready |
Our Exam Result
During the midterm, the workflow received a dataset about NIH grant funding disruptions (5,419 grants, 15 variables) and produced:
- Research question: Predictors of NIH grant reinstatement after the 2025 funding crisis
- Analysis: Multivariable logistic regression (pseudo-R² = 0.44)
- Key finding: Court involvement was the strongest predictor (OR = 144.6; 95% CI, 69.4–301.6)
- Paper: 9 pages, 3 figures, 1 table, 8 verified PubMed references
- Time: ~50 minutes, zero human intervention
10. Design Principles
Principle 1: Skills > Prompts
Detailed skill files are dramatically more effective than prompt engineering. Skills are persistent, modular, and testable.
Principle 2: Data Drives Everything
Zero hardcoded logic. The AI reads the data profile, then decides: research question, statistical method, figure count, section content. This makes the workflow truly generic.
Principle 3: Verify, Don't Trust
- Statistics: All numbers from analysis_report.json, never from AI memory
- Citations: Every reference fetched from the PubMed API via MCP
- Quality: Automated 9-point checklist catches errors
Principle 4: Concrete Artifacts at Every Stage
Each stage produces an inspectable file. If Stage 4 fails, check Stage 3's JSON — no need to re-run everything.
Principle 5: One Command, One Pipeline
A single end-to-end command is easier to test, more reliable, and more impressive than multiple fragmented commands.
11. Bonus: Post-Paper Products
Automated Slide Deck
We use pptxgenjs to programmatically generate a 13-slide presentation with JAMA-inspired design:
```shell
npm install pptxgenjs
node exam_paper/create_slides.js
# → exam_paper/presentation.pptx
```
Automated Presentation Video
Combine slide images with AI-generated narration:
```shell
pip install edge-tts moviepy
brew install --cask libreoffice
python exam_paper/make_video.py
# → exam_paper/presentation.mp4
```
The pipeline: LibreOffice converts PPTX → images, edge-tts converts script → audio, moviepy combines them → MP4.
12. Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| AI gets stuck / loops | Ambiguous skill instruction | Make the skill more specific |
| LaTeX compilation fails | Unescaped special characters | Check .log file, escape _ & % # $ |
| AI fabricates statistics | Not reading from JSON | Add "ALL stats from analysis_report.json" to command |
| Citations missing in PDF | bibtex didn't run | Full 4-step compile: pdflatex → bibtex → pdflatex × 2 |
| Paper exceeds 10 pages | Too much content | Trim Discussion, move details to supplement |
Conclusion
Building an automated research workflow is less about clever prompts and more about designing a system. The key insight: AI works best when you encode domain knowledge in structured skill files, define clear pipelines with concrete outputs, build generic tools, and verify everything.
Our workflow generated a 9-page JAMA paper in under 50 minutes during a live exam. The same workflow works on any public health dataset without modification.
The future of AI-assisted research isn't about replacing human judgment — it's about automating the tedious parts so researchers can focus on what matters: asking good questions and interpreting the answers.
Resources
- Our GitHub Repository — all code, skills, commands, and outputs
- Reference: Claude Code Academic Workflow
- Claude Skills Development
Built with Claude Code, Python, LaTeX, and a lot of iterating. No AI was harmed in the making of this workflow (though several LaTeX compilation attempts were).