BST236 Coding Blog

What we built: Type /write-a-paper-using-the-data-in-the-folder and get a publication-ready JAMA paper in under 90 minutes — with zero human intervention. This tutorial teaches you how.

Tutorial Video

Watch the full presentation walkthrough (auto-generated with AI narration):

1. The Big Picture

Our workflow has three layers that work together:

1. Slash command (the orchestrator): defines the pipeline in .claude/commands/write-a-paper-...md
2. AI skills (domain knowledge): teach the AI HOW to work: jama-writing.md, data-analysis.md, latex-build.md
3. Workflow tools (execution layer): Python scripts that DO things: data_profiler.py, latex_builder.py

Think of it like a restaurant: the slash command is the head chef's recipe, the skills are the chef's training, and the tools are the kitchen equipment.

The end-to-end pipeline:

Raw data → 1. Profile → 2. Research question → 3. Analyze → 4. Write → 5. Compile → 6. QA → paper.pdf

2. Prerequisites

| Tool            | Why                            | Install                                              |
|-----------------|--------------------------------|------------------------------------------------------|
| Claude Code     | AI agent that runs the workflow| npm install -g @anthropic-ai/claude-code             |
| Python 3.11+    | Data analysis & visualization  | conda create -n research python=3.11                 |
| LaTeX           | Paper compilation              | brew install --cask mactex                           |
| Python packages | Stats & plotting               | pip install pandas matplotlib seaborn scipy statsmodels |

3. Understanding Claude Code's Building Blocks

3.1 Claude Code

Claude Code is Anthropic's AI coding agent. Unlike regular Claude, it runs in your terminal and can read/write files, execute scripts, and call external APIs — all orchestrated by natural language.

3.2 Skills: Teaching AI Domain Knowledge

Skills are markdown files in .claude/skills/ that give the AI domain expertise. Think of them as giving a new team member a set of reference documents before they start a job.

.claude/skills/
├── jama-writing.md      # "Here's how to write a JAMA paper"
├── data-analysis.md     # "Here's how to pick statistical methods"
└── latex-build.md       # "Here's how to compile LaTeX"

Why skills beat prompts: Skills are persistent (loaded every time), modular (update independently), and testable. This is far more reliable than cramming instructions into a single prompt.

3.3 Commands: Multi-Step Workflows

Commands in .claude/commands/ define entire pipelines. When you type /command-name, Claude reads the file and executes each step autonomously.

3.4 MCP: External API Access

Model Context Protocol lets Claude call external APIs directly. We use PubMed MCP to search papers and fetch citation metadata — ensuring zero hallucinated references.
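MCP servers are registered in a .mcp.json file at the project root. A minimal sketch of the registration, assuming an npx-runnable server (the package name pubmed-mcp-server is a placeholder; substitute whichever PubMed MCP server you actually install):

```json
{
  "mcpServers": {
    "pubmed": {
      "command": "npx",
      "args": ["-y", "pubmed-mcp-server"]
    }
  }
}
```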

4. Step 1: Design the Pipeline

Before writing code, plan your stages carefully:

| Stage                | Input           | Output                          |
|----------------------|-----------------|---------------------------------|
| 0. Setup             |                 | Domain knowledge loaded         |
| 1. Data Discovery    | Raw data files  | data_profile.json               |
| 2. Research Question | Data profile    | research_question.md            |
| 3. Analysis          | Cleaned data    | analysis_report.json, figures/  |
| 4. Paper Writing     | All outputs     | paper.tex, references.bib       |
| 5. Compilation       | LaTeX source    | paper.pdf                       |
| 6. Quality Check     | Final PDF       | Verified paper                  |

Key principle: Make each stage produce a concrete artifact. If something goes wrong at Stage 4, you can inspect Stage 3's JSON output without re-running everything.
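This principle is easy to enforce mechanically. Here is a minimal sketch of a gate script (the check_artifacts helper is our own illustration, not part of the workflow; the stage-to-artifact mapping follows the pipeline stages described in this tutorial):

```python
"""Sketch: verify that each pipeline stage left its artifact behind."""
import os

# Stage → expected artifact, mirroring the pipeline stages
STAGE_ARTIFACTS = {
    "1. Data Discovery": "exam_paper/data_profile.json",
    "2. Research Question": "exam_paper/research_question.md",
    "3. Analysis": "exam_paper/analysis_report.json",
    "4. Paper Writing": "exam_paper/paper.tex",
    "5. Compilation": "exam_paper/paper.pdf",
}

def check_artifacts(base="."):
    """Return the first stage whose artifact is missing, or None if all exist."""
    for stage, artifact in STAGE_ARTIFACTS.items():
        if not os.path.exists(os.path.join(base, artifact)):
            return stage
    return None

if __name__ == "__main__":
    missing = check_artifacts()
    print(f"Pipeline broke at: {missing}" if missing else "All artifacts present")
```

Run it after any failed attempt and it points you straight at the stage to inspect.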

5. Step 2: Write the Skills

Skills are the most important part. They make the AI produce consistent, high-quality output.

5.1 JAMA Writing Skill

# JAMA Network Open Writing Guide

## Section Structure (exact order)
1. Title (descriptive, includes study design)
2. Abstract (structured: Importance, Objective, Design, Exposure,
   Main Outcomes, Results, Conclusions)
3. Key Points (Question, Findings, Meaning)
4. Introduction (3-5 paragraphs)
5. Methods, 6. Results, 7. Discussion, 8. Limitations, 9. Conclusions
10. References (Vancouver style, minimum 8 sources)

## Writing Rules
- Use "associated with" — NEVER "caused" or "led to"
- Report effect sizes with 95% CIs
- P-values as exact values: "P = .003" not "P < .05"
- Table 1 is ALWAYS descriptive statistics
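The P-value rule above is mechanical enough to enforce in code. A minimal sketch of a JAMA-style formatter (the function name format_p is ours; the digit rule, two decimals above .01 and three below, is a common reading of the style guide and should be checked against the journal's current requirements):

```python
def format_p(p):
    """Format a p-value JAMA-style: exact value, no leading zero."""
    if p < 0.001:
        return "P < .001"
    # Two decimal places for P >= .01, three below that
    digits = 2 if p >= 0.01 else 3
    return f"P = {p:.{digits}f}".replace("0.", ".")

print(format_p(0.003))   # → P = .003
print(format_p(0.0004))  # → P < .001
print(format_p(0.24))    # → P = .24
```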

5.2 Data Analysis Skill

# Statistical Analysis Guide

## Method Selection
| Data Pattern          | Recommended Method       |
|-----------------------|--------------------------|
| Binary outcome        | Logistic regression      |
| Continuous outcome    | Linear regression        |
| Count data            | Poisson / Neg. binomial  |
| Policy evaluation     | Difference-in-differences|
| Time trends           | Interrupted time series  |

## Key Principle
Let the DATA drive method selection, not a predetermined script.
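The selection table can be thought of as a lookup the analysis stage applies to the data profile. A sketch of the idea (choose_method and its outcome-typing heuristic are illustrative, not the actual workflow code, which lets the AI reason over the full profile):

```python
def choose_method(outcome_values):
    """Pick a regression family from the outcome column, per the table above."""
    distinct = set(outcome_values)
    if distinct <= {0, 1}:
        return "logistic regression"          # binary outcome
    if all(isinstance(v, int) and v >= 0 for v in outcome_values):
        return "Poisson / negative binomial"  # non-negative counts
    return "linear regression"                # continuous outcome

print(choose_method([0, 1, 1, 0]))       # → logistic regression
print(choose_method([3, 0, 7, 2]))       # → Poisson / negative binomial
print(choose_method([2.4, 3.1, 0.7]))    # → linear regression
```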

5.3 LaTeX Build Skill

# LaTeX Compilation Guide

## Build Sequence
1. pdflatex paper.tex  (pass 1)
2. bibtex paper        (resolve references)
3. pdflatex paper.tex  (pass 2)
4. pdflatex paper.tex  (pass 3)

## Common Fixes
- Unescaped _  →  \_
- Missing $   →  Wrap math in $...$
- File not found  →  Check figure paths
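The first two fixes can be automated before compilation ever runs. A minimal escaping helper (escape_latex is our illustration; it covers only the characters listed here, not full LaTeX sanitization, which would also need $, {, }, ^, ~, and backslash handling):

```python
def escape_latex(text):
    """Backslash-escape the characters that most often break pdflatex."""
    for ch in ['&', '%', '#', '_']:
        text = text.replace(ch, '\\' + ch)
    return text

print(escape_latex("p_value & 95% CI"))  # → p\_value \& 95\% CI
```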

How to Write Effective Skills

  1. Be specific: "binary outcome → logistic regression" not "use appropriate methods"
  2. Use tables and checklists: The AI follows structured formats more reliably
  3. Include examples: Show what good output looks like
  4. Anticipate errors: List common mistakes and fixes
  5. One skill per domain: Don't mix writing advice with stats guidance

6. Step 3: Build the Workflow Tools

6.1 Data Profiler

Automatically discovers and profiles any dataset in a directory:

"""data_profiler.py — discovers and profiles datasets"""
import pandas as pd, json, os, sys

def profile_directory(data_dir):
    profile = {"files": []}
    for root, dirs, files in os.walk(data_dir):
        for fname in files:
            fpath = os.path.join(root, fname)
            ext = os.path.splitext(fname)[1].lower()
            if ext in ('.csv', '.xlsx'):
                df = pd.read_csv(fpath) if ext == '.csv' else pd.read_excel(fpath)
                profile["files"].append({
                    "file": fname, "shape": list(df.shape),
                    "columns": list(df.columns),
                    "dtypes": {c: str(df[c].dtype) for c in df.columns},
                    "missing": {c: int(df[c].isna().sum()) for c in df.columns},
                })
            elif ext in ('.txt', '.md'):
                with open(fpath) as f: content = f.read()
                profile["files"].append({
                    "file": fname, "type": "text",
                    "preview": content[:5000]
                })
    return profile

if __name__ == "__main__":
    result = profile_directory(sys.argv[1])
    out = sys.argv[2] if len(sys.argv) > 2 else "data_profile.json"
    json.dump(result, open(out, 'w'), indent=2, default=str)
    print(f"Profiled {len(result['files'])} files")

6.2 LaTeX Builder

Multi-pass compilation with error diagnostics:

"""latex_builder.py — compiles .tex to .pdf"""
import subprocess, os, sys

def build_pdf(tex_path):
    tex_dir = os.path.dirname(os.path.abspath(tex_path))
    tex_name = os.path.splitext(os.path.basename(tex_path))[0]

    for cmd in [
        ["pdflatex", "-interaction=nonstopmode", tex_name],
        ["bibtex", tex_name],
        ["pdflatex", "-interaction=nonstopmode", tex_name],
        ["pdflatex", "-interaction=nonstopmode", tex_name],
    ]:
        subprocess.run(cmd, cwd=tex_dir, capture_output=True)

    pdf = os.path.join(tex_dir, f"{tex_name}.pdf")
    print("Success!" if os.path.exists(pdf) else "Error: PDF not generated")

if __name__ == "__main__":
    build_pdf(sys.argv[1])

Why generic tools? Both scripts are dataset-agnostic. The same tools handle completely different research topics without modification.

7. Step 4: Write the Slash Command

The slash command is the heart of the workflow — a markdown file defining every step:

# .claude/commands/write-a-paper-using-the-data-in-the-folder.md

## Stage 0: Setup
Read skill files: jama-writing.md, data-analysis.md, latex-build.md
Create: exam_paper/figures/, exam_paper/tables/

## Stage 1: Data Discovery
Run: python workflow/data_profiler.py <data_dir> exam_paper/data_profile.json

## Stage 2: Research Question
Identify exposure, outcome, population, time frame.
Formulate: "Is [exposure] associated with [outcome] among [population]?"
Save: exam_paper/research_question.md

## Stage 3: Analysis
Select method based on data structure (follow data-analysis skill).
Create Table 1, run primary analysis, generate ≥2 figures at 300 DPI.
Save: analysis_report.json, figures/, tables/

## Stage 4: Paper Writing
Copy template.tex → paper.tex. Fill every section (jama-writing skill).
Search PubMed for 8-12 real references. NEVER fabricate statistics.
Save: paper.tex, references.bib

## Stage 5: Compile
Run: python workflow/latex_builder.py exam_paper/paper.tex

## Stage 6: Quality Check
✓ PDF exists  ✓ ≤10 pages  ✓ No placeholders
✓ ≥2 figures  ✓ Real references  ✓ All CIs reported
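The checklist above is exactly the kind of thing to verify with code rather than eyeballs. A stdlib-only sketch of a quality gate (the file paths and placeholder patterns are illustrative; the page-count and reference checks would need a PDF library, so this version covers only the file-level checks):

```python
"""Sketch: file-level quality gate for the generated paper."""
import os
import re

def quality_gate(tex_path="exam_paper/paper.tex",
                 pdf_path="exam_paper/paper.pdf"):
    """Return a list of failed checks; an empty list means the gate passes."""
    failures = []
    if not os.path.exists(pdf_path):
        failures.append("PDF missing")
    if not os.path.exists(tex_path):
        failures.append("paper.tex missing")
        return failures
    with open(tex_path, errors="replace") as f:
        src = f.read()
    # Leftover placeholders mean a section was never filled in
    if re.search(r"TODO|XXX|PLACEHOLDER|\[INSERT", src):
        failures.append("placeholder text in paper.tex")
    if src.count(r"\includegraphics") < 2:
        failures.append("fewer than 2 figures")
    return failures
```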

Command Design Tips

  1. Be explicit about inputs and outputs at each stage
  2. Include fallback data paths (data might be in different locations)
  3. Specify file formats: "Save as PDF and PNG at 300 DPI"
  4. Add quality gates at the end to catch errors
  5. Reference skill files at the top so they load first

8. Step 5: Configure the Project

your-project/
├── .claude/
│   ├── commands/
│   │   └── write-a-paper-using-the-data-in-the-folder.md
│   └── skills/
│       ├── jama-writing.md
│       ├── data-analysis.md
│       └── latex-build.md
├── workflow/
│   ├── data_profiler.py
│   ├── latex_builder.py
│   └── templates/
├── exam_paper/
│   └── data/          ← drop your dataset here
├── CLAUDE.md
└── Readme.md

The CLAUDE.md file provides project-level instructions that Claude Code reads automatically:

# CLAUDE.md
## Environment
- Python: conda activate new_deep_learning (Python 3.11)
- LaTeX: pdflatex and bibtex

## Key Rules
1. Never hardcode dataset-specific logic
2. All outputs go to exam_paper/
3. Figures: PDF + PNG at 300 DPI
4. Page limit: ≤10 pages
5. No fabrication: all numbers from actual analysis

9. Running the Workflow End-to-End

# 1. Navigate to your project
cd your-project/

# 2. Place your data
cp your-data/*.csv exam_paper/data/

# 3. Start Claude Code (with full permissions)
claude --dangerously-skip-permissions

# 4. Type the magic words
> Write a paper using the data in the folder

That's it. Here's what happens behind the scenes:

| Time  | Stage      | What Happens                                  |
|-------|------------|-----------------------------------------------|
| 0:00  | Setup      | Read all skill files into context             |
| 0:30  | Profile    | data_profiler.py → data_profile.json          |
| 2:00  | Research Q | AI formulates PICO hypothesis                 |
| 5:00  | Analysis   | Writes & runs analysis.py, generates figures  |
| 25:00 | Writing    | Fills JAMA template + PubMed citations        |
| 40:00 | Compile    | pdflatex → bibtex → pdflatex × 2              |
| 45:00 | QA         | 9-point quality checklist → paper.pdf ready   |

Our Exam Result

During the midterm, the workflow received a dataset about NIH grant funding disruptions (5,419 grants, 15 variables) and produced a complete 9-page JAMA-style paper, with figures and compiled PDF, in under 50 minutes.

10. Design Principles

Principle 1: Skills > Prompts

Detailed skill files are dramatically more effective than prompt engineering. Skills are persistent, modular, and testable.

Principle 2: Data Drives Everything

Zero hardcoded logic. The AI reads the data profile, then decides: research question, statistical method, figure count, section content. This makes the workflow truly generic.

Principle 3: Verify, Don't Trust

Every number must trace to an artifact: statistics come from analysis_report.json, references come from real PubMed lookups via MCP, and a final quality checklist gates the PDF before it ships.

Principle 4: Concrete Artifacts at Every Stage

Each stage produces an inspectable file. If Stage 4 fails, check Stage 3's JSON — no need to re-run everything.

Principle 5: One Command, One Pipeline

A single end-to-end command is easier to test, more reliable, and more impressive than multiple fragmented commands.

11. Bonus: Post-Paper Products

Automated Slide Deck

We use pptxgenjs to programmatically generate a 13-slide presentation with JAMA-inspired design:

npm install pptxgenjs
node exam_paper/create_slides.js
# → exam_paper/presentation.pptx

Automated Presentation Video

Combine slide images with AI-generated narration:

pip install edge-tts moviepy
brew install --cask libreoffice

python exam_paper/make_video.py
# → exam_paper/presentation.mp4

The pipeline: LibreOffice converts PPTX → images, edge-tts converts script → audio, moviepy combines them → MP4.

12. Troubleshooting

| Problem                  | Cause                        | Fix                                                  |
|--------------------------|------------------------------|------------------------------------------------------|
| AI gets stuck / loops    | Ambiguous skill instruction  | Make the skill more specific                         |
| LaTeX compilation fails  | Unescaped special characters | Check .log file, escape _ & % # $                    |
| AI fabricates statistics | Not reading from JSON        | Add "ALL stats from analysis_report.json" to command |
| Citations missing in PDF | bibtex didn't run            | Full 4-step compile: pdflatex → bibtex → pdflatex × 2 |
| Paper exceeds 10 pages   | Too much content             | Trim Discussion, move details to supplement          |

Conclusion

Building an automated research workflow is less about clever prompts and more about designing a system. The key insight: AI works best when you encode domain knowledge in structured skill files, define clear pipelines with concrete outputs, build generic tools, and verify everything.

Our workflow generated a 9-page JAMA paper in under 50 minutes during a live exam. The same workflow works on any public health dataset without modification.

The future of AI-assisted research isn't about replacing human judgment — it's about automating the tedious parts so researchers can focus on what matters: asking good questions and interpreting the answers.

Resources

Built with Claude Code, Python, LaTeX, and a lot of iterating. No AI was harmed in the making of this workflow (though several LaTeX compilation attempts were).