Model Mechanism

Learning state
updated 2026-06-21T22:45
sessions 4 · events 30
built 2026-06-21 22:45
Exit test — the deliverable
1/8
Overall mastery
7%
Concepts mastered
0/72
01

The inference loop

The spine. Everything hangs off this.

Tokenization (BPE) 63%
letter-counting failures · JSON corruption · multilingual cost asymmetry
1 resolved
Autoregressive loop 58%
sequential latency · streaming · can't revise emitted tokens
1 resolved
Logits → softmax → sampling 70%
nondeterminism · creativity · repetition loops
3 resolved
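
A minimal sketch of the logits → softmax → sampling step, assuming a made-up three-token vocabulary. The names `softmax`, `sample_token`, and the example logits are illustrative only and are not taken from any particular library. It shows how temperature reshapes the distribution, which is where both the nondeterminism and the "creativity" listed above come from.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Pick one token index; temperature 0 is treated here as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a three-token vocabulary
cool = softmax(logits, temperature=0.5)  # sharper: mass concentrates on the top token
warm = softmax(logits, temperature=2.0)  # flatter: lower-ranked tokens get more mass
```

With the same logits, `cool` puts more probability on the top token than `warm` does, so a low temperature tends toward repetition and a high one toward variety.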
02

Attention & the transformer

Conceptual, no QKV math.

Self-attention (intuition) 79%
O(n²) context cost · order sensitivity · in-context learning
Layers & residual stream 70%
composition with depth
1 resolved
Positional encoding / RoPE 63%
lost in the middle · position bias · context extension
Context window / KV cache 70%
context ≠ memory · cost scaling · context rot
03

Why it behaves that way

The insight layer. Most important.

Hallucination is structural 0%
confident wrong citations · confidence ≠ correctness
no data yet
In-context learning 0%
few-shot with no weight updates · induction
no data yet
04

Training, load-bearing parts only

Causal story, no math.

Pretraining 0%
knowledge cutoff · parametric vs retrieved knowledge
no data yet
Post-training (SFT → RLHF/DPO) 0%
assistant persona · refusals · sycophancy
no data yet
05

Variants & edge cases

Practical.

Reasoning / test-time compute 0%
scratchpad tokens · different cost/latency class
no data yet
Mixture of Experts 0%
fast for its size · routing nondeterminism
no data yet
Quantization 60%
smaller/faster/measurably dumber
Multimodal 0%
images → patch embeddings → tokens
no data yet
Why temp=0 isn't reproducible 0%
floating point · batching · MoE routing
no data yet
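
A small illustration of the floating-point part of this concept, under the assumption that two candidate tokens are nearly tied. Floating-point addition is not associative, so summing the same numbers in a different order (as different batching or kernel scheduling might) can change the last bits of a score. The values and the `greedy_pick` helper are hypothetical and chosen only to make the effect visible.

```python
def greedy_pick(scores):
    """Greedy (temp=0) decoding step: index of the highest score; the first wins a tie."""
    return max(range(len(scores)), key=lambda i: scores[i])

parts = (0.1, 0.2, 0.3)  # hypothetical partial sums that make up one token's score

# Same three numbers, two summation orders.
sum_left = (parts[0] + parts[1]) + parts[2]
sum_right = parts[0] + (parts[1] + parts[2])

rival = 0.6  # hypothetical score of a competing token

pick_left = greedy_pick([rival, sum_left])    # the summed score edges ahead
pick_right = greedy_pick([rival, sum_right])  # exact tie, so the rival is chosen
```

The two orders disagree only in the last bit, yet that is enough to flip the greedy choice; once one token differs, every later token is conditioned on a different prefix.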
06

L2 — Context & prompt engineering

Get the most from the window without touching weights.

Prompt structure & roles 0%
system/user/assistant roles · instruction placement · delimiters & formatting
no data yet
Few-shot / in-context examples 0%
example selection · format consistency · when examples beat instructions
no data yet
Chain-of-thought & decomposition 0%
scratchpad reasoning · task decomposition · when CoT helps vs hurts
no data yet
Structured output 0%
function calling / JSON mode · schema-constrained decoding · validity failure modes
no data yet
Context engineering 0%
what to put in the window · ordering vs lost-in-the-middle · retrieval vs stuffing
no data yet
Prompt injection (intro) 0%
untrusted input in the prompt · instruction-override risk
no data yet
Automated prompt optimization 0%
DSPy-style optimization · eval-driven prompt search · stop hand-tuning
no data yet
07

L3 — Retrieval / RAG

Knowledge from the corpus, not the weights.

Embeddings & vector space 0%
semantic similarity as distance · embedding models · dimensionality
no data yet
Chunking strategies 0%
size & overlap · semantic vs fixed · chunk boundaries
no data yet
Vector stores & ANN 0%
approximate nearest neighbour · index types · recall vs latency tradeoff
no data yet
Hybrid search 0%
BM25 + dense · when lexical beats semantic
no data yet
Reranking 0%
cross-encoders · two-stage retrieve-then-rerank · why a second stage
no data yet
Retrieval evaluation 0%
recall@k · MRR · context relevance
no data yet
RAG failure modes 0%
irrelevant retrieval · stale index · context dilution
no data yet
Document ingestion & parsing 0%
PDF/HTML/code parsing · OCR & tables · cleaning: garbage-in, garbage-out
no data yet
Query transformation 0%
HyDE · multi-query & decomposition · query rewriting
no data yet
Metadata filtering 0%
structured filters + semantic search · access control in retrieval
no data yet
Advanced retrieval patterns 0%
contextual retrieval · GraphRAG · parent-doc / small-to-big
no data yet
08

L5 — Evaluation

The differentiator. A lens, not a final topic — introduced early, applied everywhere.

The eval mindset 0%
offline vs online · measure before/after · no eyeballing
no data yet
Golden / eval-set construction 0%
coverage · edge cases · labeling quality
no data yet
LLM-as-judge 0%
rubric prompting · judge biases (position/verbosity/self-preference) · when judges fail
no data yet
Task metrics 0%
exact vs semantic match · rubric scoring · pass@k
no data yet
Regression testing / CI gates 0%
prompt/chain regression suites · CI gates · catching silent drift
no data yet
Online eval & monitoring 0%
production monitoring · drift detection · A/B testing
no data yet
Evaluating RAG & agents 0%
component vs end-to-end · trajectory evaluation · beyond single completions
no data yet
Build an eval harness 0%
30+ case harness · error bars · before/after deltas
no data yet
Synthetic data generation 0%
generating eval cases · augmentation & edge-case mining · data for tuning
no data yet
The eval data flywheel 0%
mine production traces into tests · failure-driven test growth · continuous eval-set expansion
no data yet
09

L4 — Agents & orchestration

Tools, loops, memory — and when a single call is better.

Tool use / function calling 0%
tools as agency · tool schemas · tool selection
no data yet
ReAct & plan-then-execute 0%
reason-act loops · plan then execute · when to stop
no data yet
State & memory 0%
working vs persistent memory · context is not memory at the app layer · when each is needed
no data yet
Multi-agent patterns 0%
decomposition across agents · when one agent is better · coordination cost
no data yet
Orchestration & MCP 0%
graphs / state machines (LangGraph) · MCP tool/context protocol · deterministic vs model-driven control
no data yet
Agent failure modes 0%
loops · runaway cost · error propagation · silent wrong-tool
no data yet
Human-in-the-loop & approvals 0%
approval gates before actions · confidence-based escalation · review queues
no data yet
10

L6 — Inference ops / production

Bounded latency & cost, graceful degradation, fully traced.

Latency budgeting & token accounting 0%
prefill vs decode · time-to-first-token vs total · token budgets
no data yet
Cost modeling 0%
per-request/user/at-scale · input vs output token cost · context cost scaling
no data yet
Streaming 0%
token streaming UX · partial parsing
no data yet
Caching 0%
prompt caching · semantic caching · when each applies
no data yet
Reliability 0%
retries/timeouts/fallbacks · circuit breaking · structured-output reliability at scale
no data yet
Rate limits & batching 0%
rate-limit handling · batching · throughput
no data yet
Observability / tracing 0%
spans · token + cost per span · tracing chains & agents
no data yet
Model selection & routing 0%
cascades: cheap-first, escalate · fallback chains · which model per call
no data yet
Serving open-weight models 0%
vLLM / TGI · self-host vs API tradeoff · throughput basics
no data yet
Deployment & CI/CD for AI 0%
prompt/chain versioning · staged rollout · gating deploys on evals
no data yet
11

L8 — Safety & guardrails

Adversarial input is the default, not the exception.

Prompt injection & jailbreaks (defense) 0%
direct vs indirect injection · defense patterns · isolating untrusted content
no data yet
Output validation & refusals 0%
schema enforcement · refusal handling · fail-closed
no data yet
PII & data governance 0%
PII detection/redaction · retention policy · logging hygiene
no data yet
Content moderation 0%
moderation layers · policy enforcement
no data yet
Adversarial robustness basics 0%
attack-surface mapping · red-teaming mindset
no data yet
Hallucination mitigation & grounding 0%
forced citations · abstention / I-don't-know · verification passes
no data yet
12

L7 — Adaptation / fine-tuning

Lowest priority for the app layer; knowing when NOT to is the skill.

Fine-tune vs RAG vs prompt 0%
the decision framework · cost/benefit framing · when each wins
no data yet
SFT and LoRA / PEFT (conceptual) 0%
what SFT changes · LoRA/PEFT idea (no math) · adapter swapping
no data yet
Preference tuning / DPO (conceptual) 0%
preference data · DPO vs RLHF idea (no math)
no data yet
Distillation 0%
teacher to student · why distill
no data yet
Data curation for tuning 0%
dataset quality · data beats technique
no data yet

Exit test — course complete at 4/4

  • Explain mechanistically why a model produces a confident, wrong citation
    Hallucination is structural · Logits → softmax → sampling · Pretraining
    3 concepts to go
  • Given two prompts, predict which costs more and why · 1 attempt
    Tokenization (BPE) · Self-attention (intuition) · Context window / KV cache
  • Explain why the same prompt at temp=0 returned two different answers
    Logits → softmax → sampling · Why temp=0 isn't reproducible
    2 concepts to go
  • Explain why last week's fact isn't in the model but works once pasted into context
    Pretraining · Context window / KV cache · In-context learning
    3 concepts to go
  • B1 (build): turn messy dev artifacts (logs, stack traces, API docs) into validated JSON at ~100%, no fine-tuning
    Structured output · Context engineering
    2 concepts to go
  • B2 (build): docs/code RAG over a real repo with a retrieval eval harness that proves a measured improvement
    RAG failure modes · Retrieval evaluation · Advanced retrieval patterns · The eval mindset
    4 concepts to go
  • B3 (build): a dev-tools agent (PR review / log triage) with human approval gates that recovers from tool failure
    Agent failure modes · Tool use / function calling · Human-in-the-loop & approvals · Prompt injection & jailbreaks (defense)
    4 concepts to go
  • B4 CAPSTONE: ship a cloud service from B2/B3 with model routing, caching, full tracing, a 30+ case eval gating CI, and a documented before/after metric
    Build an eval harness · Model selection & routing · Deployment & CI/CD for AI · Observability / tracing · Online eval & monitoring
    5 concepts to go

Bug patterns (tutor's read)

  • depth-as-procrastination · dormant

    treats 'I could go deeper here' as a reason to stay on a mastered concept; optimization bias operating on the syllabus

  • completion-seeking · dormant

    wants full coverage before moving on; control preference. Redirect to the exit test, not coverage

  • premature-convergence · dormant

    closes options / commits to an explanation before testing alternatives; efficiency over exploration