I built my own Socratic tutor instead of taking a course

The learning science behind it, why courses and even a raw chatbot get it wrong, and a theory of using AI to learn — built on a system that derives my mastery from evidence I can't fudge.

TL;DR

Most learning fails for one structural reason: the learner is both the player and the referee, and the referee is biased. You stop when material feels familiar — but familiarity isn't competence. So instead of taking a course on AI engineering, I built a Socratic tutor that probes me adversarially, refuses to let me self-report, and derives my mastery from an append-only log of evidence. This essay is the why: the learning science it's built on, why courses and even a raw chatbot get it wrong, a defense of every design choice, and the theory it converges on — the model is a tutor, not an oracle; you can't be your own referee; define done as a shipped, evaluated artifact.

I decided to learn AI engineering properly. The honest first instinct was to find a good course, watch it at 1.5×, nod along, and feel like I was making progress. I've done that before. I know how it ends: a folder of finished videos and the uneasy sense that I couldn't actually build the thing if you sat me down and asked.

So I didn't take the course. I spent a weekend building a tutor instead — a few hundred lines of dependency-free Python wrapped around a language model — and then I let it teach me, and grade me, for weeks. This is the argument for why that's not a gimmick. It starts with the single most uncomfortable fact about how people learn.

You are a biased referee

Here is the failure mode underneath almost every abandoned learning project. When you study, you are simultaneously the person doing the work and the person judging whether the work is done. And the judge has a broken instrument.

The instrument is fluency. When you re-read a chapter or re-watch a lecture, the material gets easier to process each pass. Your brain reads that increasing ease as understanding — "yes, I know this." But ease of recognition is not the same thing as the ability to produce the idea cold, unprompted, and apply it to a problem you haven't seen. Cognitive psychologists call this the illusion of competence, and it is brutally robust: the strategies that feel the most productive while studying (highlighting, re-reading) are among the least effective for durable learning, precisely because they maximize fluency while doing almost nothing for retrieval.

The consequence is that self-assessment runs hot. You feel ready before you are. And because you're the referee, you blow the whistle and move on. Nobody is in the room to say "you don't actually have this yet — show me again, differently."

The core defect isn't laziness or low intelligence. It's a measurement problem. The one number that should govern whether you move on — do I really understand this? — is generated by the one party with every incentive, conscious or not, to inflate it.

That single observation drove the entire design of what I built. If self-reported mastery is unreliable, then the system has to take the grade out of my hands.

What the learning research actually says to do

There's a tendency to treat "how to learn" as folk wisdom. It isn't — there's a substantial empirical literature, and it's surprisingly consistent. Four findings did the most to shape the design.

Retrieval beats review. One of the most replicated results in the science of learning is the testing effect: the act of pulling information out of your head (answering a question, reconstructing an argument) produces far more durable learning than putting information in again by re-reading. Testing isn't just measurement of learning; testing is learning. A tutor that quizzes you is not assessing on the side; the quiz is the mechanism.

Generation beats reception. Closely related: you remember what you produce better than what you're handed. Struggling to derive an answer — even getting it wrong first — beats being told the answer cleanly. This is why a tutor that explains too early is actively counterproductive: every time it hands you the conclusion, it robs you of the generation that would have made it stick.

Difficulty, in the right places, is desirable. Robert Bjork's notion of desirable difficulties — spacing practice out over time, interleaving topics, forcing recall before review — makes learning feel harder and slower in the moment while making it dramatically more durable. The implication is counterintuitive and important: a learning tool that optimizes for a smooth, frictionless, satisfying session is optimizing for the wrong thing. You want productive friction.

One-to-one tutoring is the gold standard, and the only problem was cost. In 1984 Benjamin Bloom published what's now called the 2-sigma problem: students tutored one-to-one, using mastery-based methods, performed about two standard deviations better than students in conventional classrooms — the average tutored student outperforming ~98% of the group-taught ones. Two sigma is an enormous effect. Bloom framed it as an open challenge: tutoring obviously works, but you can't afford a personal expert tutor for every learner, so how do you reproduce that effect at scale? For forty years the answer was "you mostly can't."
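The ~98% figure is arithmetic on the normal curve, and it is easy to check. A minimal sketch, assuming classroom scores are roughly normally distributed (an assumption of this example, not something taken from Bloom's paper); `percentile_at` is a hypothetical helper name:

```python
from statistics import NormalDist

def percentile_at(sigmas: float) -> float:
    """Percent of a standard normal population scoring below `sigmas` SDs above the mean."""
    return NormalDist().cdf(sigmas) * 100

# A student sitting two standard deviations above the classroom mean.
print(f"{percentile_at(2.0):.1f}%")
```

Under that assumption the result lands a little under 98, which is where the rounded figure comes from.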

Underneath all of it sits Vygotsky's zone of proximal development — the band of problems just beyond what you can do alone but within reach with help — and Ericsson's deliberate practice — improvement comes not from volume but from targeted effort at the edge of your ability, with immediate, specific feedback. Both require something most self-study lacks: a knowledgeable other who can sit exactly at your edge and push.

Hold these together and a specification falls out. Good learning is retrieval-heavy, generative, desirably difficult, mastery-gated, and one-to-one at the edge of your ability. Now look at how we actually try to learn.

Why the systems we have underdeliver

Almost every popular learning format violates that specification in a specific, diagnosable way.

System	What it optimizes	Where it breaks
Video courses / MOOCs	coverage and completion	passive reception, zero retrieval; "finished" measures watching, not knowing; one pace for everyone
Reading docs / books	breadth, reference	pure input; fluency illusion at its strongest
Tutorials / "follow along"	a working result on screen	you copy keystrokes; the result is the tutor's, not yours — tutorial hell
Flashcards / SRS (Anki)	long-term recall of facts	excellent for recognition and recall; weak on reasoning and transfer; you still grade your own card
Bootcamps / cohort courses	structure, accountability	better, but group-paced; the feedback isn't 1:1 or continuous

The pattern is that each one optimizes a proxy — videos watched, cards reviewed, modules completed — and the proxy drifts away from the thing you actually want, which is the ability to do the work cold. Completion is not competence. Coverage is not mastery. And almost none of them solve the referee problem: in every self-directed format, you remain the biased judge of your own progress.

This is the gap. It is not a content gap: the material is all free and abundant. It is a measurement and feedback gap: nobody is sitting at your edge, continuously, refusing to let you mistake familiarity for understanding.

That is exactly the gap a one-to-one tutor fills. Which brings us back to Bloom's 2-sigma, and to the only thing that has actually changed.

The obvious idea — and why a raw chatbot isn't it

The obvious idea: a language model is a patient, infinitely available, near-free expert that will talk to you one-to-one for as long as you want about almost anything. That is, on paper, the answer to the 2-sigma cost problem. The thing that made tutoring unscalable — a human expert's time — just got cheap.

But "open a chat window and ask it to teach you" produces a worse tutor than it looks, because the default behavior of a helpful assistant is the precise opposite of good pedagogy on almost every axis:

It explains before you retrieve. Ask a question, it answers — immediately, completely. That feels great and teaches little: it hands you the conclusion and skips the generation that would have made it stick.
It is sycophantic by default. These models are trained to be agreeable and encouraging. A vague, half-right answer gets a "great point!" and a gentle addition. That is the fluency illusion delivered by a machine — it tells you you're getting it when you aren't.
It will do the work for you. Stuck on a problem? It'll just solve it. Convenient, and it quietly removes the desirable difficulty that was the whole point.
It flatters your pace. It won't notice that you've been circling a comfortable topic to avoid a hard one. It has no model of your edge, and no spine to push you toward it.

So the raw tool, used naively, optimizes for a pleasant session — and a pleasant session is the wrong objective. The intelligence is necessary but not sufficient. The pedagogy has to be engineered on top of it. That engineering is the actual project.

What I built

The system is small on purpose: an append-only event log, a handful of pure functions that derive state from that log, and a system prompt that turns the model into a hostile-but-fair Socratic examiner. The learner — me — only ever sees a rendered dashboard. The loop looks like this:

[Figure: the learning loop. Every exchange produces one logged observation; mastery is derived, never declared. The self-report shortcut is deliberately disabled.]

Concretely, the pieces are:

A Socratic system prompt. It opens every topic with a symptom to diagnose, not a definition — "a model can't count the r's in 'strawberry'; reason out why." It probes before it explains. When it asks a question, it stops and waits — it is forbidden from asking and answering in the same breath. It pushes one level deeper before it believes me. When I'm wrong it says so plainly and shows the mechanism, with no softening.
An append-only evidence log. Every meaningful exchange appends one line: the concept, a quality score for that single exchange, any misconception opened or resolved, a note. The whole system runs as a Claude Code project, and there's exactly one write path: a single command that validates the entry, appends it, recomputes all derived state, regenerates the dashboard, and git-commits, atomically. Nothing is ever edited in place; the log is the source of truth, every exchange is an undoable step, and the whole of my state is a pure function of that log.
Mastery is derived, not declared. This is the load-bearing choice. I never set a mastery number, and neither does the tutor in any single moment. Mastery per concept is a recency-weighted average of the logged per-exchange observations. One good answer barely moves it; sustained ones do. A concept untouched for a couple of weeks gets flagged stale and re-probed.
"Done" is a shipped, evaluated build — not coverage. The finish line isn't a quiz score or a checklist of topics. It's a capstone: ship a real system with an evaluation harness and a documented before/after metric. The conceptual quizzes are table stakes; the bar is an artifact that works.
Named failure patterns. The tutor watches for specific avoidance moves — depth-as-procrastination (polishing a topic I've already got to avoid the next one), completion-seeking (wanting to "cover everything" before moving on), premature convergence (committing to the first explanation), theory-over-shipping (lingering in mechanism because it's more comfortable than building). When one fires, it names it out loud.
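To make the "one write path, state as a pure function of the log" idea concrete, here is a minimal sketch rather than the actual implementation. The JSON-lines format, the field names (`ts`, `concept`, `quality`, `opened`, `resolved`, `note`), the 0-to-1 quality scale, and the function names are all assumptions of this example; the dashboard and git-commit steps are left out.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema: the essay names the fields, not their keys or scales.
REQUIRED = ("ts", "concept", "quality", "note")

def validate(entry: dict) -> dict:
    """Reject malformed entries before they can reach the log."""
    missing = [key for key in REQUIRED if key not in entry]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not 0.0 <= float(entry["quality"]) <= 1.0:  # assumed 0..1 scale
        raise ValueError("quality must be between 0 and 1")
    return entry

def append_event(log: Path, entry: dict) -> None:
    """The only write path: validate, then append exactly one JSON line."""
    line = json.dumps(validate(entry), sort_keys=True)
    with log.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")

def derive_state(log: Path) -> dict:
    """Replay the whole log; the result depends on nothing but its lines."""
    state: dict = {}
    for raw in log.read_text(encoding="utf-8").splitlines():
        event = json.loads(raw)
        concept = state.setdefault(
            event["concept"], {"observations": 0, "open_misconceptions": set()}
        )
        concept["observations"] += 1
        if event.get("opened"):
            concept["open_misconceptions"].add(event["opened"])
        if event.get("resolved"):
            concept["open_misconceptions"].discard(event["resolved"])
    return state

# Made-up example entries, written to a throwaway directory.
log = Path(tempfile.mkdtemp()) / "events.jsonl"
append_event(log, {"ts": "2025-01-06T10:00:00+00:00", "concept": "sampling-temperature",
                   "quality": 0.2, "opened": "high-temperature-sharpens",
                   "note": "claimed the top token goes to 1 as T grows"})
append_event(log, {"ts": "2025-01-13T10:00:00+00:00", "concept": "sampling-temperature",
                   "quality": 0.9, "resolved": "high-temperature-sharpens",
                   "note": "derived the flattening cold"})
print(derive_state(log))
```

Because `derive_state` reads nothing except the log, deleting every derived file and replaying gives back the same state, which is what makes the log safe to treat as the source of truth.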

Why every one of those choices is the pedagogy

None of that is arbitrary. Each piece maps directly onto a finding from the research, and most of them exist specifically to defeat a default failure of the raw model.

Problem-first and probe-before-explain are the testing and generation effects, operationalized. The tutor refuses to hand me a conclusion I could have produced myself, because the producing is where the learning is.

Ask, then stop is the most important rule and the one a chatbot violates by reflex. A model's instinct is to ask a rhetorical question and immediately answer it. That turns a retrieval opportunity back into passive reception. Forcing the model to ask and halt is what preserves the generation.

Adversarial, no softening is a direct countermeasure to sycophancy. The default model rewards a vague answer; this one rejects it. When I got caught in a precise error and tried to restate it as something vaguer and defensible — a motte-and-bailey retreat — the tutor named the move instead of accepting the softer claim. That refusal is the difference between a tutor and a flatterer, and it's the difference between learning and feeling like you learned.

Derived mastery is the cure for the biased referee. I cannot inflate the number because I never touch it; the tutor can't inflate it in a moment of generosity because no single exchange is decisive — mastery is a sustained average. The one metric with every incentive to run hot has been taken out of both of our hands and handed to a pure function over recorded evidence. This is the whole reason the system can be trusted at all.
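One way a derived score like this could be computed, as a sketch under stated assumptions: the essay specifies a recency-weighted average but not the weighting, so the exponential half-life, the 0-to-1 quality scale, and the zero-scored prior (added here so that a single answer cannot be decisive) are choices of this example, not the system's confirmed formula.

```python
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 7.0  # assumed: an observation loses half its weight each week
PRIOR_WEIGHT = 3.0    # assumed: weight of a "no evidence yet" prior scored 0

def mastery(observations, now):
    """Recency-weighted average of (timestamp, quality) pairs, quality in [0, 1]."""
    weighted_sum = 0.0
    total_weight = PRIOR_WEIGHT  # the prior scores 0, so it only adds weight
    for timestamp, quality in observations:
        age_days = (now - timestamp).total_seconds() / 86400
        weight = 0.5 ** (age_days / HALF_LIFE_DAYS)
        weighted_sum += weight * quality
        total_weight += weight
    return weighted_sum / total_weight

now = datetime(2025, 1, 20, tzinfo=timezone.utc)  # made-up date
one_good_answer = [(now, 1.0)]
sustained = [(now - timedelta(days=d), 1.0) for d in range(5)]

print(round(mastery(one_good_answer, now), 2))  # 0.25
print(round(mastery(sustained, now), 2))
```

With these made-up constants, one perfect answer lifts an empty concept only to 0.25, while five perfect answers on consecutive days lift it noticeably higher. Neither the learner nor the tutor sets the number directly; it only moves when more evidence is logged.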

Recency-weighting and the stale flag build desirable difficulty into the structure: the mastery estimate decays in the log much as the knowledge itself decays in my head, so the system spaces and re-tests on its own instead of letting a once-strong answer stand forever.
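The stale flag itself can be very small. A sketch, assuming "a couple of weeks" means 14 days and that the latest timestamp per concept has already been pulled from the log; `stale_concepts` and the concept names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=14)  # assumed reading of "a couple of weeks"

def stale_concepts(last_seen: dict, now: datetime) -> list:
    """Concepts whose latest logged exchange is older than STALE_AFTER, oldest first."""
    overdue = [(seen, concept) for concept, seen in last_seen.items()
               if now - seen > STALE_AFTER]
    return [concept for _, concept in sorted(overdue)]

now = datetime(2025, 2, 1, tzinfo=timezone.utc)  # made-up dates throughout
last_seen = {
    "tokenization": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "attention": datetime(2025, 1, 28, tzinfo=timezone.utc),
}
print(stale_concepts(last_seen, now))  # ['tokenization']
```

Returning the oldest concept first gives the next session an obvious starting point for a cold re-probe.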

Tracking misconceptions is deliberate practice made concrete — it aims the next session at my actual edge, the specific thing I get wrong, rather than the comfortable middle I'd drift toward.

Naming the avoidance patterns is metacognitive scaffolding. Half of self-directed learning is managing your own avoidance, and you can't manage what you can't see. Giving each evasion a name drags it into view; it's much harder to hide in "let me just go deeper here" once a system is willing to call it procrastination to your face.

Defining done as a shipped, evaluated artifact is constructive alignment: line the assessment up with the outcome you actually want, because the assessment quietly defines what gets learned. If the goal were a quiz score, I'd learn to pass quizzes. By making the goal a working, measured system, the assessment is the competence, and there's no proxy left to game. It also happens to be the exact discipline of the work I keep being drawn to: evaluation-driven engineering. The method and the subject matter are the same shape.

The reason this isn't just "a chatbot with a strict prompt" is the evidence log. The prompt makes the conversation good; the log makes the judgment honest. Without externalized, append-only evidence and derived mastery, I'm back to being my own referee — just with a more articulate study partner.

Does it work? Here it is, catching me

The honest test of a learning system isn't whether it feels good. It's whether it catches you being wrong in ways you'd otherwise skate past. Two real examples from my own log — which is public, weak spots and all:

The temperature wobble. Working through how sampling temperature shapes a model's output distribution, I confidently claimed that as temperature rises toward infinity, the top token's probability climbs toward 1. That's backwards. High temperature flattens the distribution toward uniform; it's low temperature, toward zero, that spikes the top token to 1. Worse, my wrong claim contradicted a property I'd accepted minutes earlier — that the ranking of tokens is preserved. A sycophantic tutor says "close!" and moves on. This one stopped, showed me the mechanism that made it wrong, and logged the misconception. About a week later it re-probed the same spot cold, with no scaffolding, and that time I derived it correctly and unprompted. That's the entire loop in one arc: a wobble caught, named, logged, spaced, and re-tested until it actually stuck — not until it felt stuck.
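The mechanism is easy to see numerically. With logits z and temperature T, the sampling distribution is p_i = exp(z_i / T) / sum_j exp(z_j / T). A minimal sketch with made-up logits; the function name is illustrative:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up values for three candidate tokens
for temperature in (0.1, 1.0, 100.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At T = 0.1 the top token takes almost all of the probability mass; at T = 1 it holds roughly two thirds; at T = 100 the three tokens sit close to one third each. Dividing every logit by the same positive T never reorders them, so the ranking is preserved at every temperature.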

The relapse it caught twice. The single most load-bearing idea in the platform work I keep gravitating toward is that a model's knowledge lives in its frozen weights — the training corpus itself is gone at inference. Twice, on different concepts, I reached for the training data as if it were present at runtime ("it recalls from the training data"). Both times the tutor flagged it as the same recurring misconception, not a fresh one — because the log remembered the first occurrence. This is the thing a fresh chat window structurally cannot do: a new session has no memory of last week's error, so it can't tell a one-off slip from a pattern you keep relapsing into. The evidence log can. That's the line between a study buddy and an examiner.
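Telling a relapse from a one-off slip only needs a lookup over what the log already holds. A sketch, assuming each logged event may carry a `misconception` tag; the tag strings, concept names, and `classify_misconception` are hypothetical:

```python
def classify_misconception(events, tag):
    """Label a newly observed misconception using only what the log already holds."""
    earlier = [e["concept"] for e in events if e.get("misconception") == tag]
    return {"status": "recurring" if earlier else "fresh", "seen_under": earlier}

# Made-up log excerpt; concept names and tags are illustrative.
events = [
    {"concept": "in-context-learning", "misconception": "training-data-at-runtime"},
    {"concept": "sampling-temperature", "misconception": "high-temperature-sharpens"},
]

print(classify_misconception(events, "training-data-at-runtime"))
print(classify_misconception(events, "weights-update-at-inference"))
```

A fresh chat session would call this with an empty list every time, so every error would come back "fresh". The persistent log is what lets the second occurrence be named as a pattern.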

And because mastery is derived, none of this is flattering. The dashboard shows concepts mid-progress, misconceptions still open, the places I'm loose. I cleared the most platform-relevant checkpoint (predict which of two prompts costs more, and why), but "cleared" meant reconciling per-token billing, which is linear in token count, against the compute underneath it, where self-attention grows quadratically with sequence length, under questioning, logged as evidence. Not me ticking a box. The system is built so that neither of us can tick the box; the only way a concept goes green is sustained, recorded performance.
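The tension in that checkpoint can be shown with a toy comparison. This is a sketch under loud assumptions: the prices are invented, `attention_pairs` counts only the query-key pairs that causal self-attention scores per head per layer, and it ignores the parts of the model whose cost is linear in tokens, as well as caching.

```python
def billed_cost(input_tokens: int, output_tokens: int,
                input_price: float, output_price: float) -> float:
    """Bill in currency units, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def attention_pairs(n_tokens: int) -> int:
    """(query, key) pairs scored by causal self-attention over n tokens."""
    return n_tokens * (n_tokens + 1) // 2

IN_PRICE, OUT_PRICE = 3.0, 15.0  # hypothetical prices per million tokens

short_prompt, long_prompt = 1_000, 4_000  # lengths in tokens, no output counted
bill_ratio = (billed_cost(long_prompt, 0, IN_PRICE, OUT_PRICE)
              / billed_cost(short_prompt, 0, IN_PRICE, OUT_PRICE))
pair_ratio = attention_pairs(long_prompt) / attention_pairs(short_prompt)
print(f"bill grows about {bill_ratio:.0f}x, attention pairs about {pair_ratio:.0f}x")
```

A prompt four times longer is billed about four times as much, while the attention work in this simplified count grows about sixteen times. Explaining why the bill tracks the linear number rather than the quadratic one is the reconciliation the checkpoint asks for.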

This is also why the dashboard is public. The whole premise of learning in the open is that the unflattering parts — the open misconceptions, the concepts still mid-climb — are the most honest signal there is. A progress page with no weak spots on it isn't a learning record; it's marketing.

A theory of using AI to learn

Generalize past my specific setup and a small, opinionated theory falls out. If you want to use a language model to genuinely learn something — not to feel productive — these are the load-bearing principles.

The model is a tutor, not an oracle. Its value isn't the answers it can hand you; it's the questions it can ask you. Configure it to interrogate, not to inform. The moment it's doing the producing, you've stopped learning.
Engineer the adversary. A helpful assistant's defaults — agreeable, explanatory, eager to solve — are pedagogically backwards. You have to deliberately instruct it to withhold, to push, to refuse vague answers, to tell you you're wrong. Sycophancy is the enemy of calibration.
You cannot be your own referee. Self-reported understanding is the one measurement guaranteed to be biased. Externalize the evidence and derive the verdict from it; never let yourself (or the model, in a generous moment) simply declare mastery.
Optimize for retrieval and transfer, not recognition. The test of knowing isn't whether the material looks familiar; it's whether you can produce and apply it cold. Design every interaction around pulling things out of your head, not putting them back in.
Define "done" as a shipped, evaluated artifact. Coverage and completion are proxies that drift. Anchor the finish line to something that either works or doesn't, with a number attached. The assessment will become the real curriculum, so make the assessment the real thing.
Build the tool, don't just use the chat. The deepest move is that constructing the measurement system is itself the hardest, most clarifying part of the learning. You can't build an honest evaluator for a domain you don't understand. The tool is both the method and your first real project in the field.

The takeaway

The thing that changed isn't that AI knows things; the internet already knew things. It's that one-to-one, mastery-based tutoring, one of the most effective forms of instruction ever measured, just stopped being expensive. But a raw model is a flatterer, not a tutor. You get the 2-sigma effect only if you engineer the pedagogy on top of it, and the most honest way to do that is to take the grade out of your own hands and derive it from evidence you can't fudge.

There's a neatness to where this landed that I didn't plan. I set out to learn AI engineering, and the most effective way to do it turned out to be building an AI system — one whose entire premise is evaluation: logging evidence, deriving a metric, gating "done" on a measured result. That's not a study aid I bolted onto the real work. It's a small, working instance of the exact discipline I'm trying to learn. The fastest way into a field really is to build the tool you wish existed in it — and then let it hold you to a standard you wouldn't hold yourself to.