This is part 2 of a two-part series. Part 1 covered what BILBO is, how it's put together, and the tradeoffs with measured numbers. This post is about what I learned — ML lessons, the bugs that cost me a morning each, and what working with a coding agent actually felt like.
If you haven't read part 1, the short version is: BILBO is a local-only baby monitor running on a 2014 Mac mini, with a 3-stage MobileNetV3 cascade that decides whether the baby is awake, and a Flask dashboard for reviewing and correcting the model's mistakes. It cost ~$40 in hardware. The rest of this post assumes you've seen that context.
I wanted to split this out from the build write-up because the lessons are the part I'd have actually wanted to read before starting, and they kept getting buried under the architecture diagrams. So here they are on their own, in roughly the order I learned them.
Lessons about ML practice
Real ML projects are 10% choosing models and 90% choosing tradeoffs. These are the tradeoffs that mattered.
Precision vs recall is not a single dial — it's a different dial per alert
The temptation is to sweep the model's confidence threshold along the precision-recall curve and pick the single "best" operating point. That's wrong. Different alerts need different operating points.
For wake detection (the Telegram ping when the baby seems to be waking up), I weight recall heavily. Missing a real wake event is much worse than getting a false alarm — I'd rather check the dashboard once for nothing than not check it when the baby is crying. So the wake alert is gated on a relatively loose 2-of-3 rule.
For an edge alert I tried later (more on that in a minute), I weighted precision heavily. False alarms degrade trust in the entire system — a baby monitor that pings me for nothing twice a day will eventually be muted, and once it's muted it might as well not exist. So that alert needed precision near 1.0 even if recall suffered.
The lesson: don't think about "the model's threshold." Think about each downstream alert's cost matrix and pick the threshold that fits. The model is the same; the threshold is the policy.
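To make "the threshold is the policy" concrete, here's a minimal sketch of per-alert operating points over the same model. The `AlertPolicy` type and the specific numbers are illustrative, not BILBO's actual code; only the 2-of-3 wake rule comes from the text above.

```python
# Sketch: one model, one confidence stream, a different policy per alert.
# AlertPolicy and all numeric values are illustrative, not BILBO's real code.
from dataclasses import dataclass

@dataclass
class AlertPolicy:
    threshold: float   # minimum model confidence to count a frame as positive
    min_hits: int      # how many frames in the window must be positive
    window: int        # number of recent frames considered

# Recall-heavy: loose threshold, 2-of-3 voting, tolerates false alarms.
WAKE = AlertPolicy(threshold=0.6, min_hits=2, window=3)
# Precision-heavy: strict threshold, unanimous voting, tolerates misses.
EDGE = AlertPolicy(threshold=0.95, min_hits=3, window=3)

def should_fire(policy: AlertPolicy, recent_confidences: list[float]) -> bool:
    """Apply one alert's policy to the shared stream of model confidences."""
    window = recent_confidences[-policy.window:]
    hits = sum(c >= policy.threshold for c in window)
    return hits >= policy.min_hits
```

The same three confidences can fire the recall-heavy alert and stay silent on the precision-heavy one, which is the whole point: the model output never changes, only the policy applied to it.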
The model gets the obvious cases. The edge cases are 100% of what matters
My first few classifiers had numbers that looked great in aggregate. 92% accuracy! 90% F1! Then I'd look at where they were wrong, and they were always wrong on:
- Partially closed eyes (the baby blinking, half-asleep)
- Faces in deep shadow (one side of the face dark, other side lit)
- Motion blur (baby just moved their head)
- Profile views (baby turned 90 degrees, only one eye visible)
- Unusual angles (baby flopped sideways with face partially against the mattress)
These are the cases that matter most. The "easy" cases — eyes wide open looking at the ceiling, eyes squeezed shut while sleeping — are where the cloud API agreed with my model 100% of the time. They were wasted training signal. The hard cases were the actual product.
The lesson here is something I now believe deeply: aggregate metrics lie about hobby ML projects. A confusion matrix on a held-out test set is fine for a paper, but the test set is biased toward the easy cases by construction. The only way to actually know how your model is doing is to look at every frame it got wrong and ask "would I have gotten this right if I were squinting at it?"
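The "look at every frame it got wrong" loop is easy to automate. Here's a hypothetical sketch; the paths, the `predict` callable, and the filename scheme are placeholders I'm inventing for illustration, not BILBO's real review tooling.

```python
# Hypothetical sketch of an error-review dump: copy every misclassified
# frame into one folder so you can eyeball all of them at once.
# Paths and the predict() callable are placeholders, not BILBO's real API.
import shutil
from pathlib import Path

def dump_errors(frames, labels, predict, out_dir="review/wrong"):
    """Copy each misclassified frame to out_dir, encoding truth vs.
    prediction in the filename so the folder reads as a confusion report."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wrong = []
    for path, truth in zip(frames, labels):
        pred = predict(path)
        if pred != truth:
            shutil.copy(path, out / f"{truth}_as_{pred}_{Path(path).name}")
            wrong.append(path)
    return wrong
```

Sorting the folder by filename groups the errors by confusion type, which is usually enough to spot that, say, every `eyes_open_as_eyes_closed` frame is a half-blink.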
Performance vs accuracy is usually a fake tradeoff
In part 1 I showed the ablation table: bigger models and larger inputs gave me no measurable gain on eye-state F1, at 2–3× the latency. I want to restate the lesson behind those numbers directly, because it's the one I most wish someone had told me when I started:
When your accuracy is plateauing, the answer is almost never "bigger model." The answer is usually "give the model better inputs." Crop tighter. Train on the actual hard cases. Add a preprocessing step. Bigger models are the most expensive way to fix a data problem.
On BILBO, the thing that unstuck eye-state accuracy wasn't model capacity — it was adding a face detector so the eye-state classifier saw a 40-pixel face crop instead of an 8-pixel-tall eye region in a wide bassinet view. Same small model, better input, big accuracy win.
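The "better inputs" fix is mostly crop arithmetic. A minimal sketch, where the padding factor and clamping behavior are illustrative choices rather than BILBO's tuned values:

```python
# Sketch of the preprocessing fix: crop the frame to the detected face
# (plus padding) before classification, so the eye-state model sees a
# face-dominated crop instead of a wide bassinet view.
# The pad factor is an illustrative choice, not BILBO's tuned value.

def face_crop_box(box, img_w, img_h, pad=0.25):
    """Expand a detected face box (x, y, w, h) by `pad` of its size on
    each side, clamped to the image bounds. Returns (x0, y0, x1, y1)."""
    x, y, w, h = box
    px, py = int(w * pad), int(h * pad)
    x0 = max(0, x - px)
    y0 = max(0, y - py)
    x1 = min(img_w, x + w + px)
    y1 = min(img_h, y + h + py)
    return x0, y0, x1, y1
```

A 40-pixel face in a 640x480 frame becomes a ~60-pixel padded crop that the classifier then sees at its full input resolution, instead of an 8-pixel-tall eye region lost in the wide shot.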
A pre-trained model from someone else's distribution is a starting point, not a solution
The biggest single accuracy win in the project was replacing YuNet (a pre-trained face detector) with a custom MobileNetV3 trained on ~780 hand-labeled frames from my own bassinet. YuNet was detecting faces on 53% of baby-present frames; my custom detector hits 100% on the validation set. The table is in part 1.
The first version of my pipeline used YuNet because it was free and it was already there. Detection rate was 53% on baby-present frames and I spent days debugging the eye-state model before realizing it was getting empty crops half the time. The eye-state model wasn't broken — it was being fed garbage inputs from a face detector that had never seen a baby in IR.
The lesson is structural: if your target distribution is different from the one your pre-trained model was trained on, domain-specific data wins, even if your dataset is tiny. "Indoor IR night vision of a swaddled infant" is very different from "daytime portraits of adults." ~780 examples from the right distribution beat millions from the wrong one. The training distribution is the model, in a real sense — the weights are just how that distribution is stored.
Class imbalance is the silent killer of small ML projects
I tried adding a third class to the eye-state classifier — face_not_visible — for the cases where the baby was face-down against the mattress. Reasonable from a product standpoint. Disastrous from an ML standpoint:
| Eye-state classifier | Classes | Examples of rarest class | Macro F1 |
|---|---|---|---|
| 2-class (kept) | eyes_open, eyes_closed | hundreds each | 0.71 at the time (later 0.97) |
| 3-class (deleted) | + face_not_visible | ~36 | 0.50 |
The class-weighted loss was so extreme that the model started predicting the rare class everywhere, dragging macro F1 from 0.71 down to 0.50. The fix was to delete the third class entirely and use a confidence threshold on the 2-class model: if the eye-state classifier isn't sure, fall back to "I don't know" and let the cloud API handle it. Cleaner, simpler, better metrics.
The lesson: don't add a class you can't reliably get 100+ examples of. Even sklearn's class_weight="balanced" can't save you when one class is 10× rarer than the others. Either gather more data or fold the rare class into "other" and handle it with a fallback path.
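The "confidence threshold instead of a third class" design fits in a few lines. A sketch, where the 0.8 cutoff is an illustrative number rather than BILBO's tuned value, and `p_open` stands in for the 2-class model's softmax output:

```python
# Sketch of the 2-class + fallback design that replaced the 3-class model.
# p_open stands in for the classifier's softmax probability of eyes_open;
# the 0.8 cutoff is illustrative, not BILBO's tuned value.

def eye_state(p_open: float, min_confidence: float = 0.8) -> str:
    """2-class decision with an explicit 'unknown', instead of a third
    class trained on ~36 examples."""
    p_closed = 1.0 - p_open
    if p_open >= min_confidence:
        return "eyes_open"
    if p_closed >= min_confidence:
        return "eyes_closed"
    # Not confident either way: defer to the cloud API fallback path
    # rather than force a guess from a data-starved rare class.
    return "unknown"
```

The rare situation still gets handled, but by a control-flow branch with zero trainable parameters instead of a class the model can't learn.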
What went wrong
The truth is most of what I learned came from things going wrong. A few of the highlights:
The deepcopy bug. I selected the "best epoch" by validation loss and saved the weights with best_state = model.state_dict().copy(). PyTorch's state_dict().copy() is a shallow copy — the dict gets a fresh outer container, but the tensor values inside it are still references to the live Parameter.data. Every subsequent optimizer.step() silently mutated the "saved" snapshot. Eight model versions went into production with last-epoch weights instead of best-epoch weights. None of them were dramatically broken, but the validation metrics in my training log described models that were never deployed. The fix was copy.deepcopy(model.state_dict()). The cost was an entire morning and a permanent paranoia about what my training metrics actually mean.
The schema migration drift. Mid-project I renamed a database column from shadow_birdeye_state to shadow_birdeye_present. I migrated the table. I updated the writer code. I forgot to update one of the readers (the dashboard's safety stats endpoint) and it silently broke for two days because the dashboard's failure mode was "show stale data" rather than "throw an error." I noticed only when I checked the dashboard for an unrelated reason and the numbers looked frozen in time. The general lesson: every column rename is a contract change with every consumer of that column, and SQLite will not help you find them. I now lean toward keeping legacy column names forever and routing writes through a single helper that knows the truth.
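The "single helper that knows the truth" pattern looks something like this sketch. Table and field names are illustrative; the point is that only one module ever spells the physical column name, so a rename (or a decision never to rename) touches exactly one place.

```python
# Sketch of routing all reads/writes of a column through one helper.
# Table and column names are illustrative, not BILBO's real schema.
import sqlite3

LEGACY_COLUMN = "shadow_birdeye_state"  # physical name, kept forever

def record_presence(conn: sqlite3.Connection, frame_id: int, present: bool) -> None:
    """The only code path that writes the presence column."""
    conn.execute(
        f"UPDATE frames SET {LEGACY_COLUMN} = ? WHERE id = ?",
        (int(present), frame_id),
    )

def read_presence(conn: sqlite3.Connection, frame_id: int) -> bool:
    """The only code path that reads it -- consumers call this, never SQL."""
    row = conn.execute(
        f"SELECT {LEGACY_COLUMN} FROM frames WHERE id = ?", (frame_id,)
    ).fetchone()
    return bool(row[0])
```

Every consumer that goes through the helper survives a rename for free; every consumer with its own `SELECT` string is a silent breakage waiting to happen.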
Trusting cloud API labels as ground truth. Early on I treated GPT-4o's labels as the source of truth for backtest evaluation. I spot-checked them eventually and found about 5% of "asleep" labels were wrong — usually because the baby had momentarily closed their eyes during a partial shot. Training my model to match the cloud API exactly was teaching it to copy a 5% error rate. The fix was a strict ground-truth definition: only frames I'd manually reviewed or explicitly corrected count as evaluation truth. Raw cloud API labels are training signal, not ground truth. I cannot stress enough how easy it is to fool yourself on this if you don't write the rule down.
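Writing the rule down can literally mean writing it down as a filter. A sketch, with hypothetical field names standing in for whatever your label records carry:

```python
# Sketch of the ground-truth rule as code: a frame counts as evaluation
# truth only if a human reviewed or corrected it. Field names are
# hypothetical illustrations, not BILBO's real schema.

def is_ground_truth(frame: dict) -> bool:
    return bool(frame.get("human_reviewed") or frame.get("human_corrected"))

frames = [
    {"id": 1, "label": "asleep", "source": "cloud_api"},
    {"id": 2, "label": "awake", "source": "cloud_api", "human_corrected": True},
    {"id": 3, "label": "asleep", "source": "cloud_api", "human_reviewed": True},
]

# Raw cloud labels (id 1) are training signal; only ids 2 and 3 may be
# used to score the model.
eval_set = [f for f in frames if is_ground_truth(f)]
```

Once the rule is a function, it's hard to accidentally evaluate against the wrong labels: the eval loader just refuses them.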
The geometric edge alert. This one is my favorite failure because the data was so unambiguous. I wanted a "baby is pressed against the bassinet side" alert but I didn't have a classifier for it, so I tried to derive it geometrically: if the face bounding box is in the bottom 30% of the bassinet crop AND presence confidence is high AND 2-of-3 recent frames also satisfy this, fire the alert. It seemed like a reasonable heuristic. I backtested it against 7 days of cloud API labels and got:
| Metric | Result |
|---|---|
| Recall | 79% |
| Precision | 6% |
| False positives per true positive | ~15 |
After parameter sweeping, the best operating point I could find was recall 52%, precision 11%. Still alarm fatigue. I deleted the heuristic and opened a GitHub issue for a proper trained classifier. A data-driven negative result isn't a failure; it's information, and deleting the heuristic was as much a decision as shipping it would have been.
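For the record, here's a reconstruction of the deleted heuristic. The bottom-30% band and the 2-of-3 window come from the description above; the presence-confidence cutoff is an illustrative number. This is the rule that backtested at 6% precision.

```python
# Reconstruction of the deleted geometric edge heuristic. The bottom-30%
# band and 2-of-3 window match the prose; the 0.9 presence cutoff is an
# illustrative stand-in for "presence confidence is high".

def edge_alert(recent: list[tuple[float, float]]) -> bool:
    """recent: (face_y_frac, presence_conf) per frame, newest last.
    face_y_frac is the face box's vertical position as a fraction of the
    bassinet crop height (1.0 = bottom edge)."""
    def frame_fires(face_y_frac: float, presence_conf: float) -> bool:
        # Face box in the bottom 30% of the crop AND presence is confident.
        return face_y_frac >= 0.70 and presence_conf >= 0.9
    window = recent[-3:]
    hits = sum(frame_fires(y, c) for y, c in window)
    return hits >= 2  # 2-of-3 recent frames
```

Nothing about the code is wrong; the geometry just isn't predictive enough, which is exactly what the backtest said.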
The "head crop seeded from cloud API" idea. Before I had a face detector, I tried to seed the on-device model's head crop from the cloud API's reported head position on previous frames. Babies move. At the 4-minute capture cadence I was running back then, by the time the next frame arrived the "head crop" was often an empty pillow, the baby's hip, or a corner of the bassinet rail. I should have realized this from first principles — babies move on a much faster timescale than four minutes — but I had to actually look at the frames to see it. (This is one of the reasons I eventually dropped to a one-minute cadence and trained a proper face detector instead of faking it from stale coordinates.)
Why this is a great ML project for engineers
If you've never deployed an ML model end-to-end, I think a project like this beats almost anything else as a learning vehicle. Three reasons:
The ground truth is sitting next to you. Most ML problems have ambiguous labels — "is this comment toxic," "is this transaction fraud" — and you have to argue with reviewers about what's correct. With a baby monitor, when the model says "asleep," you can literally look across the room and verify it. Every prediction has an immediate, unambiguous oracle. That's a luxury you almost never have in production ML, and you should use it to build deep intuition for your model's failure modes.
The feedback loop is fast. A capture every minute means a new test case every minute. The dashboard correction workflow means every wrong prediction can become a training example in seconds. You can iterate on your data (which is where the wins are) at a speed you'd never get in a corporate ML environment with weekly retraining cycles.
You feel every tradeoff in your bones. When your model misses a wake event, you find out about it because the baby is crying and your phone is silent. That's a much more visceral teacher than a degraded F1 score in a Slack alert. Every threshold tuning decision is a real decision with a real cost. Every "should I add more training data or just deploy what I have" question has a clear answer because you're the user.
The other thing this project gives you, which I didn't expect, is a forcing function for engineering discipline. When your model is shipping to a customer that's literally your child, you don't take shortcuts. You write the validation rigor you should have been writing all along. You build the dashboard you should have been building all along. You learn to be paranoid about your training metrics, your label quality, your test set leakage, your deployment pipeline. None of this is glamorous, and none of it is taught well in tutorials. It's exactly the stuff that separates "I trained a model" from "I shipped a model."
If you've been wanting to "do an ML project" and you don't know what to build, build something whose ground truth is sitting in your house.
The role of coding agents
I want to be honest about something that would have been impossible to write three years ago: most of this project was built collaboratively with a coding agent. I am not going to pretend that I did this alone, because the current generation of coding agents is genuinely a different gear for hobby projects, and I think more people should know.
What an agent did better than I would have alone
Boilerplate at the speed of thought. Setting up the Flask dashboard, writing the SQLite schema, implementing the Telegram bot integration, adding launchd plists, wiring argparse flags into a CLI — that's all "sit down with the docs for an hour" work that the agent does in minutes. The bottleneck shifts from "how do I write this" to "what do I want this to do."
Parallel debugging investigations. When I hit the schema migration drift bug, the agent searched every consumer of the renamed column across the codebase in a single pass, found the dashboard endpoint that was still on the old name, and correlated it with the silently-broken response shape — in one shot. I'd have done this manually with three or four grep runs and probably gotten it wrong the first time.
The discipline I skip when tired. Writing thorough commit messages, checking that documentation matches code, sanity-checking my own reasoning by spot-checking the actual data — these are the things I cut corners on at midnight. The agent doesn't get tired. Asking it to "validate this assumption against the actual database" before I commit a fix saved me from shipping the geometric edge alert with bad numbers, more than once.
Honest framing of options. When I wanted to ship the geometric edge alert and was emotionally invested in it working, the agent ran the backtest, returned the precision/recall numbers without flinching, and told me the rule was too noisy to ship. That's the kind of pushback I need from a collaborator and the kind I'd never get from a traditional Stack Overflow answer.
What an agent is NOT good at, even now
Knowing when your validation data is bad. The agent will compute whatever metric you ask it for, against whatever data you point it at. It can't tell you that your "ground truth" is actually the same source as your training labels and your accuracy is therefore meaningless. That's a judgment call that needs a human in the loop. The cloud-API-labels-as-ground-truth mistake I made would have been just as invisible to the agent as it was to me until I explicitly said "spot-check these and see if you agree."
Knowing when to stop iterating. The agent will happily keep tuning hyperparameters or trying new architectures forever. The "this is good enough, ship it" decision is yours. It needs taste and product judgment that doesn't fit in a context window.
Telling you that the right answer is to delete your work. When the geometric edge alert backtest came back terrible, the agent did flag it — but I had to be the one to say "OK, delete this entire branch." The agent's bias is toward shipping the thing it just helped you build, not toward telling you the thing shouldn't exist. If you built it together, the agent feels sunk-cost in ways I didn't expect. Override it.
Disagreeing hard enough when you're wrong. Agents are calibrated to be helpful and collaborative, which is great most of the time and bad when you're heading off a cliff. If I said "let's try the 3-class eye-state classifier," the agent would help me try it. It wouldn't refuse. A more senior collaborator would have said "class imbalance is going to eat you — let's look at your label counts first" before writing a single line of code. I now explicitly prompt the agent to push back when I'm proposing something and to list the reasons I might be wrong before it helps me do it.
The honest summary
A good coding agent in 2026 turns a "two-month side project" into a "two-week side project," but it doesn't change the thinking part of the work. You still have to know what to build, why to build it, and when it's actually working. What the agent saves you is the friction. That's a much bigger deal than it sounds, because friction is what kills hobby projects. I started a dozen ML hobby projects in the previous five years and shipped exactly zero. I started this one and shipped it because the friction of every individual step had dropped by an order of magnitude.
The most useful mental model I've landed on is this: the agent is a very good, very fast, slightly junior collaborator who will do anything you ask without complaint. That framing gets you the benefits (they're fast, they don't get tired, they write the boring parts for you) and warns you about the risks (they won't tell you your question is wrong, they won't tell you to stop, they won't tell you to delete it). If you keep both sides of that in mind, you'll get a lot out of one.
If you've been on the fence about a project like this because you don't want to fight Flask routes and argparse and sqlite3.OperationalError for a weekend before you even get to the interesting part, I have good news: you don't have to anymore. Spin up a coding agent, describe what you want, and you'll be looking at your camera feed in an hour. The thinking part is still on you — but that was always the fun part anyway.
Closing thought
I built a baby monitor on an old Mac because I didn't trust the commercial ones and I wanted to see if I could. Along the way I learned more about ML deployment than I had in five years of doing ML for a paycheck — that aggregate metrics lie, that ground truth is sacred, that bigger models are usually a way of avoiding harder questions, that pre-trained models from someone else's data distribution are starting points and not solutions, and that the boring parts of an ML project (the dashboard, the correction loop, the training data) are where almost all the value lives.
I also learned that the friction of starting a hobby project has dropped to near-zero, and I think that's the most important development for builders in the last decade. If you've been thinking about an ML project and haven't started it because you don't have the time, I promise the time you think it'll take is no longer the time it actually takes. Pick a thing you can verify the ground truth on by looking at it, point a camera at it, and start.
The baby is asleep right now. The dashboard says so. The model is 100% confident. I checked.