When my kid was born I started shopping for a baby monitor and got nervous about every single option.
The expensive ones — Nanit, Owlet, Cubo Ai — were essentially "give us your bassinet feed and we'll send you alerts on our app." Cloud-based. Vendor accounts. Subscription tiers. Servers in places I couldn't see, hosting frames of my newborn for retention windows nobody could quite explain. Half the reviews mentioned the company being acquired or pivoting.
The cheap ones were dumb cameras with proprietary apps that streamed unencrypted and got bricked when the manufacturer stopped pushing firmware. One product line I almost bought had been the subject of a CVE the previous summer.
I have an opinion about this category that turns out to be inconvenient: I don't want a baby's bassinet camera to leave my home network. Ever. For any reason. Not even to a "trusted" vendor.
So I built one. It's a $25 IP camera clamped to the bassinet rail, an old Intel Mac that was about to become e-waste, a few thousand lines of Python, and a small MobileNetV3-Small classifier that decides whether the baby's eyes are open. Total cost: about $50. Total network egress to the public internet during normal operation: zero bytes.
It also turned into a much longer rabbit hole than I expected. This is a write-up of what I built, the design choices I made, the things I tried that didn't work, and the things I learned about deploying ML models that I think transfer to almost any "make a small system that watches a small thing" hobby project.
What I built
The system is called BILBO (Baby Intelligent Lookout & Behavior Observer, because every weekend project needs a strained backronym).
It does four things:
- Captures a frame from the bassinet camera every 4 minutes. A `launchd` job runs an `ffmpeg` command that pulls a single JPEG out of the camera's RTSP stream. No constant streaming, no buffer in someone else's data center.
- Classifies the frame with a 3-stage on-device ML pipeline. Is there a baby in the bassinet? Where is their face? Are their eyes open or closed? All three answers come from small CNNs running on the Mac's CPU.
- Stores the result in a local SQLite database alongside the frame, the model's confidence scores, and a timestamp. Indexed for fast queries. JSONL backup for paranoia.
- Pings me on Telegram when it thinks the baby is waking up. A simple "2-of-3 recent frames are Awake" rule. Snapshot included so I can see what the model saw.
There's also a Flask dashboard at localhost:5555 for me to scroll through the timeline, look at any frame, and correct labels the model got wrong. Corrections feed back into a retraining loop. The dashboard doesn't even leave the machine — it's bound to a loopback address on a private port.
The whole thing is ~3,000 lines of Python and a few hundred lines of HTML/JS for the dashboard. Anyone with a free afternoon and an old Mac could build a version of it.
Why I built it (the design principles)
Four principles drove every choice:
Privacy, by architecture, not by promise. Cloud baby monitors are always one acquisition or one breach away from being a problem. I wanted a system whose security model was "the camera is on a private VLAN and the only thing reading it is a Python script on a machine I own." If my router goes down, the monitoring still works. If the Mac is offline, the camera is just a dumb camera. Nothing in the loop is rented from a vendor.
Cheap, by reusing what I had. I had a 2014 Mac mini sitting in a drawer with a perfectly good CPU and 8GB of RAM. It hadn't shipped a macOS update in three years but it could still run Python and ffmpeg fine. I bought a $25 TP-Link Tapo C100 IP camera and a $15 gooseneck microphone stand. That was the entire hardware investment. No Raspberry Pi, no Coral TPU, no NVIDIA Jetson, no smart home hub.
Simple, by doing less. The pipeline is a launchd job that runs every 4 minutes. There's no message queue, no worker pool, no Docker, no Kubernetes, no orchestration. If something breaks, the worst case is I miss one tick and try again 4 minutes later. The whole system can be killed with launchctl unload and restarted with launchctl load. I want to be able to debug this at 3am while running on six hours of sleep, which means I want the smallest possible amount of code between me and the camera.
Learning, by actually shipping the model. I'd done plenty of ML in notebooks. I'd never had to make a model robust enough to actually rely on. A baby monitor is the perfect forcing function: the ground truth is sleeping six feet away, the failure modes are emotionally painful, and there's no boss telling me a 90% F1 is good enough.
These four principles cascaded into nearly every later decision. When I was tempted to add a feature that needed the cloud, the privacy principle vetoed it. When I was tempted to over-engineer the orchestration, the simplicity principle vetoed it. When I was tempted to use someone else's pre-trained model and skip the hard parts, the learning principle vetoed it.
System architecture
```
IP camera ──RTSP──▶ ffmpeg ──JPEG──▶ Python pipeline
(Tapo C100)         (frame grab)           │
                                           ├──▶ pixel-diff (empty bassinet?)
                                           ├──▶ MobileNet #1: presence
                                           ├──▶ MobileNet #2: face detector
                                           ├──▶ MobileNet #3: eye state
                                           │
                                           ▼
                                        SQLite ◀────▶ Flask dashboard
                                           │          (review + correct)
                                           ▼
                                      wake check
                                           │
                                           ▼
                                    Telegram alert
```
Five components:
- The camera is a $25 TP-Link Tapo C100. It supports RTSP, has IR night vision, and lives on a private network segment with no internet access. The Mac talks to it directly over the LAN.
- `ffmpeg` is the frame-grabbing primitive. A single `ffmpeg -rtsp_transport tcp -i <url> -frames:v 1 frame.jpg` call grabs one frame and exits. Runs in about 2 seconds including connection setup.
- The Python pipeline does a pixel-diff to skip empty frames, then runs the 3-stage classifier cascade on anything that's changed. It's a single `monitor.py` script invoked by `launchd` every 4 minutes. No long-running process.
- SQLite stores every frame's metadata: timestamp, classifier outputs, confidence scores, the path to the JPEG on disk. WAL mode for safe concurrent reads from the dashboard.
- The Flask dashboard is a separate `launchd` service that serves a single-page UI for reviewing the timeline and correcting model mistakes. Bound to `127.0.0.1:5555` — never exposed to the LAN, let alone the internet.
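For the curious, the frame grab is small enough to sketch in full. This is my reconstruction rather than the project's actual code; the function names and the timeout are mine:

```python
import subprocess
from pathlib import Path

def build_grab_cmd(rtsp_url: str, out_path: str) -> list[str]:
    # Mirror the one-shot ffmpeg call: connect over TCP, save one frame, exit.
    return [
        "ffmpeg", "-y",              # overwrite the previous frame
        "-rtsp_transport", "tcp",    # TCP is more reliable than UDP on a busy LAN
        "-i", rtsp_url,
        "-frames:v", "1",            # grab exactly one video frame
        str(out_path),
    ]

def grab_frame(rtsp_url: str, out_path: str, timeout_s: int = 30) -> bool:
    """Return True if ffmpeg wrote a non-empty JPEG."""
    result = subprocess.run(build_grab_cmd(rtsp_url, out_path),
                            capture_output=True, timeout=timeout_s)
    p = Path(out_path)
    return result.returncode == 0 and p.exists() and p.stat().st_size > 0
```

Building the argv as a list rather than a shell string sidesteps quoting issues with RTSP URLs that embed credentials.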
There's nothing here you couldn't draw on a napkin. That's the point.
The ML system
The interesting part is the classifier cascade. The naive version of "is the baby awake" is a single binary classifier on the full frame. That doesn't work — at the resolution of the bassinet crop, the baby's eyes are about 8 pixels tall. A small CNN has nothing to learn from.
What works is breaking the question into three smaller questions, each handled by a model that's allowed to specialize:
- Is there a baby in the bassinet? Input: a fixed crop of the bassinet center. Output: `present` or `not_present`. A binary classifier on the wide bassinet view doesn't need to see eyes — it just needs to see "is there a baby-shaped thing in this rectangle." MobileNetV3-Small handles this trivially: macro F1 0.99 against my reviewed-and-corrected ground truth.
- Where is the face? Input: the same bassinet crop. Output: a bounding box around the face, or "no face found." This is the only stage that needs spatial reasoning. I trained a custom MobileNetV3-Small with a regression head that outputs `(x1, y1, x2, y2, confidence)`. About 780 hand-corrected bbox annotations were enough for it to hit 100% detection rate on baby-present frames in the validation set.
- Are the eyes open? Input: a tight crop around the face from stage 2. Output: `eyes_open` or `eyes_closed`. Now the eyes are ~40 pixels tall instead of 8, and a small CNN has plenty of signal. Macro F1 0.91.
All three stages are MobileNetV3-Small models. ImageNet-pretrained backbones, fine-tuned on data from my own bassinet. The whole cascade runs in 80-100ms on CPU. No GPU. No accelerator hardware.
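Stripped of the actual models, the cascade's control flow is just three early exits. Here's a sketch with the classifiers stubbed out as callables; the names and return shapes are my own invention, not `monitor.py`'s:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    state: str          # "no_baby", "unknown", "eyes_open", or "eyes_closed"
    confidence: float

def classify_frame(crop,
                   presence: Callable,   # stage 1: crop -> (is_present, confidence)
                   face: Callable,       # stage 2: crop -> (x1, y1, x2, y2, conf) or None
                   eyes: Callable) -> Verdict:  # stage 3: (crop, bbox) -> (label, confidence)
    is_present, p_conf = presence(crop)
    if not is_present:
        return Verdict("no_baby", p_conf)
    bbox = face(crop)
    if bbox is None:
        # The ~1% fallback path: no face found, defer to the cloud API.
        return Verdict("unknown", 0.0)
    label, e_conf = eyes(crop, bbox)     # stage 3 sees only the tight face crop
    return Verdict(label, e_conf)
```

Each stage only runs when the previous one says it should, which is where the 80-100ms CPU budget comes from: most frames exit early.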
The cost story falls out of this for free. Before flipping the cascade to primary, I was sending every non-empty frame to GPT-4o, which charged about $0.01 per call — about $35/month of cloud spend on the question "are the baby's eyes open." After the flip, the cloud API runs only when the on-device cascade can't see a face (about 1% of frames). My OpenAI bill for this project is now about 30 cents a month, mostly there as a backstop for face-occluded edge cases.
The training data was the unglamorous half of the work. I bootstrapped from labels generated by GPT-4o calls during the "shadow mode" period (more on that below), then manually reviewed and corrected ~700 frames through the dashboard. The label priority pipeline is human correction > human review > cloud API output. I never let the cloud API's labels train the model directly without a human in the loop, because that would just teach the model to copy the cloud API's mistakes.
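The priority rule is simple enough to write down. A sketch, with hypothetical source names:

```python
# Priority order from the text: human correction > human review > cloud API output.
LABEL_PRIORITY = {"human_corrected": 3, "human_reviewed": 2, "cloud_api": 1}

def resolve_label(candidates):
    """candidates: list of (source, label) pairs for one frame.
    The label from the highest-priority source wins; unknown sources rank last."""
    if not candidates:
        return None
    _, label = max(candidates, key=lambda c: LABEL_PRIORITY.get(c[0], 0))
    return label

def counts_as_ground_truth(source):
    # Only human-touched labels may be used as evaluation truth.
    return source in ("human_corrected", "human_reviewed")
```

Writing the rule as code (instead of remembering it) is what keeps the cloud API's mistakes out of the evaluation set.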
Tradeoffs and lessons learned
This is the section I most wish someone had written before I started. Real ML projects are 10% choosing models and 90% choosing tradeoffs. Here are the ones that mattered.
Precision vs recall is not a single dial — it's a different dial per alert
The temptation is to pick a confidence threshold for the model, run it through the precision-recall curve, and find the "best" point. That's wrong. Different alerts need different operating points.
For wake detection (the Telegram ping when the baby seems to be waking up), I weight recall heavily. Missing a real wake event is much worse than getting a false alarm — I'd rather check the dashboard once for nothing than not check it when the baby is crying. So the wake alert is gated on a relatively loose 2-of-3 rule.
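The 2-of-3 gate is a one-liner. A sketch, assuming a newest-first list of per-frame labels ("awake" stands in for whatever label the pipeline maps to Awake):

```python
def wake_alert(recent_states, k=2, n=3):
    """Fire when at least k of the n most recent frames look awake."""
    return sum(s == "awake" for s in recent_states[:n]) >= k
```

The k-of-n shape makes the recall/precision lever explicit: loosen it to 1-of-3 for more recall, tighten it to 3-of-3 for more precision.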
For an edge alert I tried later (more on that in a minute), I weight precision heavily. False alarms degrade trust in the entire system — a baby monitor that pings me for nothing twice a day will eventually be muted, and once it's muted it might as well not exist. So that alert needs precision near 1.0 even if recall suffers.
The lesson: don't think about "the model's threshold." Think about each downstream alert's cost matrix and pick the threshold that fits. The model is the same; the threshold is the policy.
The model gets the obvious cases. The edge cases are 100% of what matters
My first few classifiers had numbers that looked great in aggregate. 92% accuracy! 90% F1! Then I'd look at where they were wrong, and they were always wrong on:
- Partially closed eyes (the baby blinking, half-asleep)
- Faces in deep shadow (one side of the face dark, other side lit)
- Motion blur (baby just moved their head)
- Profile views (baby turned 90 degrees, only one eye visible)
- Unusual angles (baby flopped sideways with face partially against the mattress)
These are the cases that matter most. The "easy" cases — eyes wide open looking at the ceiling, eyes squeezed shut while sleeping — are where the cloud API agreed with my model 100% of the time. They were wasted training signal. The hard cases were the actual product.
The lesson here is something I now believe deeply: aggregate metrics lie about hobby ML projects. A confusion matrix on a held-out test set is fine for a paper, but the test set is biased toward the easy cases by construction. The only way to actually know how your model is doing is to look at every frame it got wrong and ask "would I have gotten this right if I were squinting at it?"
Performance vs accuracy is usually a fake tradeoff
I spent a couple of evenings testing whether MobileNetV3-Large or 384×384 input resolution would help. Both are heavier — 2-3× the latency — and both promise more capacity for the model.
Neither moved the needle on my dataset. The bottleneck wasn't model capacity, it was input resolution at the eye level. Once I added the face detector and the eye-state model was looking at a tight 40×40 face crop instead of the wide bassinet, the small model with the small input was already sufficient. Throwing more compute at a problem the small model could already solve was a waste.
The lesson: when accuracy is plateauing, the answer is almost never "bigger model." The answer is usually "give the model better inputs." Crop tighter. Train on the actual hard cases. Add a preprocessing step. Bigger models are the most expensive way to fix a data problem.
Night vision is a separate problem
The TP-Link camera flips into IR mode after dark, which produces grayscale frames with completely different texture and contrast statistics than the daytime color frames. Models trained primarily on daytime frames fall off a cliff at night.
The first version of my pipeline used YuNet, an open-source face detector trained on adult daytime portraits. It found faces on 53% of baby-present frames. I assumed the bottleneck was something else and spent days debugging the eye-state model before realizing the eye-state model was getting empty crops half the time.
The eventual fix was training my own face detector on the actual mix of day and night frames from my own camera. Detection rate jumped from 53% to 100% on the validation set. Ten hours of bbox annotation work; one Saturday afternoon of training; biggest single accuracy win in the project.
The general lesson: a pre-trained model from someone else's data distribution is a starting point, not a solution. If your distribution is different (and "indoor IR night vision of an infant" is very different from "daytime portraits of adults"), domain-specific data wins, even if it's a small dataset. My trainable detector beat YuNet despite YuNet having literally millions of training examples.
Class imbalance is the silent killer of small ML projects
I tried adding a third class to the eye-state classifier — face_not_visible — for the cases where the baby was face-down against the mattress. Reasonable from a product standpoint. Disastrous from an ML standpoint: I had ~36 examples of face_not_visible against hundreds of eyes_open and eyes_closed. The class-weighted loss was so extreme that the model started predicting the rare class everywhere, dragging macro F1 from 71% (2-class) to 50% (3-class).
The fix was to delete the third class entirely and use a confidence threshold on the 2-class model: if the eye-state classifier isn't sure, fall back to "I don't know" and let the cloud API handle it. Cleaner, simpler, better metrics.
The lesson: don't add a class you can't reliably get 100+ examples of. Even sklearn's class_weight="balanced" can't save you when one class is 10× rarer than the others. Either gather more data or fold the rare class into "other" and handle it with a fallback path.
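The fallback version is almost no code. A sketch with an illustrative threshold (not the project's actual tuned value):

```python
EYE_CONF_THRESHOLD = 0.80   # illustrative; tune against your own validation set

def eye_state_with_fallback(probs, threshold=EYE_CONF_THRESHOLD):
    """probs: the 2-class softmax output, e.g. {"eyes_open": 0.55, "eyes_closed": 0.45}.
    Below the threshold we answer "unknown" and hand the frame to the fallback
    path, instead of training a third class we can't collect enough examples for."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else "unknown"
```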
What went wrong
The truth is most of what I learned came from things going wrong. A few of the highlights:
The deepcopy bug. I selected the "best epoch" by validation loss and saved the weights with best_state = model.state_dict().copy(). PyTorch's state_dict().copy() is a shallow copy — the dict gets a fresh outer container, but the tensor values inside it are still references to the live Parameter.data. Every subsequent optimizer.step() silently mutated the "saved" snapshot. Eight model versions went into production with last-epoch weights instead of best-epoch weights. None of them were dramatically broken, but the validation metrics in my training log described models that were never deployed. The fix was copy.deepcopy(model.state_dict()). The cost was an entire morning and a permanent paranoia about what my training metrics actually mean.
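The aliasing here is ordinary Python shallow-copy semantics, easy to demonstrate without torch at all. Below, plain lists stand in for the live parameter tensors inside a `state_dict`:

```python
import copy

# Stand-in for model.state_dict(): a dict whose values are mutable objects,
# just as a real state_dict's values are live parameter tensors.
state = {"layer.weight": [0.10, 0.20]}

shallow = state.copy()        # new outer dict, but the inner list is shared
deep = copy.deepcopy(state)   # fully independent snapshot

# Simulate optimizer.step() mutating the parameters in place.
state["layer.weight"][0] = 0.99

# The shallow "snapshot" silently moves with training; only deepcopy holds still.
```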
The schema migration drift. Mid-project I renamed a database column from shadow_birdeye_state to shadow_birdeye_present. I migrated the table. I updated the writer code. I forgot to update one of the readers (the dashboard's safety stats endpoint) and it silently broke for two days because the dashboard's failure mode was "show stale data" rather than "throw an error." I noticed only when I checked the dashboard for an unrelated reason and the numbers looked frozen in time. The general lesson: every column rename is a contract change with every consumer of that column, and SQLite will not help you find them. I now lean toward keeping legacy column names forever and routing writes through a single helper that knows the truth.
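The helper I have in mind looks roughly like this; the table and column names match the anecdote, the rest is hypothetical:

```python
import sqlite3

# The single writer that knows current column names; legacy names route through it.
COLUMN_ALIASES = {"shadow_birdeye_state": "shadow_birdeye_present"}

def record_frame(conn, **fields):
    # Fields come from our own code, not user input, so building SQL from the
    # (aliased) column names is acceptable here; values stay parameterized.
    canonical = {COLUMN_ALIASES.get(k, k): v for k, v in fields.items()}
    cols = ", ".join(canonical)
    marks = ", ".join("?" for _ in canonical)
    conn.execute(f"INSERT INTO frames ({cols}) VALUES ({marks})",
                 tuple(canonical.values()))
    conn.commit()
```

Readers still need auditing after a rename, but at least every write goes through one function that knows the truth.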
Trusting cloud API labels as ground truth. Early on I treated GPT-4o's labels as the source of truth for backtest evaluation. I spot-checked them eventually and found about 5% of "asleep" labels were wrong — usually because the baby had momentarily closed their eyes during a partial shot. Training my model to match the cloud API exactly was teaching it to copy a 5% error rate. The fix was a strict ground-truth definition: only frames I'd manually reviewed or explicitly corrected count as evaluation truth. Raw cloud API labels are training signal, not ground truth. I cannot stress enough how easy it is to fool yourself on this if you don't write the rule down.
The geometric edge alert. This one is my favorite failure because the data was so unambiguous. I wanted a "baby is pressed against the bassinet side" alert but I didn't have a classifier for it, so I tried to derive it geometrically: if the face bounding box is in the bottom 30% of the bassinet crop AND presence confidence is high AND 2-of-3 recent frames also satisfy this, fire the alert. It seemed like a reasonable heuristic. I backtested it against 7 days of cloud API labels and got recall 79%, precision 6%. Roughly 15 false positives for every true positive. After parameter sweeping, the best operating point I could find was recall 52%, precision 11%. Still alarm fatigue. I deleted the heuristic and opened a GitHub issue for a proper trained classifier. The data-driven negative result is itself a decision; it's not failure, it's information.
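The backtest arithmetic is nothing fancy; the whole evaluation reduces to a precision/recall count over the labeled week:

```python
def precision_recall(pred, truth):
    """pred/truth: parallel lists of booleans, one per frame in the backtest window."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

At 6% precision, the false-positive-to-true-positive ratio is (1 - 0.06) / 0.06 ≈ 15.7, which is where the alarm-fatigue verdict comes from.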
The "head crop seeded from cloud API" idea. Before I had a face detector, I tried to seed BIRDEYE's head crop from the cloud API's reported head position on previous frames. Babies move. By the time the next 4-minute capture arrived, the "head crop" was often a crop of an empty pillow, the baby's hip, or a corner of the bassinet rail. I should have realized this from first principles — babies move at much faster than a 4-minute timescale — but I had to actually look at the frames to see it.
Implementation details (light)
The full stack:
- Python 3.12 (PyTorch's only constraint is `<= 3.13`, so I'm locked here for now)
- PyTorch + torchvision for the MobileNetV3-Small models. No custom layers, no exotic architectures, just `torchvision.models.mobilenet_v3_small(pretrained=True)` with a swapped final layer.
- OpenCV (`opencv-python-headless`) for image I/O and the bassinet crop. Headless because I don't need GUI windows.
- `ffmpeg` for the actual RTSP frame grab. PyAV would be more "Pythonic" but `ffmpeg` is already on every Mac and the cost of one process spawn every 4 minutes is invisible.
- SQLite + WAL mode for storage. ~6ms for a 24-hour timeline query. No ORM, just `sqlite3.connect()` and parameterized queries.
- Flask for the dashboard. No frontend framework, no build step, no `node_modules`. The whole UI is one HTML file with vanilla JavaScript hitting JSON endpoints.
- `launchd` for scheduling. One `.plist` for the capture job (every 4 minutes), one for the dashboard (persistent), one for the daily retrain (manually triggered now). No cron, no systemd, no Airflow.
- `python-telegram-bot` for the alerts. Bot token in a `.env` file that's `chmod 600` and gitignored.
Total dependencies: about 12 packages. Total disk footprint of the venv: ~1.5GB (PyTorch is the heavy hitter). Total lines of Python in the project: ~3,000.
You could swap any of these out without restructuring the project. If you're on Linux, replace launchd with systemd. If you're on a Pi, replace MobileNetV3 with whatever runs on your accelerator. The architecture is independent of the implementation choices — that's intentional.
What I would improve
Knowing what I know now, the next round of improvements would target the failure modes I haven't fixed yet:
Temporal smoothing. Right now each frame is classified independently. A 3-frame median filter on the eye-state output would catch most of the spurious blinks and motion-blur false positives. The data is already in the database; this is purely a query change.
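A trailing median over the last three scores is all it takes. A sketch, assuming an oldest-first list of per-frame eyes-open scores:

```python
from statistics import median

def smooth_scores(scores, window=3):
    """Trailing median filter over the per-frame eyes-open score (oldest-first).
    A single-frame blip (a blink, motion blur) can't survive a 3-frame median."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window + 1)
        out.append(median(scores[lo:i + 1]))
    return out
```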
Multi-modal sensing. An IP camera with a microphone (or just a separate USB mic) gives me an audio stream I could run a small wake-detection classifier on. Audio reacts faster than video for a crying baby, and the two signals are independent enough that combining them via simple ensemble would meaningfully improve recall on real wake events.
Better night-vision data. I have ~5× more daytime frames than night frames, mostly because the dashboard correction workflow is naturally biased toward "frames I happened to review during the day." A targeted active-learning loop — surface the night frames with the lowest confidence and ask me to label them — would close the gap fast.
More robust face detection in occluded poses. My trainable face detector is at 100% on the validation set but I know it fails on certain occluded poses (face turned 80° away from camera, half-buried in swaddle). The training data doesn't cover those well because they're rare. An augmentation pipeline that simulates partial occlusion would help.
A trained BassinetLocationClassifier — the proper version of the geometric edge alert that didn't work. Same MobileNetV3-Small architecture as the others, binary pressed_against_side / not_pressed, bootstrapped from the cloud API's position labels. I have an issue tracking it; one Saturday's work to actually do it.
Notice that none of these are "make the model bigger" or "use a better architecture." They're all about giving the model better inputs or smoothing its outputs. That's where the wins come from at this scale.
Why this is a great ML project for engineers
If you've never deployed an ML model end-to-end, I think a project like this beats almost anything else as a learning vehicle. Three reasons:
The ground truth is sitting next to you. Most ML problems have ambiguous labels — "is this comment toxic," "is this transaction fraud" — and you have to argue with reviewers about what's correct. With a baby monitor, when the model says "asleep," you can literally look across the room and verify it. Every prediction has an immediate, unambiguous oracle. That's a luxury you almost never have in production ML, and you should use it to build deep intuition for your model's failure modes.
The feedback loop is fast. A capture every 4 minutes means a new test case every 4 minutes. The dashboard correction workflow means every wrong prediction can become a training example in seconds. You can iterate on your data (which is where the wins are) at a speed you'd never get in a corporate ML environment with weekly retraining cycles.
You feel every tradeoff in your bones. When your model misses a wake event, you find out about it because the baby is crying and your phone is silent. That's a much more visceral teacher than a degraded F1 score in a Slack alert. Every threshold tuning decision is a real decision with a real cost. Every "should I add more training data or just deploy what I have" question has a clear answer because you're the user.
The other thing this project gives you, which I didn't expect, is a forcing function for engineering discipline. When your model is shipping to a customer that's literally your child, you don't take shortcuts. You write the validation rigor you should have been writing all along. You build the dashboard you should have been building all along. You learn to be paranoid about your training metrics, your label quality, your test set leakage, your deployment pipeline. None of this is glamorous, and none of it is taught well in tutorials. It's exactly the stuff that separates "I trained a model" from "I shipped a model."
If you've been wanting to "do an ML project" and you don't know what to build, build something whose ground truth is sitting in your house.
The role of coding agents
I want to be honest about something that would have been impossible to write three years ago: most of this project was built collaboratively with a coding agent. I am not going to pretend that I did this alone, because the current generation of coding agents is genuinely a different gear for hobby projects, and I think more people should know.
Here is what an agent did better than I would have, working alone:
Boilerplate at the speed of thought. Setting up the Flask dashboard, writing the SQLite schema, implementing the Telegram bot integration, adding launchd plists, wiring argparse flags into a CLI — that's all "sit down with the docs for an hour" work that the agent does in minutes. The bottleneck shifts from "how do I write this" to "what do I want this to do."
Parallel debugging investigations. When I hit the schema migration drift bug, the agent searched every consumer of the renamed column across the codebase in a single pass, found the dashboard endpoint that was still on the old name, and correlated it with the silently-broken response shape — in one shot. I'd have done this manually with three or four grep runs and probably gotten it wrong the first time.
The discipline I skip when tired. Writing thorough commit messages, checking that documentation matches code, sanity-checking my own reasoning by spot-checking the actual data — these are the things I cut corners on at midnight. The agent doesn't get tired. Asking it to "validate this assumption against the actual database" before I commit a fix saved me from shipping the geometric edge alert with bad numbers, more than once.
Honest framing of options. When I wanted to ship the geometric edge alert and was emotionally invested in it working, the agent ran the backtest, returned the precision/recall numbers without flinching, and told me the rule was too noisy to ship. That's the kind of pushback I need from a collaborator and the kind I'd never get from a traditional Stack Overflow answer.
Here is what an agent is NOT good at, even now:
Knowing when your validation data is bad. The agent will compute whatever metric you ask it for, against whatever data you point it at. It can't tell you that your "ground truth" is actually the same source as your training labels and your accuracy is therefore meaningless. That's a judgment call that needs a human in the loop.
Knowing when to stop iterating. The agent will happily keep tuning hyperparameters or trying new architectures forever. The "this is good enough, ship it" decision is yours. It needs taste and product judgment that doesn't fit in a context window.
Telling you that the right answer is to delete your work. When the geometric edge alert backtest came back terrible, the agent did flag it — but I had to be the one to say "OK, delete this entire branch." The agent's bias is toward shipping the thing it just helped you build, not toward telling you the thing shouldn't exist.
The honest summary is: a good coding agent in 2026 turns a "two-month side project" into a "two-week side project," but it doesn't change the thinking part of the work. You still have to know what to build, why to build it, and when it's actually working. What the agent saves you is the friction. That's a much bigger deal than it sounds, because friction is what kills hobby projects. I started a dozen ML hobby projects in the previous five years and shipped exactly zero. I started this one and shipped it because the friction of every individual step had dropped by an order of magnitude.
If you've been on the fence about a project like this because you don't want to fight Flask routes and argparse and sqlite3.OperationalError for a weekend before you even get to the interesting part, I have good news: you don't have to anymore. Spin up a coding agent, describe what you want, and you'll be looking at your camera feed in an hour.
Conclusion
I built a baby monitor on an old Mac because I didn't trust the commercial ones and I wanted to see if I could. It cost about $50 in hardware. It runs entirely on my home network. It uses a small ML pipeline that I trained on data I labeled myself. It handles ~99% of frames without ever hitting the public internet. The cloud API runs as a fallback for the ~1% of frames the on-device cascade can't handle, costing me about a penny a day.
Along the way I learned more about ML deployment than I had in five years of doing ML for a paycheck. I learned that aggregate metrics lie, that ground truth is sacred, that bigger models are usually a way of avoiding harder questions, that pre-trained models from someone else's data distribution are starting points and not solutions, and that the boring parts of an ML project (the dashboard, the correction loop, the training data) are where almost all the value lives.
I also learned that the friction of starting a hobby project has dropped to near-zero, and I think that's the most important development for builders in the last decade. If you've been thinking about an ML project and haven't started it because you don't have the time, I promise the time you think it'll take is no longer the time it actually takes. Pick a thing you can verify the ground truth on by looking at it, point a camera at it, and start.
The baby is asleep right now. The dashboard says so. The model is 100% confident. I checked.