Building a Privacy-First Baby Monitor on an Old Intel Mac

This is part 1 of a two-part series. Part 1 (this post) is the build: what BILBO is, how it works, and the tradeoffs with data behind them. Part 2 is the meta: the ML lessons, the bugs that cost me a morning each, and what working with a coding agent actually felt like.

Even before my child was born, I started looking into the baby monitors on the market, and got nervous about my options.

The expensive ones — Nanit, Owlet, Cubo Ai — were essentially "give us your bassinet feed and we'll send you alerts on our app." Cloud-based. Vendor accounts. Subscription tiers. Servers in places I couldn't see, hosting frames of my newborn for retention windows nobody could quite explain. Half the reviews mentioned the company being acquired or pivoting.

The cheap ones were dumb cameras with proprietary apps that streamed unencrypted and got bricked when the manufacturer stopped pushing firmware. One product line I almost bought had been the subject of a CVE the previous summer.

I have an opinion about this category that turns out to be inconvenient: I don't want a baby's bassinet camera to leave my home network. Ever. For any reason. Not even to a "trusted" vendor.

So I built one. It's a $25 IP camera clamped to the bassinet rail, an old Intel Mac that was about to become e-waste, a few hundred lines of Python, and a small MobileNetV3-Small classifier that decides whether the baby's eyes are open. Total cost: about $40. Frames sent to a third party during normal operation: a low-single-digit percentage that the on-device models can't handle confidently, shrinking toward ~1%. (I do run a Cloudflare tunnel for dashboard access from my phone, which I'll get into — but no camera frames ever travel through the tunnel.)


What I built

The system is called BILBO (Baby Intelligent Lookout & Behavior Observer, because every weekend project needs a strained backronym).

It does four things:

  1. Captures a frame from the bassinet camera once a minute. A launchd job runs an ffmpeg command that pulls a single JPEG out of the camera's RTSP stream. No constant streaming, no buffer in someone else's data center.

  2. Classifies the frame with a 3-stage on-device ML pipeline, which I call BIRDEYE. Is there a baby in the bassinet? Where is their face? Are their eyes open or closed? All three answers come from small CNNs running on the Mac's CPU.

  3. Stores the result in a local SQLite database alongside the frame, the model's confidence scores, and a timestamp. Indexed for fast queries. JSONL backup for paranoia.

  4. Pings me on Telegram when it thinks the baby is waking up. A simple "2-of-3 recent frames are Awake" rule. Snapshot included so I can see what the model saw.
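The wake rule in item 4 fits in a few lines. A sketch, assuming per-frame state labels arrive newest-first — the function and label names are mine, not the project's:

```python
from collections import Counter

def should_alert(recent_labels, k=2, n=3):
    """Fire a wake alert when at least k of the last n frame labels are 'awake'.

    recent_labels: newest-first list of per-frame state strings,
    e.g. ['awake', 'asleep', 'awake']. Names are illustrative.
    """
    window = recent_labels[:n]
    return Counter(window)["awake"] >= k

# Two of the last three frames are 'awake' -> alert fires
should_alert(["awake", "asleep", "awake"])  # -> True
```

A k-of-n rule like this is a cheap debounce: a single spurious "awake" frame (a blink, motion blur) can't trigger a ping on its own.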

There's also a Flask dashboard at localhost:5555 for me to scroll through the timeline, look at any frame, and correct labels the model got wrong. Corrections feed back into a retraining loop. The dashboard itself is bound to localhost only — the only way to reach it from outside the Mac is through a Cloudflare tunnel that originates on the Mac as an outbound connection. No inbound ports on my router are ever opened, and no camera frames travel through the tunnel at all. More on the tunnel security model in a minute.

The whole thing is ~3,000 lines of Python and a few hundred lines of HTML/JS for the dashboard. Anyone with a free afternoon and an old Mac could build a version of it.


Why I built it (the design principles)

Four principles drove every choice:

Privacy, by architecture, not by promise. Cloud baby monitors are always one acquisition or one breach away from being a problem. I wanted a system whose security model was "the camera is on a private VLAN and the only thing reading it is a Python script on a machine I own." If my router goes down, the monitoring still works. If the Mac is offline, the camera is just a dumb camera. Nothing in the loop is rented from a vendor.

Cheap, by reusing what I had. I had a 2014 Mac mini sitting in a drawer with a perfectly good CPU and 8GB of RAM. It hadn't shipped a macOS update in three years but it could still run Python and ffmpeg fine. I bought a $25 TP-Link Tapo C100 IP camera and a $15 gooseneck microphone stand. That was the entire hardware investment. No Raspberry Pi, no Coral TPU, no NVIDIA Jetson, no smart home hub.

Simple, by doing less. The pipeline is a launchd job that runs once a minute. There's no message queue, no worker pool, no Docker, no Kubernetes, no orchestration. If something breaks, the worst case is I miss one tick and try again a minute later. The whole system can be killed with launchctl unload and restarted with launchctl load. I want to be able to debug this at 3am while running on six hours of sleep, which means I want the smallest possible amount of code between me and the camera.

Learning, by actually shipping the model. I'd done plenty of ML in notebooks. I'd never had to make a model robust enough to actually rely on. A baby monitor is the perfect forcing function: the ground truth is sleeping six feet away, the failure modes are emotionally painful, and there's no boss telling me a 90% F1 is good enough.

These four principles cascaded into nearly every later decision. When I was tempted to add a feature that needed the cloud, the privacy principle vetoed it. When I was tempted to over-engineer the orchestration, the simplicity principle vetoed it. When I was tempted to use someone else's pre-trained model and skip the hard parts, the learning principle vetoed it.


System architecture

[Figure: BILBO system architecture. A local lane with camera, ffmpeg, Python pipeline, the 3-stage MobileNet cascade, SQLite, wake check, and Telegram alert; a remote-access lane with phone, Cloudflare edge, cloudflared, and the Flask dashboard. A dashed divider emphasizes that no inbound ports are opened.]

Two lanes in that diagram matter:

  • The local lane (left) is where every frame lives. A frame is captured, classified, stored, and acted on without ever leaving the Mac. The cloud API is a dashed side-branch off the cascade, reached only when BIRDEYE says "low confidence" or "no face detected" — about 5% of frames in the first days after the flip, converging toward ~1%.
  • The remote-access lane (right) exists so I can check the dashboard from my phone. It's intentionally drawn as a separate lane because none of its boxes ever see raw camera frames: they carry dashboard HTML and JSON, nothing else. Every arrow in that lane points in a direction that doesn't require an inbound port on my router.

Five components in the local lane:

  • The camera is a $25 TP-Link Tapo C100. It supports RTSP, has IR night vision, and lives on a private network segment with no internet access. The Mac talks to it directly over the LAN.
  • ffmpeg is the pixel-grabbing primitive. A single ffmpeg -rtsp_transport tcp -i <url> -frames:v 1 frame.jpg call grabs one frame and exits. Runs in about 2 seconds including connection setup.
  • The Python pipeline does pixel-diff to skip empty frames, then runs the 3-stage classifier cascade on anything that's changed. It's a single monitor.py script invoked by launchd once a minute. No long-running process.
  • SQLite stores every frame's metadata: timestamp, classifier outputs, confidence scores, the path to the JPEG on disk. WAL mode for safe concurrent reads from the dashboard.
  • The Flask dashboard is a separate launchd service that serves a single-page UI for reviewing the timeline and correcting model mistakes. Bound to 127.0.0.1:5555 — not exposed to the LAN. Remote access from my phone goes through a Cloudflare tunnel (covered in the next section) so I never open an inbound port on my router.
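The pixel-diff gate in the pipeline can be sketched in a few lines. This uses NumPy rather than the project's OpenCV path, and the threshold is illustrative, not BILBO's tuned value:

```python
import numpy as np

def frame_changed(prev, curr, threshold=4.0):
    """Return True when the mean absolute pixel difference between two
    grayscale frames exceeds a threshold. Threshold is illustrative; the
    real value would be tuned against the bassinet scene."""
    if prev is None:  # first tick after startup: always classify
        return True
    # Widen to int16 so the subtraction of uint8 frames can't wrap around
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return float(diff.mean()) > threshold
```

Skipping classification on unchanged frames is what keeps an empty bassinet from burning CPU (or, in the old cloud-primary days, API calls) all night.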

There's nothing here you couldn't draw on a napkin. That's the point.

Reaching the dashboard from outside my LAN

One thing I wanted but couldn't get from a pure localhost-only design: check the dashboard from my phone when I'm at the grocery store or at work. The naive version of this is "forward port 5555 on the router and hit my public IP." That's the single biggest footgun in a home-hosted setup, and the whole privacy argument falls apart the moment there's a listening socket on a public IP address.

The fix is a Cloudflare tunnel. A small daemon (cloudflared) runs on the Mac and opens a single persistent outbound TLS connection to Cloudflare's edge. When I hit the dashboard's subdomain from my phone, Cloudflare terminates TLS at their edge, authenticates me with Cloudflare Access (Google SSO + a one-time email code), and then proxies the request back through the already-open tunnel to the Flask app bound to 127.0.0.1:5555 on the Mac. No inbound ports on my network are ever opened. My router's firewall stays completely closed to the world.
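For concreteness, the ingress side of that setup is a small cloudflared config file. This is a sketch — the tunnel ID, hostname, and paths are placeholders, not my real values:

```yaml
# ~/.cloudflared/config.yml — tunnel ID and hostname are placeholders
tunnel: <tunnel-uuid>
credentials-file: /Users/me/.cloudflared/<tunnel-uuid>.json

ingress:
  - hostname: bilbo.example.com
    service: http://127.0.0.1:5555   # the localhost-only Flask dashboard
  - service: http_status:404         # anything else gets a 404
```

The catch-all 404 rule at the end matters: the tunnel exposes exactly one hostname mapped to exactly one local port, nothing else.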

The security model has three layers:

  1. No direct exposure. There is no listening socket on my public IP. A port scan of my home network finds nothing. Anything reaching my Mac has to have come through the tunnel, which means it had to clear Cloudflare first.
  2. Encryption end-to-end. The phone-to-Cloudflare hop is TLS (standard HTTPS). The Cloudflare-to-Mac hop is the tunnel, also encrypted. At no point on the wire is the dashboard traffic in clear text.
  3. Cloudflare's edge protections in front of everything. Rate limiting, bot filtering, a WAF, and Cloudflare Access as the auth gate all sit between the public internet and my Flask app. I'm not asking my ~200-line Flask app to survive contact with the open internet alone.

What I find worth saying out loud is that this is a dramatically smaller thing to secure than what a commercial baby monitor does. A cloud baby monitor ships a continuous stream of video frames outbound to a vendor-run ingest service — 24/7, every frame, indefinitely, sitting in someone else's object storage with whatever retention policy their TOS happens to promise this quarter. BILBO's tunnel, by contrast, only carries bytes when I'm actively looking at the dashboard, and those bytes are a few hundred KB of HTML and JSON plus whatever cropped stills I'm currently reviewing. The attack surface is one authenticated HTTP endpoint behind an SSO wall, not a perpetual video firehose to a third-party data center.

If my Cloudflare account were compromised tomorrow, the worst case is an attacker sees the dashboard — cropped stills with model labels, no live feed, no audio, nothing older than the frame retention window. If a commercial baby monitor vendor is compromised tomorrow, the worst case is an attacker gets years of full-resolution video of every child whose parents bought the product. Those are not comparable blast radii, and I don't think the industry talks about the difference honestly enough.


The ML system

The interesting part is the classifier cascade. The naive version of "is the baby awake" is a single binary classifier on the full frame. That doesn't work — at the resolution of the bassinet crop, the baby's eyes are about 8 pixels tall. A small CNN has nothing to learn from.

What works is breaking the question into three smaller questions, each handled by a model that's allowed to specialize:

  1. Is there a baby in the bassinet? Input: a fixed crop of the bassinet center. Output: present or not_present. A binary classifier on the wide bassinet view doesn't need to see eyes — it just needs to see "is there a baby-shaped thing in this rectangle." MobileNetV3-Small handles this trivially: macro F1 0.99 against my reviewed-and-corrected ground truth.

  2. Where is the face? Input: the same bassinet crop. Output: a bounding box around the face, or "no face found." This is the only stage that needs spatial reasoning. I trained a custom MobileNetV3-Small with a regression head that outputs (x1, y1, x2, y2, confidence). About 780 hand-corrected bbox annotations were enough for it to hit 100% detection rate on baby-present frames in the validation set.

  3. Are the eyes open? Input: a tight crop around the face from stage 2. Output: eyes_open or eyes_closed. Now the eyes are ~40 pixels tall instead of 8, and a small CNN has plenty of signal. Macro F1 0.97.

All three stages are MobileNetV3-Small models. ImageNet-pretrained backbones, fine-tuned on data from my own bassinet. The whole cascade runs in 80–100 ms on CPU. No GPU. No accelerator hardware.
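The control flow of the cascade is simple enough to sketch. The three callables below are stand-ins for the MobileNet stages, and the confidence threshold is illustrative — the real pipeline's values and return types may differ:

```python
def classify_frame(frame, presence, detect_face, eye_state, conf_floor=0.8):
    """Run a 3-stage cascade. `presence`, `detect_face`, and `eye_state`
    stand in for the three MobileNetV3-Small models; conf_floor is an
    illustrative threshold.

    Returns one of: 'not_in_bassinet', 'low_confidence' (cloud fallback),
    'awake', 'asleep'."""
    if presence(frame) != "present":
        return "not_in_bassinet"
    box, conf = detect_face(frame)
    if box is None or conf < conf_floor:
        return "low_confidence"          # hand off to the cloud fallback
    face_crop = frame  # real code would crop `frame` to `box` here
    label, conf = eye_state(face_crop)
    if conf < conf_floor:
        return "low_confidence"
    return "awake" if label == "eyes_open" else "asleep"
```

The shape of the function is the point: each stage either answers its narrow question confidently or punts, and only the punts ever reach the cloud.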

The training data was the unglamorous half of the work. I bootstrapped from labels generated by GPT-4o calls during the "shadow mode" period (when the cloud API was the authority and the on-device models were running in parallel), then manually reviewed and corrected ~700 frames through the dashboard. The label priority pipeline is human correction > human review > cloud API output. I never let the cloud API's labels train the model directly without a human in the loop, because that would just teach the model to copy the cloud API's mistakes.
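The label-priority rule reduces to a first-match lookup over the available sources. A minimal sketch, with hypothetical source names:

```python
# Highest-trust source first; names are illustrative, not BILBO's schema
PRIORITY = ["human_correction", "human_review", "cloud_api"]

def training_label(candidates):
    """Pick the label to train on from the available sources.

    candidates: dict mapping source name -> label, e.g.
    {'cloud_api': 'eyes_open', 'human_correction': 'eyes_closed'}.
    In BILBO, cloud labels additionally require a human in the loop
    before they are allowed into a training run."""
    for source in PRIORITY:
        if source in candidates:
            return candidates[source], source
    return None, None
```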


Tradeoffs with numbers

Most of what I've read about hobby ML is qualitative — "I tried the bigger model and it didn't help," "the pre-trained detector didn't work." I found that unsatisfying when I was trying to decide what to try myself, so here are the actual numbers from BILBO's history for the decisions that moved (or didn't move) the needle.

Hardware cost

| Item | Cost | Notes |
|---|---|---|
| TP-Link Tapo C100 IP camera | $25 | RTSP, 1080p, IR night vision |
| Gooseneck mic stand (mount) | $15 | Clamps to the bassinet rail |
| 2014 Mac mini | $0 | Already owned; would otherwise be e-waste |
| Python, PyTorch, ffmpeg, SQLite, Flask | $0 | All open source |
| Total | ~$40 | One-time; no recurring |

The most important entry in that table is the $0 next to "Mac mini." If you had to buy a computer for this, the economics shift — though even a $150 secondhand Mac mini would amortize against a few months of a commercial baby monitor subscription, and then every month after that is free.

Cloud-primary vs. on-device-primary

Before I had the cascade working, every non-empty frame went to GPT-4o — the on-device models existed, but the cloud API was authoritative. Flipping the pipeline so BIRDEYE decides first and the cloud API only runs as a fallback on the frames the cascade can't handle changed the operating characteristics of the whole system:

| Metric | Cloud-primary (old) | On-device-primary (current) |
|---|---|---|
| Cloud API calls per day | ~150 (every non-empty frame) | ~28 (BIRDEYE fallback only, first day post-flip) |
| Cloud spend per month (ballpark at ~$0.01/call) | ~$45 | ~$8 projected, dropping as BIRDEYE stabilizes |
| Median decision latency | ~1,200 ms (network RTT + model) | ~80–100 ms (CPU only) |
| Network egress containing baby imagery | Every non-empty frame | ~5% of frames (early) |
| Works if home internet is down | No | Yes |
| Works if OpenAI has an outage | No | Yes (cloud fallback degrades to "low confidence") |

All the numbers in that table are pulled straight from monitor.db. The "post-flip" column is from the first full day after cutting over — I'm watching it for a week before I claim a steady-state figure, but the order-of-magnitude story is already clear: call volume fell from ~150/day to ~28/day, which is roughly a 5× drop immediately, and should converge toward ~1% of frames (the BIRDEYE fallback rate) as I iron out the last few classes of inputs the cascade doesn't handle yet.

The cost delta is the headline, but the reliability deltas are the bigger wins. Before the flip, an OpenAI outage would make my baby monitor go dark. After the flip, it keeps running — the cloud fallback degrades gracefully into a "low confidence" marker on the small fraction of frames it would have been asked about.

Here's the actual per-day call volume pulled straight from monitor.db:

[Figure: stacked bars of inference calls per day, April 3–13, 2026. Cloud API calls (orange) dominate early at ~150–200 per day; on April 13 on-device BIRDEYE (green) takes over with 479 local calls versus 28 cloud calls. A dashed vertical line marks "BIRDEYE flipped to primary."]

You can see the flip in the last bar on the right — local (green) dominates and cloud (orange) is a thin stub. The few cloud calls that remain on that day are the real operating cost of the new system: BIRDEYE's fallback path, fired when no face is visible or eye-state confidence is below threshold. That's the fallback share from the table above in its raw form — ~5% on day one, on its way down toward the ~1% steady state.
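That per-day breakdown is one GROUP BY away. A sketch against an in-memory database with a hypothetical inferences(ts, source) table — the real monitor.db schema may differ:

```python
import sqlite3

# Hypothetical schema for illustration; the real monitor.db layout may differ
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inferences (ts TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO inferences VALUES (?, ?)",
    [("2026-04-13", "birdeye")] * 479 + [("2026-04-13", "cloud")] * 28,
)

# Per-day call volume by inference source
rows = conn.execute(
    """SELECT substr(ts, 1, 10) AS day, source, COUNT(*)
       FROM inferences GROUP BY day, source ORDER BY day, source"""
).fetchall()
print(rows)  # [('2026-04-13', 'birdeye', 479), ('2026-04-13', 'cloud', 28)]
```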

Did a bigger model help? (No.)

I spent a couple of evenings testing whether more model capacity or higher input resolution would move the eye-state metric. They didn't:

| Configuration | Parameters | Per-frame latency | Eye-state macro F1 |
|---|---|---|---|
| MobileNetV3-Small, 224×224 (chosen) | ~2.5M | ~40 ms | 0.97 |
| MobileNetV3-Large, 224×224 | ~5.4M | ~2× slower | no measurable gain |
| MobileNetV3-Small, 384×384 | ~2.5M | ~2× slower | no measurable gain |
| MobileNetV3-Large, 384×384 | ~5.4M | ~3× slower | no measurable gain |

The ceiling wasn't the model — it was the pixels. Once I added a face detector and the eye-state model was looking at a tight ~40-pixel face crop instead of an 8-pixel-tall eye region on the full bassinet view, the small model had all the signal it needed. A bigger model staring at the same wide crop would have learned the same thing with more parameters.

The general lesson from this table shows up in part 2. For now: if you're tempted to scale up the model, check whether you could just crop tighter first.

Pre-trained face detector vs. one trained on my own data

This was the biggest single accuracy win in the project — and the least expected:

| Metric | YuNet (pre-trained ONNX) | Custom MobileNetV3-Small |
|---|---|---|
| Training data | Millions of adult daytime portraits | ~780 hand-labeled frames from my own bassinet |
| Detection rate on baby-present frames | 53% | 100% (validation set) |
| Night vision (IR grayscale) | Falls off a cliff | Matches daytime |
| Single-frame latency | ~10 ms | ~40 ms |
| Effort to produce | 0 (download + load) | ~10h bbox labeling + one afternoon training |

YuNet had every structural advantage — a much larger training set, a mature architecture, years of tuning from a team smarter than me. It lost anyway because its training distribution (adult daytime portraits) was wrong for my task (IR night footage of a swaddled infant). A one-afternoon model with ~780 examples from the right distribution beat a model trained on millions of examples from the wrong one.

Per-stage classifier results

For completeness, here are the final metrics for each stage of the cascade against my reviewed-and-corrected ground truth:

| Stage | Model | Input | Output | Metric (validation set) |
|---|---|---|---|---|
| 1. Presence | MobileNetV3-Small | Bassinet crop | present / not_present | macro F1 ~1.00 (176 val frames) |
| 2. Face detection | MobileNetV3-Small (regression head) | Bassinet crop | bbox or no_face | 100% detection rate on present frames |
| 3. Eye state | MobileNetV3-Small | ~40×40 face crop from stage 2 | eyes_open / eyes_closed | macro F1 0.97 (87 val frames) |

The whole cascade runs in 80–100 ms on the Mac mini's CPU. No GPU, no accelerator, no quantization. The bottleneck in the end-to-end tick is the ffmpeg RTSP grab, not the model inference.

The eye-state F1 number is the interesting one, because it hides a sample-size story that the aggregate metric doesn't:

[Figure: confusion matrix for the eye-state classifier, model v_20260412_214634, 87 validation frames, macro F1 0.97. Rows are actual labels, columns predicted. eyes_open row: 8 correct, 1 predicted eyes_closed (89%/11%). eyes_closed row: 78 correct, 0 wrong (100%). Only 9 eyes_open frames in the validation set, so a single miss drops eyes_open recall to 0.89.]

Macro F1 0.97 sounds great until you notice there are only 9 eyes_open frames in the validation set. One missed eyes_open drops that class's recall to 0.89 and drags macro F1 with it. The model is actually perfect on eyes_closed (78/78) and one miss away from perfect on eyes_open — but the rare class is eating the score, and the fix isn't a better classifier, it's more eyes_open validation data. This is the "class imbalance is the silent killer" story from part 2, made concrete.
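To make the sample-size effect concrete, here's the macro-F1 arithmetic on exactly that matrix. The helper is mine, not BILBO's code:

```python
def macro_f1(cm):
    """Macro F1 from a confusion matrix {actual: {predicted: count}}."""
    classes = list(cm)
    f1s = []
    for c in classes:
        tp = cm[c][c]
        fn = sum(v for p, v in cm[c].items() if p != c)   # missed c's
        fp = sum(cm[o][c] for o in classes if o != c)     # wrongly called c
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# The validation matrix from the figure: one missed eyes_open frame
cm = {
    "eyes_open":   {"eyes_open": 8, "eyes_closed": 1},
    "eyes_closed": {"eyes_open": 0, "eyes_closed": 78},
}
round(macro_f1(cm), 2)  # -> 0.97
```

One flipped frame in the eyes_open row moves the macro score by about 0.03 — that's the leverage a 9-frame class has over the headline number.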


Implementation details (light)

The full stack:

  • Python 3.12 (the only hard constraint is that PyTorch supports Python up to 3.13, so I'm pinned in this range for now)
  • PyTorch + torchvision for the MobileNetV3-Small models. No custom layers, no exotic architectures, just torchvision.models.mobilenet_v3_small(pretrained=True) with a swapped final layer.
  • OpenCV (opencv-python-headless) for image I/O and the bassinet crop. Headless because I don't need GUI windows.
  • ffmpeg for the actual RTSP frame grab. PyAV would be more "Pythonic" but ffmpeg is already on every Mac and the cost of one process spawn per minute is invisible.
  • SQLite + WAL mode for storage. ~6ms for a 24-hour timeline query. No ORM, just sqlite3.connect() and parameterized queries.
  • Flask for the dashboard. No frontend framework, no build step, no node_modules. The whole UI is one HTML file with vanilla JavaScript hitting JSON endpoints.
  • launchd for scheduling. One .plist for the capture job, one for the dashboard (persistent), one for the daily retrain (manually triggered now). No cron, no systemd, no Airflow.
  • python-telegram-bot for the alerts. Bot token in a .env file that's chmod 600 and gitignored.
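The launchd side is correspondingly small. An illustrative capture-job plist — label and paths are placeholders, not my real ones:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- Illustrative plist: label and paths are placeholders -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.bilbo.capture</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/python3</string>
    <string>/Users/me/bilbo/monitor.py</string>
  </array>
  <key>StartInterval</key><integer>60</integer>
  <key>StandardErrorPath</key><string>/tmp/bilbo.err</string>
</dict>
</plist>
```

StartInterval 60 is the entire scheduling story: launchd runs the script once a minute and that's the whole pipeline cadence.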

Total dependencies: about 12 packages. Total disk footprint of the venv: ~1.5GB (PyTorch is the heavy hitter). Total lines of Python in the project: ~3,000.

You could swap any of these out without restructuring the project. If you're on Linux, replace launchd with systemd. If you're on a Pi, replace MobileNetV3 with whatever runs on your accelerator. The architecture is independent of the implementation choices — that's intentional.


What's in the dashboard (and why)

The dashboard is the part of the project I underestimated at the start and ended up spending the most time on. It's one HTML file, one CSS file, one JavaScript file, and a Flask app with ~40 JSON endpoints. No framework. No build step. Every panel on it earns its place by answering a specific question I kept asking the monitor out loud.

Here's what's on the page, top to bottom, and what each piece is for:

1. Status bar (sticky header). Current state (Awake / Asleep / Not in bassinet), how long BILBO has been in that state, any active alerts, clock, and a health indicator for the capture job. This is the "I just glanced at the dashboard on my phone at 2am, is everything OK" view. Everything else on the page is drill-down.

2. Live frame (hero card). The latest captured JPEG with a countdown to the next capture. This exists because every other metric on the page is derived, and sometimes I want to see the unprocessed pixels myself — especially when the classifier is claiming something counterintuitive. The countdown answers "did the launchd job die?" at a glance.

3. Timeline strip. A horizontal day-view showing every frame as a colored block (in bassinet / awake / out of bassinet), with date navigation. Click a block to open it. This is the main "what happened today" view, and it's how I actually review periods I wasn't watching live.

4. Daily bassinet time bar chart. Daily hours-in-bassinet over the last 7/14/30 days. Not a safety feature — it's context. Sleep volume is one of the few things that actually tracks how a newborn is doing week over week, and having it right next to the rest of the monitor data means I never bother opening a separate sleep-tracker app.

5. Block detail panel (opens on click). This is where all the real work happens. When you click a timeline block, it opens a frame-by-frame viewer with:

  • A scroll-through of every frame in the block, with the face bounding box overlaid
  • A per-frame state dropdown for corrections (eyes_open / eyes_closed / face_not_visible / not_in_bassinet)
  • A "bulk-label" option that applies one label to every frame in the block at once — critical for efficiency, because a 30-minute nap is ~30 frames and I am not clicking each one
  • A "Reviewed" checkbox that promotes a block from "model output" to "human-confirmed ground truth" — this is what separates validation data from raw training signal
  • A "Run Inference" button that re-executes the current BIRDEYE model on the frame and updates the stored classification — the way I sanity-check a freshly trained model on frames it's gotten wrong historically
  • A face-box draw tool for correcting wrong bounding boxes, which feeds the face-detector retraining

6. Pending corrections table. Every correction I've made that hasn't yet been folded into a training run, broken down by class (eyes_open corrected to eyes_closed happens a lot; not_in_bassinet corrected to eyes_open is rare). This is a running to-do list for the retraining loop. Empty state reads "all corrections have been used in training" — I look at this before deciding whether a retrain is worth it.

7. BIRDEYE classifiers card. The production side of the ML system. For each classifier (presence, face detection, eye state) it shows: the current deployed version, training-set size, validation macro F1, confusion matrix, and — critically — the production metrics for the last 6h/12h/24h/7d computed against frames I've since corrected. The training metrics tell me what the model learned on a frozen split; the production metrics tell me how it's doing on frames the validation set never saw. They diverge over time and that divergence is the signal to retrain. This card is also where the "Retrain Model" button lives (with a "skip face detector" option because retraining the face detector takes ~60 min and usually isn't necessary).

8. Pipeline card. Operational health: cloud API spend (current prod), BIRDEYE average inference latency, and a count of capture gaps > 10 min. The cost number is the one I watch most after a model deploy — if it ticks up, it means BIRDEYE is dropping more frames into the cloud fallback, which means the new model is worse than the old one in a way the validation metrics missed.

9. Recent events table. A chronological list of state transitions with durations (Asleep 2h 14m, Awake 6m, Out of bassinet 45m, ...). This is what I'd actually want if the rest of the dashboard didn't exist — it's the minimum viable view of "what has the baby been doing." I kept it because it's the fastest way to spot a stuck-state bug where the timeline is drawing one giant block because the classifier has been agreeing with itself too long.

The thing I want to call out about this list is that every panel exists because of a specific question I found myself asking the monitor out loud, often at 3am. The status bar answers "is it awake?" The live frame answers "is the classifier gaslighting me?" The block detail panel answers "what did the model actually see when it said that?" The production metrics answer "has the model degraded since the last deploy?" If I'd tried to design the dashboard upfront I'd have ended up with half the right features and twice the wrong ones. Letting the panels accumulate organically — each one in response to a moment of confusion — is the thing I'd do the same way again.


What I would improve

Knowing what I know now, the next round of improvements would target the failure modes I haven't fixed yet:

Temporal smoothing. Right now each frame is classified independently. A 3-frame median filter on the eye-state output would catch most of the spurious blinks and motion-blur false positives. The data is already in the database; this is purely a query change.
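For binary labels, a 3-frame median filter is just a sliding majority vote. A sketch of what that change could look like, with illustrative label names:

```python
def smooth(labels, window=3):
    """Sliding majority vote (the median, for binary labels) over the
    per-frame classifier output. A lone spurious blip gets overwritten
    by its neighbours; window size is illustrative."""
    out = []
    for i in range(len(labels)):
        lo = max(0, i - window // 2)
        hi = min(len(labels), i + window // 2 + 1)
        win = labels[lo:hi]
        out.append(max(set(win), key=win.count))  # most common label wins
    return out

smooth(["closed", "closed", "open", "closed", "closed"])
# the lone 'open' blip is smoothed away
```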

Multi-modal sensing. An IP camera with a microphone (or just a separate USB mic) gives me an audio stream I could run a small wake-detection classifier on. Audio reacts faster than video for a crying baby, and the two signals are independent enough that combining them via simple ensemble would meaningfully improve recall on real wake events.

Better night-vision data. I have ~5× more daytime frames than night frames, mostly because the dashboard correction workflow is naturally biased toward "frames I happened to review during the day." A targeted active-learning loop — surface the night frames with the lowest confidence and ask me to label them — would close the gap fast.

More robust face detection in occluded poses. My trainable face detector is at 100% on the validation set but I know it fails on certain occluded poses (face turned 80° away from camera, half-buried in swaddle). The training data doesn't cover those well because they're rare. An augmentation pipeline that simulates partial occlusion would help.
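A minimal version of that augmentation — pasting a flat patch over part of the training crop to mimic a swaddle — might look like this. Patch value and size bounds are illustrative, not a tuned recipe:

```python
import random
import numpy as np

def random_occlusion(img, max_frac=0.4, rng=None):
    """Paste a flat gray rectangle over a random region of the image to
    simulate partial occlusion (e.g. a swaddle edge). max_frac bounds the
    patch size relative to the image; all values are illustrative."""
    rng = rng or random.Random()
    h, w = img.shape[:2]
    ph = rng.randint(1, max(1, int(h * max_frac)))
    pw = rng.randint(1, max(1, int(w * max_frac)))
    y, x = rng.randint(0, h - ph), rng.randint(0, w - pw)
    out = img.copy()                 # leave the original frame untouched
    out[y:y + ph, x:x + pw] = 128    # flat mid-gray patch, fabric-like
    return out
```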

A trained BassinetLocationClassifier — the proper version of an edge alert I tried and had to delete. Same MobileNetV3-Small architecture as the others, binary pressed_against_side / not_pressed, bootstrapped from the cloud API's position labels. I have an issue tracking it; one Saturday's work to actually do it. (The story of why I deleted the original attempt is in part 2.)

Notice that none of these are "make the model bigger" or "use a better architecture." They're all about giving the model better inputs or smoothing its outputs. That's where the wins come from at this scale.


Conclusion

I built a baby monitor on an old Mac because I didn't trust the commercial ones and I wanted to see if I could. It cost about $40 in hardware. It runs entirely on my home network. It uses a small ML pipeline that I trained on data I labeled myself. Most frames never hit the public internet; a shrinking tail ends up at the cloud API as a low-confidence fallback, for a few cents a day.

That's the build. In part 2, I get into the meta: what this project taught me about deploying ML under real constraints, the bugs that cost me a morning each, and what working with a coding agent actually felt like — including the parts where it didn't help.

The baby is asleep right now. The dashboard says so. The model is 100% confident. I checked.