OmniLab banner

OmniLab

22 devlogs
51h 26m 32s


OmniLab is an interactive, Iron Man-inspired Heads-Up Display (HUD) system that runs entirely locally to ensure zero latency. Developed by me (EngThi), it uses Python and MediaPipe for real-time hand gesture tracking via webcam, and the native Web Speech API for voice recognition. This data is sent via WebSockets to a local FastAPI server, which renders a 3D interface in the browser using Three.js.

This project uses AI

I used the Gemini CLI and Perplexity as pair-programming assistants and mentors. They helped me set up the initial project structure, autocomplete boilerplate code, debug errors, and fix syntax mistakes in what I was writing. I also used them to bounce architectural ideas around (such as keeping real-time computer vision local to avoid latency). I strictly followed a ‘no black-box’ rule: I did not let the AIs just generate the project for me. Any code snippet provided by the AI was reviewed, and I asked for explanations of the parts I didn’t fully grasp to ensure I understood the underlying logic (like WebSocket communication and MediaPipe mechanics) and remained the actual developer of the system.

Demo Repository


ChefThi

I’m writing this devlog in a rush. My laptop battery died, and since I’m editing this post on my phone (there’s no extension here to add commit changelogs), it ended up looking a bit hazy.

Summary: I spent about two weeks trying to ship this project, but Hack Club does not accept project demos on platforms that don't have a kind of "forever" free quota, which is exactly the type of platform (Render and Railway) I was using. Yesterday I learned that Hack Club could provide servers for Hack Club members, so I signed up and was approved right away. I spent the whole day testing this with MediaPipe Tasks for Web and, after a long time and with the help of the Gemini CLI, I finally reached the point of deploying the project.

Unfortunately, Playwright doesn't work there because the server's datacenter IP address gives it away: since the traffic isn't coming from a regular user, sites end up blocking the web requests.

ChefThi
  • Final cleanup of README and confirm Gemini 3 fallback router (84a9f86)
    to
  • Fix: migrate to Stealth 2026 context manager (51ef0f7)
ChefThi
  • Fix hand tracking toggle, Render headless mode, and Playwright screenshot extraction (42a9636)
    to
  • Refactor search to use direct Playwright with --no-sandbox for Render stability (d1637ed)

I refactored the browser search flow to stop relying on the MCP bridge for simple navigation and screenshots.

Replaced the old mcp_bridge.call_tool("browser_navigate") + screenshot logic with a dedicated capture_screenshot(url) helper that uses direct Playwright:

Key changes in server.py:

  • New async function with chromium.launch(headless=True, args=["--no-sandbox", ...])
  • Applied playwright_stealth + proper wait_until="networkidle" + 5s buffer for heavy JS
  • Screenshot as JPEG quality 60, returned as clean base64
  • Updated both manual search and HOMES_SEARCH_BROWSER paths
  • Improved error messages and status updates (“STARTING_PLAYWRIGHT”, “SEARCH_COMPLETE”)

The flow is now simpler, faster to debug, and much more reliable on Render/cloud deployments where sandbox and shared memory can cause issues.
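
For reference, here is a minimal sketch of what that helper can look like with plain async Playwright; the playwright_stealth patching and the status updates from the real server.py are left out, so this is an illustration rather than the exact code:

```python
import base64
from playwright.async_api import async_playwright

async def capture_screenshot(url: str) -> str:
    """Open `url` in headless Chromium and return a base64-encoded JPEG screenshot."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            # --no-sandbox / --disable-dev-shm-usage avoid sandbox and shared-memory
            # issues on Render-style containers.
            args=["--no-sandbox", "--disable-dev-shm-usage"],
        )
        page = await browser.new_page()
        try:
            # Wait for the network to go quiet, then give heavy JS an extra 5 s buffer.
            await page.goto(url, wait_until="networkidle", timeout=60_000)
            await page.wait_for_timeout(5_000)
            raw = await page.screenshot(type="jpeg", quality=60)
            return base64.b64encode(raw).decode("ascii")
        finally:
            await browser.close()
```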

Quick but focused session after classes. Seeing the screenshot arrive cleanly in the HUD without MCP complexity felt like removing unnecessary weight. The tactical interface stays responsive while the agent does real web work.

I still need to work on and improve all of this. I'll have to add better logs for the system and especially the Playwright screen (not all of them appear currently). Also, while I was preparing this devlog, I received a notification on Slack that the reviewer rejected my ship, saying that Render wasn't allowed and that I should use something like Vercel or Cloudflare Pages… I didn't understand, because this project is mostly backend, and he wanted deployment on sites that only work with static architectures.


Comments

ChefThi · 2 days ago

I didn’t want to format the rest of the post as .md
-

ChefThi
  • feat: add hand-tracking toggle and real-time HUD movement logic (921aaaf)
    to
  • feat: enable dual-source vision (cloud + local) for remote tracking (4744461)

Today I tried to make the vision pipeline way more reliable by implementing dual-source frame acquisition in vision.py.

What changed:

  • Primary source: frames from cloud/WebSocket frame_queue
  • Automatic fallback: if no cloud frame arrives within 10ms, switches to local cv2.VideoCapture(0)
  • Added source tracking (“CLOUD” or “LOCAL”) and updated overlay text to show it clearly alongside gesture name
  • Kept full MediaPipe HandLandmarker processing unchanged
  • Safe cleanup: webcam is only released if it was actually opened

The gesture loop now stays alive even if the remote browser connection drops temporarily — perfect for long stealth automation sessions or when running in mixed environments.
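
A rough sketch of the fallback logic, with hypothetical names (the real vision.py keeps more state and handles the queue differently, but the idea is the same):

```python
import queue

import cv2

def acquire_frame(frame_queue: queue.Queue, state: dict):
    """Return (frame, source): a cloud frame if one arrives within 10 ms, else the local webcam."""
    try:
        # Primary source: a frame forwarded over the cloud/WebSocket bridge.
        return frame_queue.get(timeout=0.01), "CLOUD"
    except queue.Empty:
        # Fallback: open the local webcam lazily, only the first time it is needed.
        if state.get("cap") is None:
            state["cap"] = cv2.VideoCapture(0)
        ok, frame = state["cap"].read()
        return (frame, "LOCAL") if ok else (None, "NONE")

def release(state: dict) -> None:
    """Safe cleanup: release the webcam only if it was actually opened."""
    if state.get("cap") is not None:
        state["cap"].release()
```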

Focused session after classes. Seeing the overlay switch smoothly between SRC: CLOUD and SRC: LOCAL while gestures kept working felt like removing the last weak link in the chain. No more vision dying when the stealth agent goes off-screen.

Combined with the recent persistent stealth Perplexity tools, MCP bridge, and robust demo/fallback HUD, OmniLab is becoming a true resilient local agent.

Also, as you can see, the model was in a mess when I went to prepare the devlog, so it didn't even receive the form that was sent. But this part is fully functional.

Unfortunately, the HUD is being disobedient… I couldn't get it to actually follow the hand gestures. Locally things are going well, but for changes to show up on Render I need to push them to the remote and wait for the build, which delays testing and debugging a bit. And the URL search does not work.

ChefThi
  • feat: implement browser-side camera capture for Render deployment (02abd48)
  • feat: implement full cloud-vision pipeline for Render (clean code) (c5ecb5d)

I'm making this devlog to carry over the commit changelogs (if I use another machine or my phone they aren't available, because the changelog feature comes from a browser extension for Chrome/Firefox).

Here I changed how the deployment on Render accessed the processed images and kept testing the system. I was trying to make the HUD work like it does locally (although in this case there will be an annoying delay anyway: the distance, latency, and processing between the user's browser and the Render server cause it).
Frame capture had not been implemented here yet. I think somewhere in the middle of the edits I ended up getting lost and didn't make many improvements to this part of the system.
In this part I used the Gemini CLI to explain things to me and help debug how to make the system work on Render. It helped, but I felt I could have done more… I kind of didn't get anywhere.

ChefThi
  • refactor: stabilize MCP bridge and pivot branding to HOMES (1ca3baa)
    to
  • feat: implement persistent stealth automation and Perplexity search engine (b40e296)

Today I pushed the automation layer to the next level with full persistent stealth capabilities using Playwright.

Created a suite of tools that reuse real Chrome sessions (launch_persistent_context + user_data_dir=./.playwright_data) and apply playwright_stealth to bypass detection. Focused on Perplexity AI as a powerful external brain:

New scripts added:

  • stealth_agent.py — headless/off-screen stealth navigation with anti-detection flags
  • perplexity_agent.py — persistent login flow (manual Gmail step + 180s wait)
  • find_history.py — searches and extracts OmniLab-related threads from sidebar
  • perplexity_chat.py — automates follow-up questions in existing threads
  • Helper scripts for layout inspection and screenshot validation
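
The core idea behind all of these scripts is the persistent context. A minimal sketch, leaving out the playwright_stealth patches and anti-detection flags that the real scripts apply:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Reuse a real Chrome profile on disk so cookies, sessions and logins survive restarts.
    ctx = p.chromium.launch_persistent_context(
        user_data_dir="./.playwright_data",
        headless=False,  # a visible window is needed for the one-time manual Gmail login
    )
    page = ctx.pages[0] if ctx.pages else ctx.new_page()
    page.goto("https://www.perplexity.ai", wait_until="domcontentloaded")
    # ...search, read the sidebar history, send follow-up questions here...
    ctx.close()
```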

Intense after-class session. Seeing the stealth agent open Perplexity, find old OmniLab threads, and send a clean follow-up without triggering any blocks felt like unlocking a new superpower. The HUD can now decide to query Perplexity semantically via MCP and bring rich answers back to me.

Combined with the recent MCP bridge and robust demo mode, OmniLab is evolving into a true local command center that can use the entire web intelligently and feed my HOMES pipeline with high-quality data. Next: wire Perplexity actions directly into the gesture/voice flow and add mock versions for flawless demos.

**P.S. I used AI to structure this post: I organized and went through the things I had worked on and made a briefing of them. I also used the CLI to run the scripts and some tests, fix my errors, and speed up that testing part.**

ChefThi
  • Revise README for improved clarity and structure (67f9237)
  • Update model_id to ‘gemini-3.1-flash-lite’ (ea6b70f)
  • Update AI technology version in README (41cfad9)
  • feat: implement demo mode mocks and prepare MCP architecture (5e3e9f6)
  • feat: implement McpAgentBridge for semantic browser automation (344ab15)

Today I took the biggest leap toward a real local agent: implemented the official Playwright MCP (Model Context Protocol) bridge.

Instead of fragile direct navigation or pyautogui clicks, OmniLab now talks to the browser through semantic tools. The new McpAgentBridge class starts the @playwright/mcp server via stdio, manages ClientSession, lists available tools, and executes them cleanly with call_tool().

  • Full McpAgentBridge with start/list_tools/call_tool/stop
  • Integrated into FastAPI lifespan alongside the existing browser setup
  • Updated handle_agent_action() so BROWSER_SEARCH_RECIPE now triggers real MCP tools
  • Cleaned up old direct calls and unused imports
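
For context, here is roughly what such a bridge looks like when built on the official `mcp` Python SDK. This is a simplified sketch under that assumption, not the exact class in server.py:

```python
from contextlib import AsyncExitStack

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

class McpAgentBridge:
    """Thin wrapper around a Playwright MCP server started via stdio (sketch)."""

    def __init__(self) -> None:
        self._stack = AsyncExitStack()
        self.session: ClientSession | None = None

    async def start(self) -> None:
        # Launch the @playwright/mcp server as a subprocess and speak MCP over stdio.
        params = StdioServerParameters(command="npx", args=["@playwright/mcp@latest"])
        read, write = await self._stack.enter_async_context(stdio_client(params))
        self.session = await self._stack.enter_async_context(ClientSession(read, write))
        await self.session.initialize()

    async def list_tools(self):
        return (await self.session.list_tools()).tools

    async def call_tool(self, name: str, arguments: dict | None = None):
        return await self.session.call_tool(name, arguments or {})

    async def stop(self) -> None:
        await self._stack.aclose()
```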

The flow is now: Gesture/Voice → Gemini decides action → MCP executes semantically in real Chromium → status update back to the tactical HUD.

It still needs more tool mappings and human-like delays, but the foundation is solid and future-proof.

Seeing "MCP Agent Connected to Playwright Tools" and the first semantic action fire without breaking the HUD felt like JARVIS finally getting hands. No more "just describe the frame" — now it can actually DO things on the web and feed my HOMES pipeline.

ChefThi
  • feat: final tactical polish for SHIP - English demo mode and agent search fix (0710e20)
  • feat: implement high-fidelity International Demo Mode with mouse tracking and English localization (11bfc83)
  • docs: translate to English and enable automated GitHub Pages deployment (d7b8db4)
  • fix: ensure Demo Mode activates on GitHub Pages by handling Mixed Content WebSocket errors (71efe67)

Wrapping up things for this week of LockIn, I decided to publish the DEMO via GitHub Pages as I had been doing, explaining that I made it as a mock for the reviewer and, in general, to let people test a bit of how it works without needing to install all the dependencies (Playwright, GEMINI_API_KEY, camera, etc.).

The AI that I was using ended up making some changes in the server.py and html part so I tweaked a few things and delegated to it to fix what was missing. Then I asked for a deploy script in Actions and that’s what I got!

ChefThi
  • feat: implement automatic HUD orbit and fallback demo mode for reviewers (bfab249)

Today I made the HUD way more robust and demo-friendly — exactly what reviewers need.

Implemented an automatic fallback system: if no real data arrives from the WebSocket for more than 2 seconds (webcam offline, backend delay, or during recording), the HUD smoothly switches to demo mode with a beautiful orbiting cursor animation.

What was added in static/index.html:

  • Data flow monitoring with lastDataTime and 1-second checks
  • startDemoMode() using sin/cos math to simulate natural cursor movement, periodic pinch_progress scans, and fixed 60 FPS
  • Seamless transition: real WebSocket messages instantly stop the demo and take over
  • Improved onopen/onmessage/onclose handlers with auto-reconnect + fallback

The tactical UI now stays alive and immersive 100% of the time — perfect for videos, quick demos, or when showing the project without perfect hardware.

Quick but focused session after classes. Seeing the cursor start orbiting smoothly when I paused the vision server felt like magic. No more awkward “wait, it froze” moments during recordings.

Combined with yesterday’s AI mocks and DEMO_MODE, OmniLab is now extremely easy to showcase. Reviewers can open the page and immediately see the full Iron Man experience without any setup pain.

ChefThi
  • feat: add demo mode and ai mocks (8ff8f22)

Demo Mode + AI Mocks: Zero-Dependency Showcase for my first ship

Today I added a full demo mode so OmniLab can run beautifully without a webcam or real Gemini API key — perfect for quick testing, recording timelapses, and showing the project to others.

What was implemented:

  • New DEMO_MODE flag in .env (true = mocks everything, false = production)
  • Cycling mock responses with realistic 0.6s simulated latency
  • Guarded Gemini client creation so it only initializes when needed
  • Added 4 hand gesture sample images in static/demo/ for visual consistency
  • /analyze endpoint now returns clean JSON with demo: true flag when in mock mode

The HUD and gesture pipeline stay exactly the same — you still see the tactical overlay, pulse effect, and “Deep Scan” flow, but everything is simulated and stable.
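
Roughly, the mock layer behaves like the sketch below; the variable names and mock texts are illustrative, and the real server guards the Gemini client in the same spirit so no API key is needed in demo mode:

```python
import asyncio
import itertools
import os

DEMO_MODE = os.getenv("DEMO_MODE", "false").lower() == "true"

# Cycling mock responses used when DEMO_MODE is on.
_MOCK_REPORTS = itertools.cycle([
    "Tactical scan: workspace detected, two objects of interest.",
    "Deep scan complete: gesture PINCH, confidence high.",
    "Environment stable. No anomalies detected.",
])

async def analyze_frame(image_b64: str) -> dict:
    """Return a mocked tactical report in demo mode; the real path would call Gemini."""
    if DEMO_MODE:
        await asyncio.sleep(0.6)  # realistic simulated latency
        return {"demo": True, "report": next(_MOCK_REPORTS)}
    raise NotImplementedError("production path: guarded Gemini client call goes here")
```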

After a long day of classes I wanted something that would let me record clean demos without fighting hardware. Turning DEMO_MODE on and seeing the mock responses flow perfectly into the HUD felt super satisfying. No more “sorry, needs webcam” excuses.

This makes OmniLab way more shareable and production-like. Combined with the recent Playwright stealth work, we’re getting closer to a full local agent that can demo real browser actions without any external dependency.

ChefThi
  • fix: stabilize vision-server bridge and synchronize Gemini 3.1 models (81ac80e)

OMNILAB // RESILIENCE & BROWSER HANDS 🛡️

Spent the last session bulletproofing the core architecture. I refactored the vision module to use a multi-threaded loop so it doesn't just die if the connection drops. Now it lowkey waits for the server to come back online automatically—no more manual restarts. I also standardized everything on Gemini 3.1 Flash Lite for that low-latency speed boost.

The big win was expanding the gesture engine. I implemented Swipe, Thumbs Up, and Fist recognition, and mapped them to actual browser actions using Playwright. Seeing the HUD trigger a stealth search or navigate tabs just by moving my hand was the ultimate vibe check. I also hunted down a sneaky MediaPipe indexing bug that was causing hard crashes during fast movements. The invisible interface is finally starting to execute real intent instead of just describing the scene.
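
The reconnect behaviour boils down to a loop like the sketch below. It's shown with asyncio and the `websockets` package for brevity, while the actual module is multi-threaded; the endpoint URI and payload helper are assumptions:

```python
import asyncio

import websockets

async def get_next_payload() -> str:
    """Hypothetical stand-in for the latest gesture/frame data from the vision loop."""
    await asyncio.sleep(1 / 30)
    return "{}"

async def vision_bridge(uri: str = "ws://localhost:8000/ws/vision") -> None:
    """Keep the vision→server link alive: reconnect automatically instead of dying."""
    while True:
        try:
            async with websockets.connect(uri) as ws:
                while True:
                    await ws.send(await get_next_payload())
        except (OSError, websockets.ConnectionClosed):
            # Server offline or restarting: back off and try again.
            await asyncio.sleep(2)
```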

ChefThi
  • fix: resolve variable scoping in vision bridge and enhance loop stability (d38cadc)
  • feat: implement tactical control panel and visual telemetry fixes (1dd8917)

Basically, I found errors on the panel. It wasn't appearing correctly before.

  • I used the Gemini CLI for a quick and simple fix in this part :)

P.S. I noticed that my recording wasn't saved. The Screenity extension got an error after I finished the video.

A visitor arrived at home just now, so I went to greet them.
-

ChefThi
  • fix: resolve playwright-stealth imports and fastapi validation errors (9d2ada5)

Playwright Stealth + FastAPI Validation Fixed: Browser Control Now Stable

Quick but important cleanup session today.

Fixed two blocking issues that were breaking the new browser automation layer:

  • Corrected playwright_stealth import and usage: switched from stealth_async to stealth so the browser launches with proper human-like fingerprints (anti-detection for Cloudflare, Google, etc.).
  • Enforced proper Pydantic validation on the /analyze endpoint: changed request: any to request: AnalyzeRequest (BaseModel with base64 image field). This prevents malformed payloads and makes the API more reliable when Gemini or voice triggers actions.
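
The validation fix is essentially this pattern (the field name is an assumption):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    image_base64: str  # base64-encoded JPEG frame from the HUD

@app.post("/analyze")
async def analyze(request: AnalyzeRequest):
    # FastAPI now validates the payload up front instead of crashing on `request: any`.
    return {"received_bytes": len(request.image_base64)}
```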

Also added the new libs for v0.3 (Playwright ecosystem + dependencies).

The pipeline is now much more solid: MediaPipe gesture/voice → Gemini analysis → execute_system_action → Playwright with stealth can open real tabs, navigate, and interact without immediate blocks.

Short focused session after classes. Seeing the stealth apply correctly and the FastAPI endpoint stop throwing validation errors felt like removing training wheels. No more random crashes when the HUD tries to trigger a browser action.

OmniLab is evolving from “cool HUD that describes frames” into a true local agent that can actually use the browser as part of my HOMES workflow. Next target: full BROWSER_ACTION handler with human-like delays and real task execution (e.g. open recipe site → extract ingredients → trigger HOMES-Engine). The invisible interface just got way more powerful.

ChefThi
  • feat: core evolution - gesture control, Gemini 3.1 Thinking Mode, and modular HUD (64057c1)
  • add new libs for the v0.3 (634a0a4)

OmniLab Devlog // v0.3 Checkpoint

Yo, just dropping a quick update on what’s been happening with OmniLab. The last two commits were lowkey a mess—honestly, they were just checkpoints to save where I was at, so they didn’t really work out of the box.

The Struggle (aka The Errors)

So, when I tried to actually run the code from the recent pushes, the system basically threw a tantrum.

  1. Import Drama: In server.py, I tried to pull in stealth_async from playwright_stealth, but it just wasn’t having it. Total ImportError. Had to swap it for the standard stealth function to get the browser agent to even start.
  2. FastAPI Tantrum: The /analyze route was broken because I used any as a type for the request. FastAPI is super picky about that, so it crashed with a FastAPIError. I had to bring back the proper Pydantic models to make it happy again.
  3. Browser Missing: Playwright was installed but the actual Chromium browser wasn’t there. Pro-tip: playwright install sometimes fails, so using python -m playwright install chromium is the way to go.

What’s Actually New

Even though it was bumpy, we got some cool stuff in:

  • Gemini 3.1 Thinking Mode: The brain is officially upgraded. It’s faster and actually “thinks” before it gives you the tactical report.
  • Pinch-to-Scan: This is the best part. You don’t have to yell at the mic anymore. Just hold a pinch gesture for 1.5s, the HUD ring scales down and changes color, and boom—it triggers a deep scan.
  • OmniBrowser Agent: We added Playwright so the HUD can lowkey browse the web for you. It’s not fully “Jarvis” level yet, but it can navigate and pull data in the background.
  • HUD v2.1: New tactical UI with a log console at the bottom and real-time FPS/latency tracking so you know the system isn’t lagging.
ChefThi
  • feat: evolve OmniLab into an active command center for HOMES ecosystem (dc3b7ba)

OmniLab becomes Active Command Center for HOMES 🔥 Gesture → Real Action

Today I took the biggest step yet: turning OmniLab from a passive scan tool into a true command center that can execute actions inside the HOMES ecosystem.

Major refactor in server.py:

  • WebSocket connections now use sets for true O(1) operations
  • Re-used the image caching + resize pipeline (MD5 dedup + 512×512 JPEG)
  • Added execute_system_action() handler with real examples:
    • “HOMES_EXECUTE_TASK” → placeholder to trigger Termux workers / video rendering
    • “BROWSER_NAV_NEXT” → pyautogui hotkey (Ctrl+Tab) as proof-of-concept
  • Broadcast logic cleaned up so vision → HUD communication stays rock-solid

The flow is now: Pinch gesture (or voice) → MediaPipe → Gemini analysis → action decision → execute locally or fire HOMES pipeline.
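
A condensed sketch of the handler, using the action names from this devlog (the HOMES webhook itself is still a placeholder):

```python
import pyautogui

async def execute_system_action(action: str, payload: dict | None = None) -> str:
    """Map an AI-decided action string to a local effect (sketch)."""
    if action == "HOMES_EXECUTE_TASK":
        # Placeholder: will later enqueue work for the Termux workers / video rendering.
        print("Executing HOMES_EXECUTE_TASK", payload)
        return "QUEUED"
    if action == "BROWSER_NAV_NEXT":
        # Proof-of-concept: jump to the next browser tab with a system hotkey.
        pyautogui.hotkey("ctrl", "tab")
        return "DONE"
    return "UNKNOWN_ACTION"
```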

It still needs the actual webhook to HOMES-Engine, but the architecture is solid and the HUD stays responsive.

After classes I went straight into a long refactoring session. Seeing the action handler print “Executing HOMES_EXECUTE_TASK” for the first time felt like JARVIS finally waking up. No more “just describe the frame” — now it can DO something.

OmniLab + HOMES together are starting to feel like a real personal AI operating system. Next: full voice + gesture synergy and actual integration with HOMES worker queue. The invisible interface is getting dangerous. 🤖⚡

I got a bit lost during this development, but we got improvements!

ChefThi
  • perf: implement image caching, O(1) connections, and asset optimization for sidequest v0.2 (1738e81)

I just finished a heavy optimization session to kill the lag in OmniLab. I sat down with my AI assistant to tear apart the bottlenecks, and we managed to turn this from a “cool prototype” into a high-performance local AI.

What we changed (The “Brain” Upgrade):
Smart Memory: The system now remembers what it just saw. Using Image Caching, it won’t waste time or API tokens re-analyzing the same frame if nothing has moved. It’s like giving the HUD a 30-second short-term memory.

Instant Connections: I swapped how the HUD tracks connections. By moving from “lists” to “sets,” the system now handles multiple data streams instantly, no matter how many are running.

Lightweight Assets: We automated an image-shrinking process. Before sending anything to the cloud, the HUD now compresses and resizes frames. This makes the data 84% lighter without losing the “vision” quality Gemini needs.
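
In code, the caching and shrinking step looks roughly like the sketch below, using Pillow and an in-memory dict; the helper names and the exact TTL are assumptions:

```python
import hashlib
import io
import time

from PIL import Image

_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL = 30.0  # seconds of "short-term memory"

def prepare_frame(jpeg_bytes: bytes) -> tuple[str, str | None, bytes | None]:
    """Return (digest, cached_report, shrunken_jpeg); reuses reports for unchanged frames."""
    digest = hashlib.md5(jpeg_bytes).hexdigest()
    hit = _cache.get(digest)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return digest, hit[1], None           # same frame seen recently: skip the API call
    img = Image.open(io.BytesIO(jpeg_bytes)).convert("RGB")
    img.thumbnail((512, 512))                 # shrink before sending anything to the cloud
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=60)
    return digest, None, buf.getvalue()       # caller sends this to Gemini, then caches

def remember(digest: str, report: str) -> None:
    _cache[digest] = (time.time(), report)
```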

The Numbers (Why this matters):
Speed: Response time dropped from 820ms to 540ms. It feels way snappier.

Efficiency: We went from 60 API calls per minute down to just 3 or 8. No more wasting tokens on duplicate images.

Stability: The HUD is buttery smooth now, even during heavy “Deep Scans.”

It started as a quick after-class session and turned into a solid grind. Seeing the latency numbers drop in real-time was incredibly satisfying.

ChefThi
  • feat: add HUD demo mode, scan pulse effect, and dynamic port binding (90ab4c8)
  • feat: core HUD improvements and repo cleanup (c82deea)
  • chore: ensure all private project files are untracked (2daea10)
  • feat: implement concurrent vision processing and HUD fail-safe systems (a954cfb)

Concurrent Vision + HUD Fail-Safes: Parallel Power Unlocked

Big day — I finally tackled the last major bottleneck: sequential scan delays.

Implemented concurrent vision processing so frame capture, MediaPipe analysis, WebSocket transmission and Gemini 3 Flash calls can run in parallel without blocking the main HUD thread. Added robust fail-safe systems (graceful degradation, timeout recovery, and fallback states) so the interface never freezes even if the LLM takes longer than expected.

What landed in this session:

  • Full concurrent pipeline using Python asyncio + ThreadPoolExecutor for vision tasks
  • HUD fail-safe layer with visual indicators when processing is happening in background
  • Minor core improvements and repo cleanup (removed private files from tracking)
  • Combined with yesterday’s demo mode, scan pulse effect and dynamic port binding — the whole system now feels way more stable and production-like
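
The core of the concurrency change fits in a few lines. A sketch under the assumptions above, where `analyze_fn` stands in for the blocking MediaPipe + Gemini call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

async def deep_scan(frame, analyze_fn, timeout: float = 8.0) -> str:
    """Run the blocking vision/LLM work off the event loop so the HUD never freezes."""
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(_executor, analyze_fn, frame),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        # Fail-safe: degrade gracefully instead of blocking the main HUD thread.
        return "SCAN_TIMEOUT"
```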

After classes I went straight into a long session. Seeing the scan pulse animate while the AI thinks in the background without any stutter… that’s the JARVIS moment I’ve been chasing.

FOR THIS DEVLOG, ONE THING HAPPENED: THE MODEL I HAD SET (gemini-3-flash-preview) WAS EXPERIENCING HIGH DEMAND, SO I SWITCHED TO THE 3.1-lite-preview.

ChefThi
  • feat: complete gesture-to-scan logic and HUD v2.1 tactical UI (8f09f3b)

Gesture-to-Scan Complete + Tactical HUD v2.1

Today I finally closed the loop on the most important interaction of OmniLab: turning a simple hand gesture into a full AI-powered scan.

The big challenge was making the flow feel instant and reliable. I refined the MediaPipe Tasks API logic so the Pinch gesture (held for 1.5s) now reliably captures the webcam frame, sends it through the local FastAPI pipeline, and triggers Gemini 3 Flash Vision without breaking the HUD.

What’s new in this push:

  • Improved state management so the system no longer queues scans sequentially — each Deep Scan now feels more independent
  • Small cleanups in api/v1, core, vision.py and utils for better maintainability

It’s still not 100% min-latency (Gemini still takes a moment to think), but the difference from last week is huge. The HUD now truly reacts to my hand like JARVIS would.

Late-night session after classes, but seeing the tactical report pop up instantly after the pinch hold made it all worth it. The invisible interface is getting closer every commit.

ChefThi
  • feat: implement tactile gesture activation and HUD v2.1 modular update (133994c)

Deep Scan & Tactical Gestures🖐️👁️

After a few days, I finally implemented the Deep Scan system. The challenge was: how to trigger an AI analysis without touching the keyboard?

I used MediaPipe to create a "Pinch" trigger. By holding the gesture for 1.5s (Tony Stark style calibrating the HUD), the system captures the frame and sends it to the Gemini 3 Flash brain. The result: I get a simple instant tactical report directly on the display, running with very low latency thanks to the new local architecture.
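
The hold-to-trigger logic is essentially a small timer state machine like this sketch (the distance threshold and timings are assumptions):

```python
import time

PINCH_THRESHOLD = 0.05   # normalized landmark distance considered "pinched"
HOLD_SECONDS = 1.5

class PinchTrigger:
    """Fire a deep scan only after the pinch has been held for 1.5 s (sketch)."""

    def __init__(self) -> None:
        self._since: float | None = None

    def update(self, pinch_distance: float) -> bool:
        now = time.monotonic()
        if pinch_distance < PINCH_THRESHOLD:
            if self._since is None:
                self._since = now            # pinch just started: start the timer
            elif now - self._since >= HOLD_SECONDS:
                self._since = None           # reset so the scan fires once per hold
                return True
        else:
            self._since = None               # hand opened: cancel the hold
        return False
```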

I liked all of this and thought these new updates were cool. The thing is, there's still a certain delay that I think makes the Scans trigger one after another, without getting the complete description of the first one.

OmniLab doesn't just show data now; it understands what I see. 🎧🔥

ChefThi
  • Refactor project details and shipping status (79c8e53)
  • feat: HUD v2 evolution with TTS, real-time diagnostics, and Gemini 3 Thinking Mode (674e279)

🚀 The HUD Just Leveled Up

The gap between thought and execution is getting smaller. I’ve just pushed a massive round of updates to the interface, bringing that “Stark Tech” vibe closer to reality.

What’s New:

Clean Decoupling (ada_v2 style): I moved the entire HUD interface to a static/ directory. By separating the Three.js frontend from the FastAPI backend, I can now tweak the UI instantly without touching the server logic.

Gemini 3 Thinking Mode: Deep reasoning is now live. When you trigger an analysis, the HUD displays DEEP SCANNING… while Gemini grinds through the image metadata to deliver a high-precision report.

J.A.R.V.I.S. Talk-Back: The HUD finally has a voice. Using the Web Speech API, the system now talks back during scans, making the whole experience feel way more immersive.

Real-Time Diagnostics: I added a telemetry overlay to monitor FPS and latency. It’s essential for keeping everything buttery smooth on my local Debian 13 setup.

Pinch-to-Lock Gestures: The “Pinch” gesture now locks the cursor and toggles system states, allowing for much tighter physical interaction with the 3D interface.

The “invisible interface” is finally starting to become real.

ChefThi

Turning OmniLab into a real HUD assistant: voice, vision and a more proactive AI persona 🎧🖐️

OmniLab has been my experimental lab for interfaces: 3D HUD, hand‑tracking, voice input, and AI all living in the same space. At the same time, life got busier: I started Computer Engineering, the campus is ~10 km away, and I’ve been splitting my time between classes, Blueprint hardware projects, and these software labs. That’s why commits came in bursts instead of daily drips — most of the work happened in small, tired, late‑night sessions.
Earlier this year I refactored the architecture to favor local‑first vision (removing a cloud version that was too high‑latency) and added the Web Speech API to the HUD, so I could trigger Gemini analyses via voice while the system tracked my hands in real time. That was the turning point: OmniLab stopped being “just a cool 3D scene” and started behaving like a genuine interface between my body, my voice and an AI brain.
Recently I pushed a big “SHIP‑ready” upgrade: Gemini integration is now first‑class, tests and CI/CD are in place, and the HUD feels more stable as a product, not just a demo. On top of that, I refined the AI persona: instead of only answering direct questions, OmniLab now makes proactive observations about what it sees and hears — it can comment on the scene, suggest next actions, and feel more like a lab partner than a tool.

Most of this evolution happened while juggling buses, deadlines and other projects, with Perplexity helping me reason about trade‑offs (what to keep in 3D, what to simplify, where AI actually adds value). This devlog is my way of catching the Flavortown timeline up with the reality: OmniLab grew quietly, but it grew a lot. ✨

ChefThi

OmniLab Devlog #1

I’ve officially kicked off OmniLab on my first laptop! Coming from a background of mobile development and browser-based IDEs, my first instinct was to keep everything “off-device”. I spent a good chunk of these 5 hours attempting to run the processing stack on a remote VM (Firebase Studio) and tunneling the HUD via a web page. However, the latency was unbearable for real-time tracking. I quickly realized that for a “Jarvis-like” experience, the vision loop must be 100% local.

Technical Hurdles & Git Mess

The first challenge was MediaPipe. I started with legacy code, but it wouldn’t play nice. I had to dive into the latest MediaPipe Tasks API docs to rewrite the landmark detection core. It’s much more efficient now, but the documentation shift caught me off guard.

Since I was jumping between cloud editing and local testing without properly cloning the repo first, I ended up with a mess of Git conflicts. I used the Gemini CLI as a mentor to help me untangle the branches, resolve the “already exists” errors, and get the local and remote repositories back in sync. It was a great lesson in maintaining a clean workflow on a new machine.

Current Progress

I’ve successfully implemented the “pinch” gesture logic (calculating the hypotenuse between thumb and index) and set up a local FastAPI server to bridge vision data to a Three.js HUD. The HUD now runs locally on Debian 13 (XFCE), which eliminated all the lag from my previous VM tests.
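
For reference, that "hypotenuse" check reduces to the distance between two landmarks from the MediaPipe Tasks HandLandmarker result. A minimal sketch:

```python
import math

# MediaPipe hand landmark indices for the thumb tip and index fingertip.
THUMB_TIP, INDEX_TIP = 4, 8

def pinch_distance(landmarks) -> float:
    """Euclidean distance between thumb tip and index tip.

    `landmarks` is the normalized landmark list for one hand from a
    MediaPipe Tasks HandLandmarker result.
    """
    t, i = landmarks[THUMB_TIP], landmarks[INDEX_TIP]
    return math.hypot(t.x - i.x, t.y - i.y)
```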

Timelapses


Comments

ChefThi · about 2 months ago

To clarify the technical choices: I’m focusing heavily on keeping the HUD lightweight on my new machine by using Debian 13 (XFCE) and optimizing the Python vision loop. I’m also studying the ada_v2 repository to implement better modularity in the UI layer. Integrating these clean interface concepts into a zero-latency environment is the main goal for the next update.