
Seer

8 devlogs
17h 51m 8s

Ever wished you could zoom in? What if you could zoom in so much you would want it to stop? Well, here you go.

This project uses AI

I used AI to help me with some new libraries. It did not write any of the code in my codebase; it simply taught me the syntax for those specific libraries.

Repository


albert

Ok, so there are a few things that I tried doing. Even after making the switch to the Grounding DINO model, which perfectly predicted the bboxes of each object, the 3D projections still look like ahhh.

I know the issue is somewhere in the raycasting code, but I can’t lie ts frying me already. I tried making a few changes, but literally nothing worked. So now, I’m gonna switch to manual annotation. It’ll be more tedious than before, but at least it’ll work (hopefully).

Changelog

Attachment
albert

After looking at the projections of the Gemma-made bboxes in the environment, I was not having a good time. So, I decided to switch to a second model for bboxes while keeping object detection on Gemma.

I used IDEA-Research’s Grounding DINO model, which, given an object name, estimates a bounding box around it. I took the object names from Gemma’s output and fed them into the detector, creating this beautiful JSON you see below. Haven’t run the raycasting algorithm on them yet.
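For anyone curious, the Grounding DINO call plus the JSON packing can be sketched roughly like this, assuming the Hugging Face `transformers` zero-shot-detection API. The function names and the `frame`/`objects` record schema are placeholders I made up, not Seer's actual code:

```python
# Sketch: query Grounding DINO with Gemma's object names, then pack the
# detections into a JSON-ready record per frame.

def detect_boxes(image, object_names):
    """Run IDEA-Research's Grounding DINO on one PIL image.
    Needs `transformers` + `torch` and downloads weights on first use."""
    import torch
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

    model_id = "IDEA-Research/grounding-dino-tiny"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    # Grounding DINO expects lowercase phrases separated by periods.
    text = ". ".join(n.lower() for n in object_names) + "."
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids,
        box_threshold=0.35, text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]
    return results["labels"], results["boxes"].tolist(), results["scores"].tolist()

def to_frame_record(frame_name, labels, boxes, scores, min_score=0.3):
    """Pack one frame's detections into a JSON-ready dict (assumed schema)."""
    objects = [
        {"name": lab, "bbox": [round(c, 1) for c in box], "score": round(s, 3)}
        for lab, box, s in zip(labels, boxes, scores)
        if s >= min_score
    ]
    return {"frame": frame_name, "objects": objects}
```

Dumping one `to_frame_record` dict per image into a list and `json.dump`-ing it would give a file shaped like the one in the attachment.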

Changelog

Attachment
albert

Once I had the bounding boxes from Gemma, I only had to build the raycasting pipeline to project each 2D bounding box onto a 3D plane. The first thing I did was calculate camera poses using COLMAP; then I used those poses to project rays through each image, recording all the data in dictionaries.

I actually managed to learn a whole lot about looping through dictionaries and their intrinsic qualities, which was really fun. The only drawback is that the time complexity is not favorable: O(n^2), where n is the number of frames/images.
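The pose-to-ray step can be sketched in plain Python like this, assuming COLMAP's convention that R and t map world coordinates into the camera frame (so the camera center is -Rᵀt). Names are illustrative, not the project's actual code:

```python
# Sketch: back-project a pixel (e.g. a bbox center) to a world-space ray,
# given intrinsics K and a COLMAP world-to-camera pose (R, t).

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]

def pixel_ray(u, v, K, R, t):
    """Return (origin, direction) of the world-space ray through pixel (u, v)."""
    fx, cx = K[0][0], K[0][2]
    fy, cy = K[1][1], K[1][2]
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]   # K^-1 [u, v, 1]
    Rt = transpose(R)
    direction = mat_vec(Rt, d_cam)                 # rotate into the world frame
    origin = [-x for x in mat_vec(Rt, t)]          # camera center C = -R^T t
    norm = sum(x * x for x in direction) ** 0.5
    return origin, [x / norm for x in direction]
```

Sanity check: with an identity pose, the principal point should shoot a ray straight down the +z axis from the origin.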

The only bug to fix is in the COLMAP reconstruction, where the initial pair of images fails to register. This is probably because there’s too little difference between the camera poses in consecutive frames. I’m not sure yet if that’s actually the cause, or how to fix it.


Changelog

Attachment
albert

Switched fully to Gemma for the frame predictions cause it’s lowkey WAY more accurate with its text output. Also gave it context in the prompt from past frames so that it predicts things across frames more consistently.
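The cross-frame context trick boils down to replaying the model's own recent JSON back into the prompt. A minimal sketch; the prompt wording and the window size are my own guesses, not the actual Seer prompt:

```python
# Sketch: build a prompt for the current frame that includes the JSON the
# model produced for the last few frames, so object names stay consistent.
import json

def build_prompt(frame_name, history, window=3):
    """history: list of (frame_name, parsed_json) pairs from earlier frames."""
    context = "\n".join(
        f"{name}: {json.dumps(parsed)}" for name, parsed in history[-window:]
    )
    return (
        "You label objects in video frames and answer as JSON.\n"
        f"Previous frames:\n{context or '(none yet)'}\n"
        "Reuse the same object names when the object is the same one.\n"
        f"Now label {frame_name}."
    )
```

Appending each parsed response to `history` before the next call keeps the window rolling across all 450 frames.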

Below is an image of the JSON file it outputs for all 450 images.

Changelog

Attachment
albert

I have changed the prompt to estimate bounding boxes for each object in the 2D image. Now I have to project those into the 3D environment and make the bounding boxes actually three-dimensional. To do this, I will use raycasting and camera poses to estimate the distance to each object and therefore its position in 3D space.
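One standard way to get that distance is to cast a ray through the bbox center in two different frames and take the midpoint where the rays pass closest to each other. This is a generic triangulation sketch, not necessarily the exact method Seer ends up using:

```python
# Sketch: midpoint-of-closest-approach between two rays o1 + s*d1 and
# o2 + t*d2, as a cheap triangulation of an object's 3D position.

def closest_point(o1, d1, o2, d2):
    """Return the midpoint of closest approach, or None for parallel rays."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w0 = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:          # parallel rays: no unique answer
        return None
    s = (b * e - c * d) / denom     # parameter along ray 1
    t = (a * e - b * d) / denom     # parameter along ray 2
    p1 = [o + s * x for o, x in zip(o1, d1)]
    p2 = [o + t * x for o, x in zip(o2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]
```

Doing this for every pair of frames that sees the same object is also where an O(n²) loop over frames naturally shows up.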

I’m also thinking of switching to Gemma instead of Nemotron, since Gemma is not just a VLM but fully multimodal, so it will often have better text/JSON outputs.

Changelog

Attachment
albert

On the movement and environment side of things I added the fabled zoom mechanic (it was really easy) and some speed control.

The most important new thing is that I am now using NVIDIA’s Nemotron through OpenRouter to analyze images and output the object each one most likely shows and the materials it’s made of.

I initially used Qwen, but it js wasn’t precise enough for my use case.
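OpenRouter speaks the OpenAI-compatible chat schema, so sending a frame looks roughly like this. The model slug and prompt text here are placeholders (check openrouter.ai/models for the real Nemotron id); only the payload shape is the documented API:

```python
# Sketch: build an OpenRouter chat-completions payload that attaches one
# image frame as a base64 data URL.
import base64

def frame_payload(image_bytes, model):
    """model: an OpenRouter model slug, e.g. 'vendor/model' (assumed)."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Name the main object and the materials it is made of, as JSON."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

# This dict gets POSTed as JSON to https://openrouter.ai/api/v1/chat/completions
# with an "Authorization: Bearer <key>" header.
```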

Next up, I’ll have to find a way to create bounding boxes around specific objects in the image so that I know where each material starts and stops. After that comes the actually fun part: making the lookup table that links each material to a specific molecule.
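The lookup table itself can be as simple as a dict; these example entries are my own placeholders, not Seer's actual table:

```python
# Sketch: material -> representative molecule lookup (placeholder entries).
MATERIAL_TO_MOLECULE = {
    "wood": "cellulose",
    "plastic": "polyethylene",
    "glass": "silicon dioxide",
}

def molecule_for(material):
    """Case-insensitive lookup with a fallback for unknown materials."""
    return MATERIAL_TO_MOLECULE.get(material.lower(), "unknown")
```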

Changelog

Attachment
albert

Good news! I was able to load the OBJ into Panda3D and create some basic movement capabilities, including WASD and mouse movement. The only things I have to do before I get to the physics are adding zoom and maybe re-recording the video (since it’s kinda low quality, which messes up how the 3D scan looks).
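The WASD part reduces to mapping held keys to a per-frame displacement; here is that logic separated out from Panda3D itself (key names and the speed constant are my own assumptions):

```python
# Sketch: turn the set of currently-held keys into a displacement for
# this frame. The engine-specific wiring is left out on purpose.

def move_delta(held, dt, speed=5.0):
    """held: set of pressed key names; dt: seconds since last frame.
    Returns (dx, dy) in world units; opposing keys cancel out."""
    dx = ("d" in held) - ("a" in held)   # strafe: bools subtract to -1/0/1
    dy = ("w" in held) - ("s" in held)   # forward/back
    return dx * speed * dt, dy * speed * dt
```

In Panda3D you would keep `held` up to date with `accept("w", ...)` / `accept("w-up", ...)` handlers and call something like this from a `taskMgr` task each frame.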

Changelog

Attachment
albert

So I made the pipeline that turns videos into .obj files through the intermediary .usdz (Apple’s proprietary 3D file format).
I first had to record the video from many different angles, then turn it into a folder of images sampled every two frames, and then use some SwiftUI code to run it through Apple’s RealityConverter.
Then I decided to use Aspose.3D to convert to OBJ, but ran into some coloring issues: since the USDZ is really a zip archive, I had to extract the texture images from it and hook them up to the OBJ.

The result is actually not too far from what it really is!

Attachment