Whewww, 23h 12m logged! Truth be told, I didn’t believe I would make STATERA presentable before flavortown ended (less than 2 days left, yikes), but here we are, with a single devlog covering all 23h 12m no less!
So what is this massive project? STATERA (Spatio-Temporal Analysis of Tensor Embeddings for Rigid-body Asymmetry) is basically a zero-shot physics engine inside a neural network. It watches an opaque box tumble through the air and pinpoints where its hidden internal center of mass (CoM) is, from nothing but the raw video.
What have I been doing for 23 hours? Fighting physics and neural networks. I generated 50,000 simulated tumbling videos to train on my single RTX 5070 Ti, with a frozen V-JEPA vision model as the backbone. Since the backbone is frozen, I had to build a custom 2.5D decoder on top of it, plus a mathematical extraction pipeline in OpenCV, just to track the physics accurately. (And yes, obviously the training time isn’t counted in the hours of work; hackatime only sends heartbeats while you’re actively coding… the cumulative training time across all the ablations is well over 2 weeks!)
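If you’re curious what the frozen-backbone-plus-decoder setup roughly looks like, here’s a minimal sketch. To be clear: the shapes, the stand-in transformer backbone, and the 2-channel “2.5D” head (one heatmap logit plus one relative depth per patch) are all illustrative placeholders, not my actual code:

```python
import torch
import torch.nn as nn

class CoMDecoder(nn.Module):
    """Trainable head on top of frozen patch embeddings.

    Illustrative "2.5D" output: one heatmap logit and one relative
    depth value per spatial patch (dim/grid are placeholder sizes).
    """
    def __init__(self, dim=1024, grid=16):
        super().__init__()
        self.grid = grid
        self.head = nn.Sequential(
            nn.Linear(dim, 256),
            nn.GELU(),
            nn.Linear(256, 2),  # [heatmap logit, relative depth]
        )

    def forward(self, tokens):                    # tokens: (B, N, D)
        out = self.head(tokens)                   # (B, N, 2)
        heat = out[..., 0].view(-1, self.grid, self.grid)
        depth = out[..., 1].view(-1, self.grid, self.grid)
        return heat, depth

# Stand-in for the real frozen V-JEPA checkpoint (hypothetical):
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=2,
)
backbone.eval()
for p in backbone.parameters():     # freeze: only the decoder trains
    p.requires_grad_(False)

decoder = CoMDecoder()
opt = torch.optim.AdamW(decoder.parameters(), lr=3e-4)

# Sanity check with fake patch tokens (16x16 grid -> 256 tokens):
heat, depth = decoder(backbone(torch.randn(1, 256, 1024)))
```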
The biggest challenge yet? The model actually overfit to gravity! It learned a “Settling-State Bias”: it realized heavy things usually end up facing the floor when they land in the simulation. So when I tested it in the real world in my bedroom and the box bounced with the heavy side UP, the network’s prediction literally split in two, torn between the visual bounce and its gravity bias [I call this Visual-Kinematic Aliasing for now]. Fixing it by changing the training target to a phase-agnostic Gaussian dot took some time, but nothing compared to the time it took to come up with the previous 50K curriculum that failed with the bias: it used a crescent target that forced the model to learn angular orientation, which accidentally taught it the gravity trick!
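For reference, the phase-agnostic target is just an isotropic Gaussian rendered at the projected CoM, so unlike the crescent it carries zero orientation information. A minimal sketch (the grid size and sigma here are made-up values, not my training config):

```python
import numpy as np

def gaussian_dot_target(cx, cy, size=64, sigma=2.0):
    """Isotropic Gaussian centered at the projected CoM (cx, cy).

    Rotationally symmetric, so there's no face/orientation signal
    for the decoder to latch onto -- the gravity shortcut that the
    crescent target accidentally taught is gone.
    """
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.max()  # peak-normalized heatmap target
```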
What’s next? The final 50K model finishes its run tonight around 9pm. While that wraps up, I’m going to build a slick local web UI for the demo so you guys can just upload a video and watch the physics tracking happen live. See you in the next devlog! [The attached video shows the last 50K model splitting its prediction as it fights internally between the face it thinks is correct from observation and the face-down one it learned to be correct most of the time. The crosshair is the actual ground truth from my tracking, and you can see where the model thinks the CoM is from the heatmap density! I have 76 such videos as my real-world validation test. Also, did I mention I’ll be releasing a 50K benchmark for anyone who wants to test against my baseline? I also believe the accuracy will scale beautifully with more compute, which I simply don’t have right now (fun fact: each 50K run takes around 4.5 days and a 1K run around 6-9 hours on my 5070 Ti; also, the 50K dataset alone is over 200GB!)]
(Also, in the first video it splits and the wrong face wins, while in the second the correct one is selected. You might think it’s just guessing which face and is bound to get it right 50% of the time, but the other 74 runs not shared here make me believe otherwise…)
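And for anyone wondering how a single point estimate comes out of that heatmap density: a soft-argmax (probability-weighted average over pixel coordinates) is the standard trick, sketched below. Treat it as illustrative rather than my exact extraction code; note that when the prediction goes bimodal like in the videos, the mean gets pulled between the two modes, which is exactly why the split is so visible:

```python
import numpy as np

def soft_argmax_2d(heat):
    """Collapse a heatmap of logits into an (x, y) point estimate.

    Softmax over the grid, then take the expected pixel coordinate.
    A bimodal heatmap (the 'split' in the videos) pulls this mean
    toward the midpoint of the two modes.
    """
    p = np.exp(heat - heat.max())      # numerically stable softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:heat.shape[0], 0:heat.shape[1]]
    return float((xs * p).sum()), float((ys * p).sum())
```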