
STATERA

3 devlogs
26h 49m 55s

STATERA is a research framework that aims to extract the hidden Center of Mass (CoM) of opaque bodies from raw video using a V-JEPA backbone.

This project uses AI

Used Gemini 3.1 Pro Preview for most of the boilerplate, initial implementation, learning academic standards and README cleanup.

Demo Repository


animesh_varma

woooo, I forgot to write a devlog, guess it will show up after my ship… Oh well

anyway, so flavortown just ended and I managed to slip in my ship exactly 2 min before it did…
I am so filled with adrenaline now I can’t express ittt

also, the worst possible thing happened right before I was about to ship (~15 mins remaining to the deadline): I realized I hadn’t cleared the 15-vote deficit from my last ship of another project and had to provide 15 comprehensive reviews in less than 15 mins. To say it was a trip through hell would be an understatement (but don’t burn me, I still provided valid feedback to people, just mostly positive as I had no time to critique their work…), and that is the story of my very weird ship! ;(

anyway, as I said in my ship, the project itself is pretty easy to use; if you don’t understand something, please feel free to mail me. Now, with that out of the way, attached are some screenshots from the demo script, enjoy!!!

[also, for some context, the video pre-processing is required because V-JEPA is pretty strict about what it takes in and how long it should be to yield optimal results. The bounding box for the frames is resizable; if you increase its size, the extracted region is downsampled so V-JEPA can still ingest it…]
[can you tell I am burned out, is my speech coherent? NVM, please refer to the readme if you need to understand it better… sorry again…]
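For the curious, that pre-processing boils down to something like this minimal numpy sketch (the real pipeline lives in the repo and uses OpenCV; the 16-frame / 224×224 numbers are my guess at typical ViT-video input sizes, not necessarily STATERA’s exact ones):

```python
import numpy as np

def preprocess_clip(frames, box, num_frames=16, size=224):
    """Crop each frame to the bounding box, nearest-neighbour resize to
    size x size, and uniformly sample num_frames frames.

    frames: (T, H, W, 3) uint8 array; box: (x, y, w, h).
    """
    x, y, w, h = box
    # Uniformly sample temporal indices so clips of any length fit the model.
    idx = np.linspace(0, len(frames) - 1, num_frames).round().astype(int)
    clip = frames[idx, y:y + h, x:x + w]
    # Nearest-neighbour spatial downsample: a larger box just means a
    # coarser sampling grid, which is why the crop size can be flexible.
    ys = np.linspace(0, h - 1, size).round().astype(int)
    xs = np.linspace(0, w - 1, size).round().astype(int)
    return clip[:, ys][:, :, xs]

# Example: a fake 40-frame 480x640 video, cropped to a 300x300 box.
video = np.zeros((40, 480, 640, 3), dtype=np.uint8)
out = preprocess_clip(video, box=(100, 50, 300, 300))
print(out.shape)  # (16, 224, 224, 3)
```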

[Attachments: 6 demo screenshots]

Comments

animesh_varma 24 minutes ago

Clarification on usability! You might notice there is no live website link, and that is simply because STATERA is a heavy framework that needs 8GB+ of VRAM just to run a single forward pass. Since I don’t have the budget for dedicated cloud GPU hosting right now, I made it as easy as possible to run locally with an automated setup script. If you don’t have the hardware to run it yourself, no worries at all: I put full video demos right at the top of the README so you can see the UI and physics tracking in action without downloading a thing!

(P.S. If you want any specific video tested, please feel free to email it to me and I will provide you with the heatmap and the exact extracted metrics!)

animesh_varma

Shipped this project!

Woooooooooo, I am so very short on time, only ~2 mins till flavortown ends, so I am very very sorry that I can’t explain anything. Please just look at the top of my README and my devlogs and you shall understand what it is easily enough; there are videos in the README if you want to see it in action without downloading anything. See ya, thankyouuuu!!!

and if you want to use it:
Just clone the repo on a Linux (preferably Arch Linux) or macOS machine and run setup.py. It’ll ask you what you want to do; reply 2 to test out the demo. It will download everything required, and then you can run demo_app in the demo subfolder to try out the model!
[Also, on macOS the webpage won’t load in Safari, IDK why but it is what it is. And you will need a decently sized dedicated GPU for the forward pass; I assume ~8GB of VRAM shall do the trick. Any MacBook or Mac mini with Apple silicon should work… at least I hope so…]

AAAAA, 1 minute left!!!!!

animesh_varma

Whewww, the final night before the deadline!

I just finished running the absolute final physics metrics on the new 50K-sigma model I spent the last 4 days training. And guess what? It turned into a complete coward!

Let me explain… To fix the “gravity bias” I talked about in the last log, I trained a new model that completely ignored angular orientation and used a standard sigma target, not my (potentially) novel crescent curriculum. It turns out that by removing that penalty, the network just took the easy way out: it parked its prediction right in the geometric center of the box and refused to move. It played it way too safe to actually track the physics. Also, while the average prediction sat in the center, the heatmap covered the entire object edge to edge, making the prediction pointless (img attached).

But here is the crazy part: when I ran a custom “Physics Capture” script to evaluate the vector distances, my original 50K Crescent model emerged as the undisputed champion. Despite the occasional bimodal split, it actually pushed its prediction away from the geometric center and consistently tracked the true hidden mass. It earned its accuracy through real, physically grounded directionality, not just guessing the middle or flailing wildly like the baselines! [Even though the 1K crescent and the ResNet baseline beat it there, they suffered very badly on other metrics, making the 50K crescent the only one to score “well enough” on all of them, and of course its predictions look so cool and dynamic :)]
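The actual script is in the repo, but the core idea of the evaluation can be sketched like this (function names and the cosine “directionality” score are illustrative, not the exact metrics I computed):

```python
import numpy as np

def heatmap_centroid(hm):
    """Density-weighted mean of a 2D heatmap -> (x, y) prediction."""
    ys, xs = np.mgrid[0:hm.shape[0], 0:hm.shape[1]]
    total = hm.sum()
    return np.array([(xs * hm).sum() / total, (ys * hm).sum() / total])

def physics_capture(heatmaps, gts, box_center):
    """Per-frame error distance plus a directionality score: cosine
    between (pred - center) and (gt - center). A center-parked model
    scores ~0 on directionality even if its raw distance looks OK."""
    dists, cosines = [], []
    for hm, gt in zip(heatmaps, gts):
        pred = heatmap_centroid(hm)
        dists.append(np.linalg.norm(pred - gt))
        u, v = pred - box_center, gt - box_center
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        cosines.append((u @ v) / denom if denom > 1e-6 else 0.0)
    return float(np.mean(dists)), float(np.mean(cosines))

# Toy check: a single sharp heatmap exactly on the ground truth.
hm = np.zeros((224, 224)); hm[100, 150] = 1.0
d, c = physics_capture([hm], [np.array([150.0, 100.0])],
                       np.array([112.0, 112.0]))
print(round(d, 4), round(c, 4))  # 0.0 1.0
```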

So, the original 50K Crescent model is officially the SOTA (for now at least; I don’t know for sure what I’ll uncover when writing the final paper)! I am spending tonight doing a massive cleanup of my GitHub repo [deleting gigabytes of local logs, fixing hardcoded paths, and backing up all the weights and datasets]. I’ll whip up the final README tomorrow morning and hit that SHIP button right before the Flavortown deadline (I am in IST, so adjust that accordingly).

See you on the other side!
[again the attached img is of the sigma 50K model just guessing the entire box just to be safe…]

[Attachment: heatmap image]
animesh_varma

Whewww, 23h 12m logged! Truth be told, I didn’t believe I would make STATERA presentable in time before flavortown ended (less than 2 days left, yikes), but here we are, with a single devlog for 23h 12m no less (as I said, I never expected it to be completed in time)!

So what is this massive project? STATERA (Spatio-Temporal Analysis of Tensor Embeddings for Rigid-body Asymmetry) is basically a zero-shot physics engine inside a neural network. It watches an opaque box tumble in the air and mathematically calculates exactly where its hidden internal center of mass (CoM) is, just from the raw video.

What have I been doing for 23 hours? Fighting physics and neural networks. I generated 50,000 simulated tumbling videos to train a frozen V-JEPA vision model on my single RTX 5070 Ti. I had to build a custom 2.5D decoder and a mathematical extraction pipeline using OpenCV just to track the physics accurately. (And yes, obviously the training time is not counted in the hours of work; Hackatime sends heartbeats only when you’re actively coding… the cumulative training time including all the ablations is well over 2 weeks!)
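If the “frozen backbone + trainable decoder head” pattern sounds abstract, here is a minimal PyTorch sketch of the shape of the idea. Everything here is made up for illustration (token dimensions, layer sizes, the name CoMDecoder); the real model consumes V-JEPA spatio-temporal tokens through a 2.5D head, which this toy does not attempt to reproduce:

```python
import torch
import torch.nn as nn

class CoMDecoder(nn.Module):
    """Illustrative heatmap head sitting on top of frozen backbone tokens."""
    def __init__(self, token_dim=1024, tokens_per_side=16):
        super().__init__()
        self.n = tokens_per_side
        # Project each patch token to a scalar logit, then upsample the
        # resulting coarse grid into a dense 2D probability heatmap.
        self.head = nn.Sequential(nn.Linear(token_dim, 256), nn.GELU(),
                                  nn.Linear(256, 1))
        self.up = nn.Upsample(size=(224, 224), mode="bilinear",
                              align_corners=False)

    def forward(self, tokens):            # tokens: (B, n*n, token_dim)
        logits = self.head(tokens)        # (B, n*n, 1)
        grid = logits.view(-1, 1, self.n, self.n)
        hm = self.up(grid).squeeze(1)     # (B, 224, 224)
        # Softmax over all pixels -> each heatmap sums to 1.
        return torch.softmax(hm.flatten(1), dim=1).view_as(hm)

with torch.no_grad():
    tokens = torch.randn(2, 256, 1024)   # stand-in for frozen-backbone output
    hm = CoMDecoder()(tokens)
    print(hm.shape, float(hm[0].sum()))  # torch.Size([2, 224, 224]), sum ≈ 1.0
```

The backbone stays frozen, so only this small head trains, which is what makes a single-GPU setup feasible.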

The biggest challenge yet? The model actually overfit to gravity! It learned a “Settling-State Bias”: it realized heavy things usually face the floor when they land in the simulation. So when I tested it in the real world in my bedroom and the box bounced with the heavy side UP, the network’s brain literally split in half trying to decide between the visual bounce and its gravity bias [I call this Visual-Kinematic Aliasing for now]. Fixing this by changing the training curriculum to a phase-agnostic Gaussian dot took some time (though nothing compared to the time it took to come up with the previous 50K curriculum, which failed with the bias: it used a crescent target that forced the model to learn angular orientation, which accidentally taught it the gravity trick)!
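For context, a “phase-agnostic Gaussian dot” target looks roughly like this (size and sigma here are illustrative, not the values I actually trained with):

```python
import numpy as np

def gaussian_target(cx, cy, size=224, sigma=8.0):
    """Isotropic Gaussian dot at the projected CoM. Unlike a crescent
    tied to angular orientation, it carries no information about which
    face points down, so the model cannot shortcut via gravity."""
    ys, xs = np.mgrid[0:size, 0:size]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return g / g.sum()   # normalize into a proper distribution

t = gaussian_target(150, 100)
print(t.shape, np.unravel_index(t.argmax(), t.shape))  # (224, 224) (100, 150)
```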

What’s next? The final 50K model is finishing its run tonight around 9pm. While that finishes, I am going to build a slick local web UI for the demo so you guys can just upload a video and watch the physics tracking happen live. See you in the next devlog! [The attached video shows the last 50K model splitting up its prediction as it fights internally between the face it thinks is correct from observation and the one facing down, which it learned to be correct most of the time. The crosshair is the actual ground truth from my tracking, and you can see where the model thinks the CoM is from the heatmap density! I have 76 such videos as my real-world validation test. Also, did I mention I will be releasing a 50K benchmark for other people who want to test against my baseline? I also believe the accuracy will scale beautifully with more compute, which I simply don’t have right now (fun fact: each 50K run takes around 4.5 days to complete and a 1K run around 6-9 hours on my 5070 Ti; also, the 50K dataset alone is over 200GB!)]
(Also, in the first video the prediction splits and the wrong face wins, while in the second the correct one is selected. You might think it is just guessing which face and is bound to get it right 50% of the time, but the other 74 runs not shared here make me believe otherwise…)
