How Ironworld works
Any video gets reconstructed in 3D by Pi 3X; OWLv2 finds common objects in each frame; SigLIP embeds point clusters so abstract queries like "the dirtiest area" can be matched against them; and Gemma 4 orchestrates the whole loop through a small toolkit of geometric operations.
```mermaid
flowchart LR
    V["User video<br/>any handheld pan"]
    subgraph ENV ["3D ENVIRONMENT · scene memory"]
        direction TB
        P["Pi 3X<br/>feed-forward 3D reconstruction"]
        O["OWLv2<br/>common-object detection"]
        S["SigLIP<br/>open-vocab cluster embeddings"]
    end
    G["Gemma 4<br/>agentic orchestrator"]
    Q["User question<br/>'Where is the dirtiest area?'"]
    A["Grounded answer<br/>3D region + source frame + bbox"]
    V --> ENV
    Q --> G
    G <-. tool calls .-> ENV
    G --> A
```
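Under the hood this is a plain tool-use loop: the model either calls a geometric tool against the scene memory or commits to a grounded answer. A minimal sketch in Python; the `Step` shape, the tool registry, and the `next_action` wrapper around Gemma 4 are all hypothetical stand-ins, since the post only says the orchestrator drives a small toolkit of geometric operations:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    kind: str                     # "tool" or "answer"
    tool: str = ""
    args: dict = field(default_factory=dict)
    payload: Any = None           # the grounded answer, when kind == "answer"

def answer_query(question: str,
                 tools: dict[str, Callable[..., Any]],
                 next_action: Callable[[list], Step]) -> Any:
    """Generic agentic loop: next_action (a stand-in for Gemma 4)
    proposes tool calls against the scene memory until it commits
    to a grounded answer (3D region + source frame + bbox)."""
    history: list[dict] = [{"role": "user", "content": question}]
    while True:
        step = next_action(history)
        if step.kind == "answer":
            return step.payload
        result = tools[step.tool](**step.args)    # e.g. project a region,
        history.append({"role": "tool",           # crop a frame, measure
                        "name": step.tool,        # distances in 3D
                        "content": result})
```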
The math, in plain English
When OWLv2's detections come back, each 3D point gets a vote from every detection box it projects into. A point only becomes confident in a label when multiple frames agree.
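A minimal sketch of that voting scheme, assuming detections arrive as (frame, label, box) tuples and that the reconstruction supplies a `project` function mapping a 3D point into a frame's pixels; the names and the `min_frames` threshold are illustrative, not Ironworld's actual values:

```python
from collections import defaultdict

def vote_labels(points, detections, project, min_frames=3):
    """Per-point label voting across frames.

    points:     list of 3D points, indexed by position
    detections: list of (frame_id, label, box), box = (x0, y0, x1, y1)
    project:    project(point, frame_id) -> (u, v) pixel coords, or None
                if the point is not visible in that frame (a stand-in
                for the camera model from the reconstruction)
    min_frames: assumed agreement threshold, not specified in the post
    """
    votes = defaultdict(lambda: defaultdict(set))  # point -> label -> {frames}
    for i, point in enumerate(points):
        for frame_id, label, (x0, y0, x1, y1) in detections:
            uv = project(point, frame_id)
            if uv is None:
                continue
            u, v = uv
            if x0 <= u <= x1 and y0 <= v <= y1:    # point lands inside the box
                votes[i][label].add(frame_id)
    # keep only labels confirmed by enough distinct frames
    return {i: {lab for lab, frames in labels.items()
                if len(frames) >= min_frames}
            for i, labels in votes.items()}
```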
When the agent commits to a candidate region, it scores every source frame for how clearly that region shows up: coverage C, projected framing A, view angle V, occlusion O, and image sharpness Q. The winning frame is what the user sees in the chat, with the bounding box drawn back onto the original footage.
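The post names the five factors but not how they combine; here is a hedged sketch that assumes each factor is normalized to [0, 1] and merged with a weighted sum, with made-up weights:

```python
def frame_score(C, A, V, O, Q, weights=(0.3, 0.2, 0.2, 0.15, 0.15)):
    """Rank one source frame for a candidate region.

    All factors assumed normalized to [0, 1]:
      C: coverage   - fraction of the region visible in the frame
      A: framing    - how well the projected region fills the image
      V: view angle - how head-on the camera faces the region
      O: occlusion  - fraction of the region blocked (penalized)
      Q: sharpness  - image quality / absence of motion blur
    The weighted sum and the weights themselves are illustrative.
    """
    wC, wA, wV, wO, wQ = weights
    return wC * C + wA * A + wV * V + wO * (1.0 - O) + wQ * Q

# The winning frame is then just the argmax over source frames, e.g.:
# best = max(frames, key=lambda f: frame_score(*factors(f)))
```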
If the agent is still unsure, it picks the next viewpoint that should resolve the most uncertainty about the user's query, balancing what it would learn against how far the figurine has to move.
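That trade-off reads like a classic next-best-view objective: expected information gain minus a movement cost. A minimal sketch, where `info_gain`, `travel_cost`, and the trade-off weight `lam` are all stand-ins rather than Ironworld's actual functions:

```python
def pick_next_view(candidates, info_gain, travel_cost, lam=0.5):
    """Greedy next-best-view: choose the viewpoint that maximizes
    expected uncertainty reduction about the query, discounted by
    how far it requires moving. info_gain and travel_cost are
    problem-specific stand-ins; lam is an assumed hyperparameter."""
    return max(candidates,
               key=lambda view: info_gain(view) - lam * travel_cost(view))
```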