
Ironworld

a 3-D scene memory that listens to what workers say
Jie Kai Tao
jietao@ufl.edu
Linwei Zhang
linwei.zhang@ufl.edu
University of Florida · April 2026
In one paragraph. Ironworld watches a short helmet-cam clip of a construction site, builds a 3-D map of everything in it (workers, ladders, tools, the crane), and remembers where each thing is over time. When someone speaks ("lay the next block on this row"), the system tries to figure out which object they're referring to. On a 60-second masonry clip with eleven planted spoken lines (nine job-related, two unrelated chit-chat), it correctly linked all nine work-related lines to a real object on the site, and correctly ignored both pieces of small talk.

1 · The problem

A construction-site video shows a lot at once: workers, machines, hand tools, materials, scaffolding. A simple object detector can list those things, but a useful safety tool needs to answer a harder question: which thing is a worker acting on, and when?

Speech helps, but it's noisy. A foreman might say "lay the next block on this row" while pointing at one block out of dozens. Another worker asks about lunch. Hammering blurs both. The goal is to use speech when it carries real action signal, and quietly drop it when it doesn't.

The 60-second input clip: masonry production from a worker's helmet camera, with eleven planted spoken lines added to the audio track. Two of those lines are deliberately off-task (asking about the game last night; asking about lunch).

2 · How it works, in plain words

The pipeline runs in four stages. Each stage feeds the next.

  1. Build a 3-D model of the scene. A handful of frames from the clip go through VGGT [1], a feed-forward 3-D reconstruction model. The output is a coloured point cloud (a 3-D snapshot of the site you can rotate and pan around), plus the camera's path through it.
  2. Find every object in every frame. An open-vocabulary detector (OWLv2 [2]) gets a list of construction-domain words ("worker", "ladder", "cinder block", "tower crane", "fire extinguisher", and ~40 others) and flags everything matching them in each frame.
  3. Stitch the detections into persistent entities. Each 2-D box is placed at its real 3-D location using the depth from stage 1 (a sketch of this lifting step follows the list). A tracker then groups detections of the same object across frames into a single "entity" with a stable identity, a position over time, and a confidence that it actually exists. Things seen only once or twice get pruned.
  4. Match speech to entities. When a spoken line lands at a particular timestamp, the system scores every tracked entity against it on five things: how well the spoken noun matches what the entity is, how close in time, how close to the worker's hand, how much it's moving, and whether the verb makes sense for that kind of object. The best-scoring match is the attribution. If nothing in the scene matches the noun, the line is dropped as off-task. (A sketch of this scoring step follows the figure below.)
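To make stage 3's lifting step concrete, here is a minimal sketch of placing a 2-D detection box at a 3-D world position using a per-frame depth map and camera pose from stage 1. The function name, the median-depth heuristic, and the matrix conventions are illustrative assumptions, not Ironworld's actual code.

```python
import numpy as np

def backproject_box(box_xyxy, depth, K, cam_to_world):
    """Lift a 2-D detection box to a 3-D world point (illustrative sketch).

    box_xyxy     : (x0, y0, x1, y1) pixel box from the detector
    depth        : (H, W) per-pixel depth for this frame, from the stage-1 reconstruction
    K            : (3, 3) camera intrinsics
    cam_to_world : (4, 4) camera pose for this frame
    """
    x0, y0, x1, y1 = (int(round(v)) for v in box_xyxy)
    u, v = (x0 + x1) / 2.0, (y0 + y1) / 2.0            # box centre in pixels
    z = float(np.median(depth[y0:y1, x0:x1]))          # median depth is robust to stray pixels
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    # World coordinates, so detections of the same object from different frames line up.
    return (cam_to_world @ p_cam)[:3]
```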
Scene 01 mid-frame
Roughly the middle of the clip. The live player renders the coloured point cloud and a small radar cone showing where the helmet camera is pointing as you scrub through the video.
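Here is a minimal sketch of the stage-4 scoring described above, assuming the tracker exposes per-entity label, existence confidence, hand distance, motion, and last-seen time. The field names, synonym and verb tables, weights, decay constants, and the 0.3 threshold are all illustrative assumptions, not the system's actual values.

```python
from dataclasses import dataclass
import math

@dataclass
class Entity:
    """One tracked object; the fields are illustrative, not Ironworld's actual schema."""
    label: str              # e.g. "concrete block"
    existence: float        # confidence the entity really exists, 0..1
    dist_to_hand_m: float   # distance from the speaking worker's hand at utterance time
    speed_m_s: float        # how fast it is moving around the utterance
    last_seen_s: float      # timestamp of its most recent detection

# Toy lexicons; a real deployment would use larger, domain-tuned tables.
NOUN_SYNONYMS = {"block": "concrete block", "crane": "tower crane"}
VERB_FITS = {"lay": {"concrete block"}, "climb": {"ladder"}, "grab": {"hammer", "saw"}}

def score(e: Entity, noun: str, verb: str, t_s: float) -> float:
    """Blend the five cues into one score; weights and decay constants are assumptions."""
    target = NOUN_SYNONYMS.get(noun, noun)
    if target not in e.label:
        return 0.0                                        # no noun match: cannot be the referent
    time_fit = math.exp(-abs(t_s - e.last_seen_s) / 2.0)  # seen close to the utterance
    hand_fit = math.exp(-e.dist_to_hand_m / 1.5)          # near the worker's hand
    motion_fit = min(e.speed_m_s / 0.5, 1.0)              # being handled right now
    verb_fit = 1.0 if e.label in VERB_FITS.get(verb, set()) else 0.5
    # 0.4 is the noun-match weight (the noun already matched above).
    return e.existence * (0.4 + 0.2 * time_fit + 0.2 * hand_fit
                          + 0.1 * motion_fit + 0.1 * verb_fit)

def attribute_line(entities, noun, verb, t_s, threshold=0.3):
    """Return the best-matching entity, or None when the line should be dropped as off-task."""
    best = max(entities, key=lambda e: score(e, noun, verb, t_s), default=None)
    if best is None or score(best, noun, verb, t_s) < threshold:
        return None
    return best

# The 8 s line: "lay the next block on this row".
block = Entity("concrete block", existence=0.93, dist_to_hand_m=1.6,
               speed_m_s=0.1, last_seen_s=7.8)
print(attribute_line([block], noun="block", verb="lay", t_s=8.0))
```

The design point the sketch preserves is that the noun match acts as a hard gate: with no matching noun the score is zero, which is what lets off-task lines fall out before any geometric reasoning happens.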

3 · What it produced for this clip

After stage 3, the scene settled on 32 stable entities: 4 workers, 5 cranes (3 of them clearly tower cranes), 4 ladders, scaffold, a hammer, a saw, a fire extinguisher, a bucket, and a long tail of materials.


The most useful behaviour is in stage 4, where speech is matched to objects. Two examples, one job-related and one chit-chat, make it concrete.

Job-related line: "Lay the next block on this row" at 8 s.

Match candidate: concrete block (existence 0.93, ~1.6 m from hand)
Noun match: "block" → concrete block ✓
Timing: lined up with the spoken line ✓
Verb fit: "lay" is a placing verb ✓
→ Attributed to that block.

Off-task line: "Did you see the game last night?" at 33 s.

noun match "game" / "night" โ€” nothing in the scene matches โ†’ filtered out (no on-site object fits)

4 · Results

All nine work-related lines were attributed to an on-site object, both off-task lines were dropped, and the reconstruction produced a 120k-point 3-D cloud.

5 · What this doesn't do yet

References

  1. J. Wang et al., "VGGT: Visual Geometry Grounded Transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.
  2. M. Minderer, A. Gritsenko, and N. Houlsby, "Scaling Open-Vocabulary Object Detection," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023.