Ironworld
1 · The problem
A construction-site video shows a lot at once: workers, machines, hand tools, materials, scaffolding. A simple object detector can list those things, but a useful safety tool needs to answer a harder question: which thing is a worker acting on, and when?
Speech helps, but it's noisy. A foreman might say "lay the next block on this row" while pointing at one block out of dozens. Another worker asks about lunch. Hammering drowns out both. The goal is to use speech when it carries a real action signal, and to quietly drop it when it doesn't.
2 · How it works, in plain words
The pipeline runs in four stages. Each stage feeds the next.
- Build a 3-D model of the scene. A handful of frames from the clip go through VGGT [1], a feed-forward 3-D reconstruction model. The output is a coloured point cloud (a 3-D snapshot of the site you can rotate and pan around) plus the camera's path through it.
- Find every object in every frame. An open-vocabulary detector (OWLv2 [2]) gets a list of construction-domain words ("worker", "ladder", "cinder block", "tower crane", "fire extinguisher", and ~40 others) and flags everything matching them in each frame; a minimal detection sketch follows this list.
- Stitch the detections into persistent entities. Each 2-D box is placed at its 3-D location using the depth from stage 1 (the back-projection sketch after this list shows that step). A tracker then groups detections of the same object across frames into a single "entity" with a stable identity, a position over time, and a confidence that it actually exists. Things seen only once or twice get pruned.
- Match speech to entities. When a spoken line lands at a particular timestamp, the system scores every tracked entity against it on five signals: how well the spoken noun matches what the entity is, how close in time, how close to the worker's hand, how much it's moving, and whether the verb makes sense for that kind of object. The best-scoring match gets the attribution. If nothing in the scene matches the noun, the line is dropped as off-task (the scoring sketch after this list shows the shape of this step).
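Stage 2 can be run with the off-the-shelf OWLv2 checkpoint from the transformers library. A minimal sketch, assuming the google/owlv2-base-patch16-ensemble weights; the five-word vocabulary, the frame filename, and the 0.3 threshold are placeholders rather than the system's actual values.

    import torch
    from PIL import Image
    from transformers import Owlv2Processor, Owlv2ForObjectDetection

    # Placeholder vocabulary; the real list has ~45 construction-domain words.
    VOCAB = ["worker", "ladder", "cinder block", "tower crane", "fire extinguisher"]

    processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

    image = Image.open("frame_0008.jpg")              # one sampled frame
    inputs = processor(text=[VOCAB], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Turn logits into per-frame boxes; the threshold is hand-picked (section 5).
    target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.3, target_sizes=target_sizes)[0]
    for box, score, label in zip(detections["boxes"], detections["scores"],
                                 detections["labels"]):
        print(VOCAB[int(label)], round(float(score), 2),
              [round(v) for v in box.tolist()])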
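The 2-D-to-3-D placement in stage 3 is standard pinhole geometry: undo the camera intrinsics, scale the resulting ray by the depth predicted at the box centre, then move into world coordinates with the camera pose from stage 1. A sketch, assuming K is a 3x3 intrinsics matrix and cam_to_world a 4x4 pose, in whatever form the reconstruction reports them:

    import numpy as np

    def backproject(u, v, depth, K, cam_to_world):
        """Lift a 2-D box centre (u, v) at a given depth into world coordinates.

        Depth is in the reconstruction's scene-relative units, not metres.
        """
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])    # pixel -> camera-frame ray
        cam_point = depth * ray                           # scale ray by predicted depth
        R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]  # rotation and translation
        return R @ cam_point + t                          # camera frame -> world frame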
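And stage 4's five-signal score can be pictured as a weighted sum over a small entity record. A minimal sketch; the field names, the weights, the 5-second time window, and the distance falloff are illustrative assumptions, not the system's tuned values.

    from dataclasses import dataclass

    @dataclass
    class Entity:
        cls: str          # detector class, e.g. "cinder block"
        last_seen: float  # timestamp of the latest detection (s)
        position: tuple   # 3-D position, scene-relative units
        speed: float      # recent motion magnitude, scene-relative
        confidence: float # tracker's confidence that the entity exists

    # Hypothetical weights over the five signals named above.
    W = {"noun": 0.4, "time": 0.2, "hand": 0.2, "motion": 0.1, "verb": 0.1}

    def score_entity(e, noun, verb, t, hand_pos, verb_table):
        noun_s = 1.0 if noun in e.cls else 0.0                      # noun vs class
        time_s = max(0.0, 1.0 - abs(t - e.last_seen) / 5.0)         # 5 s window
        dist = sum((a - b) ** 2 for a, b in zip(e.position, hand_pos)) ** 0.5
        hand_s = 1.0 / (1.0 + dist)                                 # hand proximity
        motion_s = min(1.0, e.speed)                                # is it moving?
        verb_s = 1.0 if e.cls in verb_table.get(verb, ()) else 0.0  # verb fits class?
        return (W["noun"] * noun_s + W["time"] * time_s + W["hand"] * hand_s
                + W["motion"] * motion_s + W["verb"] * verb_s)

    def attribute(noun, verb, t, hand_pos, entities, verb_table):
        # Off-task gate: if no tracked entity matches the noun, drop the line.
        if not any(noun in e.cls for e in entities):
            return None
        return max(entities, key=lambda e: score_entity(
            e, noun, verb, t, hand_pos, verb_table))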
3 · What it produced for this clip
After stage 3, the scene settled on 32 stable entities: 4 workers, 5 cranes (3 of them clearly tower cranes), 4 ladders, a scaffold, a hammer, a saw, a fire extinguisher, a bucket, and a long tail of materials.
The most useful behaviour is in stage 4, where speech is matched to objects. Two examples, one job-related and one chit-chat, make it concrete.
- Job-related line: "Lay the next block on this row" at 8 s. The noun "block" matches tracked block entities, so the best-scoring one, closest in time and to the worker's hand, gets the attribution.
- Off-task line: "Did you see the game last night?" at 33 s. No entity in the scene has a class matching "game", so the line is dropped and nothing is attributed. The snippet below traces both lines through the scoring sketch from section 2.
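Continuing the Entity and attribute sketch above, a toy trace of the two lines; every value here is invented for illustration.

    # Two of the 32 entities, with made-up values (see the Entity sketch above).
    entities = [
        Entity(cls="cinder block", last_seen=7.6, position=(1.2, 0.4, 3.1),
               speed=0.8, confidence=0.9),
        Entity(cls="ladder", last_seen=2.0, position=(4.0, 0.0, 1.0),
               speed=0.0, confidence=0.8),
    ]
    verb_table = {"lay": ("cinder block",)}   # hypothetical verb-to-class table

    # "Lay the next block on this row" at 8 s: "block" matches the cinder block,
    # which also wins on time, hand proximity, motion, and verb fit.
    print(attribute("block", "lay", 8.0, (1.0, 0.5, 3.0), entities, verb_table))

    # "Did you see the game last night?" at 33 s: no class contains "game",
    # so the gate returns None and the line is dropped as off-task.
    print(attribute("game", "see", 33.0, (1.0, 0.5, 3.0), entities, verb_table))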
4 · Results
- 9 of 9 job-related spoken lines were linked to a real object in the scene. In 8 of those 9, the object's class matched the noun the worker actually said; the ninth ("this last one") had no specific noun and was resolved to a nearby block by context.
- 2 of 2 off-task lines (sports talk, lunch) were correctly dropped: no entity in the scene had a class matching the spoken noun, so nothing got attributed.
- Processing the full clip (3-D reconstruction, detection, tracking, and speech matching) finishes in well under two minutes end-to-end.
5 · What this doesn't do yet
- Faces are blurred in the source footage. The dataset anonymises worker faces, so we don't try to verify hardhat or vest compliance from the video. The system's safety reports phrase findings as things to check ("verify X"), not as confirmed violations.
- Distances are scene-relative, not metric. The 3-D reconstruction doesn't know how big a metre is, so all distances are "roughly twice as far as that other thing" rather than "1.8 m". Anything that needs real units (e.g. a crane swing radius) would need a separate calibration step; a one-line version of that step is sketched after this list.
- The detection threshold is hand-picked. The cutoff that decides "yes, that's a ladder" was chosen by eye on this footage. A learned, per-class cutoff would probably catch a few more rare items.
- One short clip is a narrow test. Longer footage where workers leave and come back, or scenes that cut between locations, would stress the tracker in ways this evaluation doesn't.
- The audio is synthesised. The source clip has no audio track, so the spoken lines were generated and mixed onto procedural site ambience. Real microphone audio with real speech-recognition errors is the next milestone.
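That calibration step can be as small as anchoring the scene scale to one object of known size. A hypothetical sketch; the 0.40 m block length and the 0.13 scene-unit measurement are invented numbers.

    # Hypothetical calibration: pin the scene scale to one known dimension.
    KNOWN_LENGTH_M = 0.40        # assumed real-world length of a standard block

    def metric_scale(ref_length_scene_units, known_length_m=KNOWN_LENGTH_M):
        """Scene-units-to-metres factor from a single known dimension."""
        return known_length_m / ref_length_scene_units

    scale = metric_scale(0.13)   # e.g. the reference block spans 0.13 scene units
    print(f"1 scene unit = {scale:.2f} m")  # distances can now be quoted in metres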
References
[1] J. Wang et al., "VGGT: Visual Geometry Grounded Transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.
[2] M. Minderer, A. Gritsenko, and N. Houlsby, "Scaling Open-Vocabulary Object Detection," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023.