Ironworld
1 · The problem
A construction-site video shows a lot at once: workers, machines, hand tools, materials, scaffolding. A simple object detector can list those things, but a useful safety tool needs to answer a harder question: which thing is a worker acting on, and when?
Speech helps, but it's noisy. A foreman might say "lay the next block on this row" while pointing at one block out of dozens. Another worker asks about lunch. Hammering drowns out both. The goal is to use speech when it carries a real action signal, and to quietly drop it when it doesn't.
2 · How it works, in plain words
The pipeline runs in four stages. Each stage feeds the next.
- Build a 3-D model of the scene. A handful of frames from the clip go through VGGT [1], a feed-forward 3-D reconstruction model. The output is a coloured point cloud (a 3-D snapshot of the site you can rotate and pan around) plus the camera's path through it.
- Find every object in every frame. An open-vocabulary detector (OWLv2 [2]) gets a list of construction-domain words ("worker", "ladder", "cinder block", "tower crane", "fire extinguisher", and ~40 others) and flags everything matching them in each frame; a minimal detection sketch follows this list.
- Stitch the detections into persistent entities. Each 2-D box is placed at its 3-D location using the depth from stage 1 (the back-projection sketch after this list shows that step). A tracker then groups detections of the same object across frames into a single "entity" with a stable identity, a position over time, and a confidence that it actually exists. Things seen only once or twice get pruned.
- Match speech to entities. When a spoken line lands at a particular timestamp, the system scores every tracked entity against it on five signals: how well the spoken noun matches what the entity is, how close in time, how close to the worker's hand, how much it's moving, and whether the verb makes sense for that kind of object. The best-scoring match gets the attribution. If nothing in the scene matches the noun, the line is dropped as off-task (the scoring sketch after this list shows the shape of this step).
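Stage 2 can be run with the off-the-shelf OWLv2 checkpoint from the transformers library. A minimal sketch, assuming the google/owlv2-base-patch16-ensemble weights; the five-word vocabulary, the frame filename, and the 0.3 threshold are placeholders rather than the system's actual values.

    import torch
    from PIL import Image
    from transformers import Owlv2Processor, Owlv2ForObjectDetection

    # Placeholder vocabulary; the real list has ~45 construction-domain words.
    VOCAB = ["worker", "ladder", "cinder block", "tower crane", "fire extinguisher"]

    processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

    image = Image.open("frame_0008.jpg")              # one sampled frame
    inputs = processor(text=[VOCAB], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Turn logits into per-frame boxes; the threshold is hand-picked (section 5).
    target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.3, target_sizes=target_sizes)[0]
    for box, score, label in zip(detections["boxes"], detections["scores"],
                                 detections["labels"]):
        print(VOCAB[int(label)], round(float(score), 2),
              [round(v) for v in box.tolist()])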
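The 2-D-to-3-D placement in stage 3 is standard pinhole geometry: undo the camera intrinsics, scale the resulting ray by the depth predicted at the box centre, then move into world coordinates with the camera pose from stage 1. A sketch, assuming K is a 3x3 intrinsics matrix and cam_to_world a 4x4 pose, in whatever form the reconstruction reports them:

    import numpy as np

    def backproject(u, v, depth, K, cam_to_world):
        """Lift a 2-D box centre (u, v) at a given depth into world coordinates.

        Depth is in the reconstruction's scene-relative units, not metres.
        """
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])    # pixel -> camera-frame ray
        cam_point = depth * ray                           # scale ray by predicted depth
        R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]  # rotation and translation
        return R @ cam_point + t                          # camera frame -> world frame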
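And stage 4's five-signal score can be pictured as a weighted sum over a small entity record. A minimal sketch; the field names, the weights, the 5-second time window, and the distance falloff are illustrative assumptions, not the system's tuned values.

    from dataclasses import dataclass

    @dataclass
    class Entity:
        cls: str          # detector class, e.g. "cinder block"
        last_seen: float  # timestamp of the latest detection (s)
        position: tuple   # 3-D position, scene-relative units
        speed: float      # recent motion magnitude, scene-relative
        confidence: float # tracker's confidence that the entity exists

    # Hypothetical weights over the five signals named above.
    W = {"noun": 0.4, "time": 0.2, "hand": 0.2, "motion": 0.1, "verb": 0.1}

    def score_entity(e, noun, verb, t, hand_pos, verb_table):
        noun_s = 1.0 if noun in e.cls else 0.0                      # noun vs class
        time_s = max(0.0, 1.0 - abs(t - e.last_seen) / 5.0)         # 5 s window
        dist = sum((a - b) ** 2 for a, b in zip(e.position, hand_pos)) ** 0.5
        hand_s = 1.0 / (1.0 + dist)                                 # hand proximity
        motion_s = min(1.0, e.speed)                                # is it moving?
        verb_s = 1.0 if e.cls in verb_table.get(verb, ()) else 0.0  # verb fits class?
        return (W["noun"] * noun_s + W["time"] * time_s + W["hand"] * hand_s
                + W["motion"] * motion_s + W["verb"] * verb_s)

    def attribute(noun, verb, t, hand_pos, entities, verb_table):
        # Off-task gate: if no tracked entity matches the noun, drop the line.
        if not any(noun in e.cls for e in entities):
            return None
        return max(entities, key=lambda e: score_entity(
            e, noun, verb, t, hand_pos, verb_table))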
3 · What it produced for this clip
After stage 3, the scene settled on 32 stable entities: 4 workers, 5 cranes (3 of them clearly tower cranes), 4 ladders, a scaffold, a hammer, a saw, a fire extinguisher, a bucket, and a long tail of materials.
The most useful behaviour is in stage 4, where speech is matched to objects. Two examples, one job-related and one chit-chat, make it concrete.
- Job-related line: "Lay the next block on this row" at 8 s. The noun "block" matches tracked block entities, so the best-scoring one, closest in time and to the worker's hand, gets the attribution.
- Off-task line: "Did you see the game last night?" at 33 s. No entity in the scene has a class matching "game", so the line is dropped and nothing is attributed. The snippet below traces both lines through the scoring sketch from section 2.
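Continuing the Entity and attribute sketch above, a toy trace of the two lines; every value here is invented for illustration.

    # Two of the 32 entities, with made-up values (see the Entity sketch above).
    entities = [
        Entity(cls="cinder block", last_seen=7.6, position=(1.2, 0.4, 3.1),
               speed=0.8, confidence=0.9),
        Entity(cls="ladder", last_seen=2.0, position=(4.0, 0.0, 1.0),
               speed=0.0, confidence=0.8),
    ]
    verb_table = {"lay": ("cinder block",)}   # hypothetical verb-to-class table

    # "Lay the next block on this row" at 8 s: "block" matches the cinder block,
    # which also wins on time, hand proximity, motion, and verb fit.
    print(attribute("block", "lay", 8.0, (1.0, 0.5, 3.0), entities, verb_table))

    # "Did you see the game last night?" at 33 s: no class contains "game",
    # so the gate returns None and the line is dropped as off-task.
    print(attribute("game", "see", 33.0, (1.0, 0.5, 3.0), entities, verb_table))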
4 · Results
- 9 of 9 job-related spoken lines were linked to a real object in the scene. In 8 of those 9, the object's class matched the noun the worker actually said; the ninth ("this last one") had no specific noun and was resolved to a nearby block by context.
- 2 of 2 off-task lines (sports talk, lunch) were correctly dropped: no entity in the scene had a class matching the spoken noun, so nothing got attributed.
- Processing the full clip (3-D reconstruction, detection, tracking, and speech matching) finishes in well under two minutes end-to-end.
5 · What this doesn't do yet
- Faces are blurred in the source footage. The dataset anonymises worker faces, so we don't try to verify hardhat or vest compliance from the video. The system's safety reports phrase findings as things to check ("verify X"), not as confirmed violations.
- Distances are scene-relative, not metric. The 3-D reconstruction doesn't know how big a metre is, so all distances are "roughly twice as far as that other thing" rather than "1.8 m". Anything that needs real units (e.g. a crane swing radius) would need a separate calibration step; a one-line version of that step is sketched after this list.
- The detection threshold is hand-picked. The cutoff that decides "yes, that's a ladder" was chosen by eye on this footage. A learned, per-class cutoff would probably catch a few more rare items.
- One short clip is a narrow test. Longer footage where workers leave and come back, or scenes that cut between locations, would stress the tracker in ways this evaluation doesn't.
- The audio is synthesised. The source clip has no audio track, so the spoken lines were generated and mixed onto procedural site ambience. Real microphone audio with real speech-recognition errors is the next milestone.
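That calibration step can be as small as anchoring the scene scale to one object of known size. A hypothetical sketch; the 0.40 m block length and the 0.13 scene-unit measurement are invented numbers.

    # Hypothetical calibration: pin the scene scale to one known dimension.
    KNOWN_LENGTH_M = 0.40        # assumed real-world length of a standard block

    def metric_scale(ref_length_scene_units, known_length_m=KNOWN_LENGTH_M):
        """Scene-units-to-metres factor from a single known dimension."""
        return known_length_m / ref_length_scene_units

    scale = metric_scale(0.13)   # e.g. the reference block spans 0.13 scene units
    print(f"1 scene unit = {scale:.2f} m")  # distances can now be quoted in metres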
References
[1] J. Wang et al., "VGGT: Visual Geometry Grounded Transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.
[2] M. Minderer, A. Gritsenko, and N. Houlsby, "Scaling Open-Vocabulary Object Detection," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023.