ironworld

World Reconstruction for ML-based Construction Analytics

For the Hacktech x Ironsite Challenge, we experimented with, iterated on, and compiled use cases built on recent advances in computer vision.

Ironworld scene memory

Ask questions about narrated construction site footage and get automated safety analysis.
Live · 14 scenes · 32 OSHA citations
— — — Additional Experiments — — —

HandClust · egocentric task discovery

Automatically discovers and groups similar work activities from first-person construction video. Per-clip activity clustering with WiLoR-augmented hand-skeleton overlay, Gemini cluster labels, and a 3-D embedding view.
5 clips · 1800 segments · 21-47% hand coverage
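
As a rough illustration of the hand-coverage figure, the sketch below assumes coverage means the fraction of frames in which a hand detector fired; the page does not define it, so treat that as an assumption. The function names and the per-frame boolean input are hypothetical.

```python
def hand_coverage(hand_visible):
    """Fraction of frames with a detected hand (assumed definition of coverage)."""
    flags = list(hand_visible)
    if not flags:
        raise ValueError("need at least one frame")
    return sum(1 for flag in flags if flag) / len(flags)


def longest_gap(hand_visible):
    """Length, in frames, of the longest run with no detected hand."""
    longest = current = 0
    for flag in hand_visible:
        current = 0 if flag else current + 1
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    # Hypothetical per-frame detector output: hand seen in 3 of 10 frames.
    flags = [True, True, True] + [False] * 7
    print(f"coverage: {hand_coverage(flags):.0%}, longest gap: {longest_gap(flags)} frames")
```

Reporting the longest gap alongside coverage is a design choice of this sketch: two clips with the same coverage can differ a lot in how long the hands stay out of view.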

Hand + tool segmentation

Zero-shot text-prompt segmentation of hands (gloved or bare), PPE, and held tools on construction footage. Compares SAM 3.1 image, SAM 3.1 multiplex video tracker, OWLv2 → SAM 2.1, and MediaPipe → SAM 2.1.
SAM 3.1 multiplex · 50 frames · 6 s tracking clip

Movement clustering · 6-modality ablation

When the hand is gone for 70-90% of frames, what else can we cluster on? Side-by-side comparison of V-JEPA, OWLv2 tool-presence, optical-flow rhythm, ego-motion, body-pose, and a late-fusion baseline across all 5 featured demo clips.
6 modalities · 5 clips · RepNet stand-in
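
The page does not say how the late-fusion baseline is built. One common reading is to compute a similarity matrix per modality and average them, so a segment pair is judged on every available signal rather than on the hand alone. A minimal sketch under that assumption follows; the function names, modality keys, and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def late_fusion_similarity(features, weights=None):
    """Weighted average of per-modality segment-by-segment cosine similarities.

    features: dict mapping modality name -> list of per-segment vectors.
    weights:  optional dict mapping modality name -> non-negative weight.
    """
    names = sorted(features)
    n = len(features[names[0]])
    if any(len(features[m]) != n for m in names):
        raise ValueError("every modality needs one vector per segment")
    if weights is None:
        weights = {m: 1.0 for m in names}
    total = sum(weights[m] for m in names)
    fused = [[0.0] * n for _ in range(n)]
    for m in names:
        w = weights[m] / total
        vecs = features[m]
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * cosine(vecs[i], vecs[j])
    return fused


if __name__ == "__main__":
    # Toy features for three segments in two hypothetical modalities.
    feats = {
        "optical_flow": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
        "ego_motion": [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]],
    }
    for row in late_fusion_similarity(feats):
        print([round(v, 3) for v in row])
```

The fused matrix can be handed to any clustering method that accepts precomputed similarities. Fusing at the similarity level keeps each modality's feature dimensionality from dominating the result.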

3D-recon shootout · VGGT vs alternatives

Same 32 frames per clip, fed through ~20 feed-forward / SLAM / SfM / Gaussian / mono-depth methods. Side-by-side point clouds, runtimes, and quality. The headline question: is anything actually better than VGGT for Ironsite footage?
3 clips · ~20 methods · Plotly 3D
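
A fair comparison depends on every method seeing the same 32 frames. The page does not state how those frames are picked, so the sketch below assumes evenly spaced sampling across the clip; `sample_frame_indices` is a hypothetical helper.

```python
def sample_frame_indices(num_frames, num_samples=32):
    """Pick evenly spaced frame indices, always including the first and last frame.

    Assumption: uniform temporal sampling. If the clip has fewer frames than
    requested, every frame is returned.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


if __name__ == "__main__":
    # Hypothetical clip of 900 frames, e.g. 30 s at 30 fps.
    indices = sample_frame_indices(900)
    print(len(indices), indices[:4], indices[-1])
```

Computing the index list once per clip and reusing it for every reconstruction method keeps the inputs identical, so differences in the point clouds come from the methods and not from the frame selection.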