
HandClust

discovering work activities in helmet-cam construction footage, without labels
Jie Kai Tao
jietao@ufl.edu
Linwei Zhang
linwei.zhang@ufl.edu
University of Florida · April 2026
HandClust looks at hours of construction footage filmed from a worker's helmet camera and groups the moments by what's happening: laying bricks, scooping mortar, walking between piles, idle time. It does this without anyone labelling the video. The labels for each group are then filled in automatically by a vision-language model.

1 · The problem

The data is just under five hours of helmet-cam masonry footage spread across 14 clips. Filenames hint at a coarse phase (production, prep, transit, downtime), but nothing in the video itself is labelled. The goal is to discover the activities that actually repeat across the footage and produce a per-clip timeline showing which activity is happening at each moment.

The obvious first idea, tracking the worker's hands frame by frame, falls apart on this footage. Workers wear gloves, the lens is wide-angle, and hands constantly leave the field of view. An off-the-shelf hand tracker (MediaPipe [3]) finds a hand in only 0.5–12.5% of frames, depending on the clip. Anything that needs hand keypoints is dead on arrival, so the design treats hand pose as a weak hint and leans on visual scene cues instead.
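
For context, a detection-rate number like that can be measured directly with MediaPipe's Python API. The sketch below is a minimal version, assuming the mediapipe and opencv-python packages and the same 5 fps sampling the pipeline uses; the clip path is hypothetical.

```python
# Minimal sketch: measure how often MediaPipe finds a hand in sampled frames.
# Assumes the mediapipe and opencv-python packages; the clip path is hypothetical.
import cv2
import mediapipe as mp

def hand_detection_rate(video_path, sample_fps=5):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))
    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)
    sampled = detected = frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            sampled += 1
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks:  # None when no hand is found
                detected += 1
        frame_idx += 1
    cap.release()
    hands.close()
    return detected / max(sampled, 1)

print(f"hand visible in {hand_detection_rate('clip_01.mp4'):.1%} of sampled frames")
```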

What the data looks like, with the final classifier overlaid. Two minutes of clip 01 (production masonry) with a per-frame activity label and a rolling timeline strip. Notice how often the visible hand is gloved, partly occluded, or completely out of frame.

2 · How it works, in plain words

The pipeline runs four stages over each clip. Each stage is unsupervised โ€” no labels go in.

  1. Sample frames at a fixed rate. Each clip is decoded at 5 frames per second. For every sampled frame the system records three things: a feature vector from a strong image model (DINOv2 [1]) that captures the scene, an optical-flow summary that captures how things are moving, and any hand keypoints the tracker happened to find.
  2. Cut the clip into short segments. Three different segment detectors look for "something just changed": one watches motion, one watches scene appearance, and one watches statistical change-points. Their proposals are averaged into a single set of segment boundaries (one way to do this is sketched after this list). Using three detectors with different blind spots is steadier than relying on any one.
  3. Describe each segment. Every segment gets a video-level embedding from V-JEPA-2 [2] (a self-supervised video model trained for physical-world understanding) plus the average of the per-frame scene and motion features.
  4. Group segments by similarity. The three feature types are compared and combined so that confident signals can outweigh ambiguous ones (one such weighting scheme is sketched after this list). Segments that the combination judges similar end up in the same cluster; these are the discovered activities. A vision-language model (Gemini [5]) then writes a short label for each cluster ("scooping mortar", "climbing scaffold", and so on).
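
The report describes the stage-2 fusion only as "averaged", so the sketch below is one plausible reading rather than the authors' exact method: each detector's proposed cut frames are smeared into a soft vote curve, the three curves are averaged, and peaks with enough cross-detector agreement become segment boundaries.

```python
# One plausible reading of "proposals are averaged": soft-vote boundary fusion.
import numpy as np

def fuse_boundaries(proposals, n_frames, sigma=5.0, min_gap=10):
    """proposals: one array of proposed boundary frame indices per detector."""
    t = np.arange(n_frames)
    score = np.zeros(n_frames)
    for cuts in proposals:
        for b in cuts:
            score += np.exp(-0.5 * ((t - b) / sigma) ** 2)  # soft vote around each cut
    score /= len(proposals)
    # Keep peaks where roughly two of three detectors agree, spaced apart.
    kept = []
    for i in np.argsort(score)[::-1]:
        if score[i] < 0.5:
            break
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(int(i))
    return sorted(kept)

# Toy demo: three detectors roughly agree on cuts near frames 100 and 500;
# the lone proposal at frame 900 doesn't reach the agreement threshold.
print(fuse_boundaries([[100, 500], [103, 498, 900], [98, 505]], n_frames=1000))
```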
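
Stage 4's "confident signals can outweigh ambiguous ones" is similarly underspecified. Below is a minimal sketch of one such scheme: each feature type's cosine-similarity matrix is weighted by how spread out its similarities are before the fused matrix is clustered. The weighting heuristic and the choice of scikit-learn's AgglomerativeClustering are assumptions, not the authors' method.

```python
# Minimal sketch of stage 4: fuse per-feature cosine similarities, weighting
# each feature type by how spread out its similarities are (a stand-in for
# "confidence"), then cluster the fused matrix. The weighting heuristic and
# the clustering algorithm are assumptions, not the authors' exact method.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

def fused_clusters(feature_sets, n_clusters=20):
    """feature_sets: dict mapping feature name -> (n_segments, dim) array."""
    sims, weights = [], []
    for X in feature_sets.values():
        Xn = normalize(X)              # L2-normalise rows
        S = Xn @ Xn.T                  # cosine similarity between segments
        sims.append(S)
        weights.append(S.std())        # spread-out similarities = more decisive
    fused = np.average(sims, axis=0, weights=weights)
    # Requires scikit-learn >= 1.2 (metric= replaced affinity=).
    model = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    )
    return model.fit_predict(1.0 - fused)   # distance = 1 - similarity

# Toy demo on random stand-in features for 200 segments.
rng = np.random.default_rng(0)
labels = fused_clusters({
    "motion": rng.normal(size=(200, 16)),
    "scene":  rng.normal(size=(200, 768)),    # e.g. a DINOv2-sized vector
    "video":  rng.normal(size=(200, 1024)),   # e.g. a V-JEPA-2-sized vector
})
print(labels[:10])
```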
Cluster preview reel: 96 seconds of footage organised into 20 activities, with three example segments per activity. Each title card is the auto-generated label for that cluster.

3 · Results

The only ground truth available is the coarse phase label baked into each clip's filename, so that's what the evaluation tests against. Two questions: do the discovered clusters line up with the phase label, and does using all three feature types together actually beat using any one?

2-D layout of segments by phase
A 2-D map of every segment from four clips (1,460 segments: three production clips plus one transit clip), arranged so that similar segments sit close together. Different phases land in cleanly separated regions.
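
The report doesn't say which 2-D projection was used; t-SNE (sketched below on stand-in data) and UMAP are the usual choices for this kind of map.

```python
# Minimal sketch: project segment embeddings to 2-D with t-SNE.
# Stand-in random data; in the real pipeline, emb would be the stage-3
# segment descriptors and phase would come from the clip filenames.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(1460, 256))       # hypothetical segment embeddings
phase = rng.integers(0, 2, size=1460)    # 0 = production, 1 = transit

xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], c=phase, s=4, cmap="coolwarm")
plt.title("Segments coloured by filename phase")
plt.show()
```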

Comparing feature types (5 clips, 1,800 segments, 3 phases)

"Phase score" below is how well the discovered clusters agree with the phase label โ€” 1.0 is perfect agreement, 0 is random.

which features were used                      phase score
motion only                                   0.14
scene only (DINOv2)                           0.77
video only (V-JEPA-2)                         0.81
motion + scene                                0.63
motion + scene + video (the full pipeline)    0.79
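
The report doesn't name the metric behind "phase score". A chance-adjusted agreement measure such as adjusted mutual information behaves exactly as described (1.0 for perfect agreement, about 0 for random clusters); the sketch below uses it on hypothetical labels.

```python
# Minimal sketch: a chance-adjusted agreement score between filename phases
# and discovered clusters. The labels here are hypothetical.
from sklearn.metrics import adjusted_mutual_info_score

phase_labels   = ["production"] * 6 + ["transit"] * 4   # from filenames
cluster_labels = [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]         # from stage 4
print(adjusted_mutual_info_score(phase_labels, cluster_labels))
```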

Three readings. First, motion alone is hopeless on this footage (0.14); hand-pose dynamics aren't the discriminating signal here, scene context is. Second, V-JEPA-2 on its own is the strongest single feature, which matches what its authors argue: a video model trained for physical-world understanding is a good backbone for unsupervised activity discovery. Third, the full combination stays essentially level with V-JEPA-2 on phase agreement (0.79 vs 0.81) and produces visibly tighter clusters in the 2-D map above.

4 · What this doesn't do yet

References

  1. M. Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," Trans. Mach. Learn. Res., 2024.
  2. M. Assran et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning," arXiv preprint, 2025.
  3. C. Lugaresi et al., "MediaPipe: A Framework for Building Perception Pipelines," arXiv preprint arXiv:1906.08172, 2019.
  4. R. A. Potamias et al., "WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.
  5. Gemini Team, Google, "Gemini: A Family of Highly Capable Multimodal Models," arXiv preprint arXiv:2312.11805, 2023.