ironworld Movement clustering · 6-modality ablation

How do you cluster repetitive movements when the hand is barely visible?

MediaPipe HandLandmarker fires on only 3-21% of source frames in this gloved-fisheye construction footage; even the WiLoR-anchored CSRT tracker (now wired into the handcluster demo) tops out around 47%. That means the hand is gone for most of the video — we need different signals if we want to cluster the worker's repetitive tasks.

Hypothesis: each modality below encodes something about the activity that is independent of the hand's pixel location. The best ones should produce sensible clusters on the 5 featured demo clips even when the hand is invisible.

Read this page top-to-bottom:

  1. Modalities below — what signal each one carries.
  2. Cross-clip consistency — the one rigorous label-free test: which modalities form clusters that span multiple clips (= real activity) vs. clip-private (= visual fingerprinting).
  3. Per-modality cluster quality matrix — quick eyeball of noise rate × clip × modality.
  4. Per-clip drilldowns — timeline strips, SSM matrices, flow rhythm, top-tools per cluster.
  5. Take-aways at the bottom — the five facts that fall out.
V-JEPA only
Kinetics-pretrained temporal video model. Per-segment 1024-D embedding, HDBSCAN on cosine.
OWLv2 tools
Per-segment open-vocab tool-presence histogram (19 prompts). HDBSCAN on log-normed cosine.
Flow rhythm
Per-frame flow_mag windowed FFT → (period, periodicity strength, mean flow). HDBSCAN.
Ego-motion
RANSAC affine on dense flow → (camera, residual, ratio) summaries. Falls back to flow-derived proxy when the egomotion cache isn't built.
Body pose
MediaPipe Pose 33 keypoints at source FPS → 18-D segment summary (shoulder span, elbow angles, body-bob period).
Late fusion
Per-modality z-score, L2-normalize, weighted concat, HDBSCAN on cosine. Combines all five above.
DINOv2-SSM × flow agreement
Cross-modal periodicity consensus (the RepNet stand-in). For each segment we estimate a rep period from (a) DINOv2 self-similarity off-diagonal autocorrelation and (b) flow-magnitude autocorrelation, and accept the period only when both modalities agree within 30%.
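The late-fusion recipe in the modality list (per-modality z-score, L2-normalize, weighted concat, cosine distance) can be sketched roughly as below. This is a minimal illustration, not the page's actual code: `late_fuse`, `cosine_distance_matrix`, the block names and the weights are all hypothetical, and the clustering step itself is left out.

```python
import numpy as np

def late_fuse(blocks, weights):
    """Fuse per-modality feature blocks into one matrix.

    blocks:  dict name -> (n_segments, d_i) array (hypothetical layout)
    weights: dict name -> float, missing names default to 1.0
    """
    fused = []
    for name, x in blocks.items():
        x = np.asarray(x, dtype=float)
        # Per-modality z-score over segments; constant columns are left at 0.
        std = x.std(axis=0)
        z = (x - x.mean(axis=0)) / np.where(std > 0, std, 1.0)
        # L2-normalize each segment row so a wide modality cannot dominate.
        norm = np.linalg.norm(z, axis=1, keepdims=True)
        z = z / np.where(norm > 0, norm, 1.0)
        fused.append(weights.get(name, 1.0) * z)
    return np.hstack(fused)

def cosine_distance_matrix(x):
    """Pairwise cosine distance (1 - cosine similarity) between rows."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norm > 0, norm, 1.0)
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)
```

The resulting distance matrix could then be passed to an HDBSCAN implementation that accepts precomputed distances. The `min_cluster_size` and the per-modality weights used on this page are not stated, so they are deliberately not filled in here.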

Cross-clip consistency · the only label-free quality test

Per-clip noise rates and cluster counts can't tell us whether a modality is doing real activity-clustering or just visual-fingerprinting each clip's own lighting and scene. To get at that, we pool all 610 segments from the 5 featured clips into one matrix per modality and run HDBSCAN once. A modality whose clusters MIX segments from multiple clips is finding shared activity patterns; one whose clusters are clip-pure is just fingerprinting visual style.

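Cross-clip mixing entropy per cluster = $H(p_{\text{clip}}) / \log(n_{\text{clips}})$, where $p_{\text{clip}}$ is the distribution of the cluster's segments over clips and $n_{\text{clips}}$ is the number of distinct clips the cluster touches. A score of 0 means the cluster sits entirely in one clip (the ratio is 0/0 there, so it is taken as 0); 1 means it splits uniformly across every clip it touches. The modality score below is the size-weighted mean across its clusters, reported alongside the fraction of clusters that contain segments from at least 2 different clips.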
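As a concrete reading of the definition above, here is a small sketch that computes both numbers from pooled HDBSCAN labels. The function names are hypothetical, and two conventions are assumptions rather than facts from this page: noise points (label -1) are excluded from the weighting, and a single-clip cluster scores 0.

```python
import numpy as np

def cluster_mixing_entropy(clip_ids):
    """Entropy of the clip distribution in one cluster, normalized to [0, 1]."""
    _, counts = np.unique(np.asarray(clip_ids), return_counts=True)
    if counts.size < 2:
        # Single-clip cluster: H = 0 and log(1) = 0, so define the ratio as 0.
        return 0.0
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(counts.size))

def modality_score(labels, clip_ids, noise_label=-1):
    """Return (size-weighted mean entropy, fraction of multi-clip clusters)."""
    labels = np.asarray(labels)
    clip_ids = np.asarray(clip_ids)
    sizes, entropies, multi = [], [], []
    for c in np.unique(labels):
        if c == noise_label:
            continue  # assumption: noise segments do not count toward the score
        members = clip_ids[labels == c]
        sizes.append(members.size)
        entropies.append(cluster_mixing_entropy(members))
        multi.append(np.unique(members).size >= 2)
    if not sizes:
        return 0.0, 0.0
    mean_entropy = float(np.average(entropies, weights=sizes))
    return mean_entropy, float(np.mean(multi))
```

For example, a cluster of four segments split evenly over two clips scores 1, a two-segment cluster from a single clip scores 0, and the size-weighted mean of the pair is 4/6.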

1. Ego-motion: mean entropy 0.927 · multi-clip clusters 100% · k=6 · noise 324/610
Ego-motion cross-clip medoid grid
Each row is one cross-clip cluster (top 6 by entropy). Each column is a source clip. The thumbnail is the medoid segment of that cluster from that clip. Empty cells = the cluster doesn't contain segments from that clip.
2. Flow rhythm: mean entropy 0.877 · multi-clip clusters 100% · k=21 · noise 201/610
Flow rhythm cross-clip medoid grid
3. Body pose: mean entropy 0.873 · multi-clip clusters 100% · k=3 · noise 38/610
Body pose cross-clip medoid grid
4. V-JEPA only: mean entropy 0.869 · multi-clip clusters 25% · k=4 · noise 19/610
V-JEPA only cross-clip medoid grid
5. OWLv2 tools: mean entropy 0.789 · multi-clip clusters 50% · k=4 · noise 81/610
OWLv2 tools cross-clip medoid grid
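Read of the ranking: body pose, flow rhythm and ego-motion all reach 100% multi-clip clusters, so every cluster they form contains segments from at least two clips, which is what shared motion patterns should look like. Ego-motion pays for this with 324 of 610 segments left as noise, so its top entropy covers under half the data. V-JEPA scores high entropy (0.87), but only 25% of its clusters (1 of 4) are multi-clip; the score is size-weighted, so that one large shared cluster dominates it, while the remaining clusters are clip-private and mostly fingerprint scene appearance. OWLv2 sits in the middle: half of its four clusters span clips, including one big shared "construction work" cluster, and the rest are clip-specific tool niches.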
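A short worked bound may help show how V-JEPA can combine a high mean entropy with only 25% multi-clip clusters. It uses only the definitions above (clip-private clusters score 0, every cluster scores at most 1, the mean is size-weighted). The assumption that noise segments are excluded from the weights is mine, not stated on the page.

$$
\bar{H} = \sum_{c} w_c H_c, \qquad w_c = \frac{n_c}{\sum_{c'} n_{c'}}, \qquad 0 \le H_c \le 1 .
$$

With $k=4$ and 25% multi-clip clusters, exactly one cluster $c^*$ has $H_{c^*} > 0$, so

$$
0.869 = \bar{H} = w_{c^*} H_{c^*} \le w_{c^*} \quad\Longrightarrow\quad w_{c^*} \ge 0.869 .
$$

Under the stated assumption the weights run over $610 - 19 = 591$ clustered segments, which puts at least $0.869 \times 591 \approx 514$ of them in that single shared cluster. The three clip-private clusters therefore share at most about 77 segments between them.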

Per-modality cluster quality (lower noise % = more confident; k = cluster count)

"Noise" is the fraction of segments HDBSCAN couldn't confidently assign to any cluster. Very low noise + 2-3 clusters often means the modality saw the clip as homogeneous; very high noise means the signal isn't strong enough to separate activities. The sweet spot is moderate noise with 3-7 well-defined clusters.

| clip | V-JEPA only | OWLv2 tools | Flow rhythm | Ego-motion | Body pose | Late fusion |
| --- | --- | --- | --- | --- | --- | --- |
| 01_brick_laying_demo | 76% k=2 | 50% k=2 | 38% k=6 | 56% k=4 | 11% k=2 | 49% k=2 |
| 02_production_masonry | 64% k=4 | 34% k=2 | 12% k=7 | 15% k=2 | 14% k=2 | 34% k=2 |
| 03_production_masonry | 56% k=5 | 58% k=8 | 26% k=12 | 52% k=2 | 54% k=6 | 55% k=7 |
| 06_demo | 24% k=3 | 26% k=4 | 36% k=5 | 71% k=3 | 56% k=2 | 0% k=2 |
| 11_prep_demo | 47% k=3 | 62% k=2 | 10% k=2 | 50% k=2 | 60% k=3 | 35% k=2 |
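Each cell in the matrix above pairs a noise percentage with a cluster count. A minimal sketch of how such a cell could be derived from one clip's HDBSCAN labels follows; it relies on the usual HDBSCAN convention that noise points are labelled -1, and the helper names are hypothetical.

```python
import numpy as np

def cluster_quality(labels, noise_label=-1):
    """Return (noise fraction, number of clusters) for one label vector."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0, 0
    noise_fraction = float((labels == noise_label).mean())
    k = int(np.unique(labels[labels != noise_label]).size)
    return noise_fraction, k

def format_cell(labels):
    """Render the pair in the same 'NN% k=K' form the matrix uses."""
    noise_fraction, k = cluster_quality(labels)
    return f"{round(100 * noise_fraction)}% k={k}"
```

The sketch only reports the two numbers. Where "very low" or "very high" noise begins is a judgment the page makes by eye, so no threshold is encoded here.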

01_brick_laying_demo

Each strip below shows how a different modality labelled the segments of this clip on the same time axis. Black = noise (no cluster assigned). Same color → same cluster within that strip (colors are not meaningful across strips).

01_brick_laying_demo per-modality timelines
V-JEPA only
k=2 · noise 76% / 117 segs
  • C0: 6.0s seg
  • C1: 5.9s seg
OWLv2 tools
k=2 · noise 50% / 117 segs
  • C0: rubber mallet
  • C1: gloved hand
Flow rhythm
k=6 · noise 38% / 117 segs
  • C0: period 3.9s
  • C1: period 11.6s
  • C2: period 11.6s
  • C3: period 5.8s
Ego-motion
k=4 · noise 56% / 117 segs
  • C0: cam 7.68
  • C1: cam 10.12
  • C2: cam 11.04
  • C3: cam 10.79
Body pose
k=2 · noise 11% / 117 segs
  • C0: bob 0.0s
  • C1: bob 1.9s
Late fusion
k=2 · noise 49% / 117 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 40 / 117 segments (34%). Per-segment table at /experiments/movement_clustering/repnet_consensus/01_brick_laying_demo/rep_table.json.
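The consensus check behind these counts can be illustrated with a small sketch: estimate a period from each signal's autocorrelation, then accept it only if the two estimates agree within 30%. Everything here is a hedged stand-in for the page's pipeline. The function names are hypothetical, and "within 30%" is interpreted as relative difference against the larger of the two periods, which is an assumption.

```python
import numpy as np

def autocorr_period(signal, fps, min_lag=1):
    """Period in seconds at the strongest autocorrelation peak, or None."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    if x.size < 3 or not np.any(x):
        return None  # too short or constant: no periodicity to measure
    ac = np.correlate(x, x, mode="full")[x.size - 1:]  # lags 0 .. n-1
    ac = ac / ac[0]
    lags = np.arange(1, x.size - 1)
    # A lag is a peak if it beats its left neighbour and is not below its right.
    is_peak = (ac[1:-1] > ac[:-2]) & (ac[1:-1] >= ac[2:])
    candidates = lags[is_peak & (lags >= min_lag)]
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(ac[candidates])]
    return float(best) / fps

def periods_agree(p_a, p_b, tol=0.30):
    """True when both periods exist and differ by at most tol (relative)."""
    if p_a is None or p_b is None or p_a <= 0 or p_b <= 0:
        return False
    return abs(p_a - p_b) / max(p_a, p_b) <= tol
```

Under this reading, 2.0 s and 2.5 s agree (20% apart), while 2.0 s and 4.0 s do not, so a modality that locks onto a harmonic of the true period is rejected rather than averaged in.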

DINOv2 self-similarity matrix

01_brick_laying_demo SSM
Bright square blocks on the diagonal = distinct activity periods. Bright off-diagonal stripes at lag k = repetitive motion at period k/fps.
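To make the stripe reading concrete, here is a rough sketch of building a self-similarity matrix from per-frame embeddings and turning its off-diagonal brightness into a period. It assumes cosine similarity and a simple mean over each diagonal; the function names and the `tol` rule for preferring the shortest near-maximal lag are illustrative choices, not the page's method.

```python
import numpy as np

def self_similarity(embeddings):
    """Cosine self-similarity matrix of per-frame embeddings (n_frames, d)."""
    e = np.asarray(embeddings, dtype=float)
    norm = np.linalg.norm(e, axis=1, keepdims=True)
    unit = e / np.where(norm > 0, norm, 1.0)
    return unit @ unit.T

def lag_profile(ssm):
    """Mean similarity along each diagonal, indexed by lag in frames."""
    n = ssm.shape[0]
    return np.array([np.diagonal(ssm, offset=lag).mean() for lag in range(n)])

def stripe_period(ssm, fps, min_lag=2, tol=0.05):
    """Period in seconds of the first near-maximal stripe, or None."""
    profile = lag_profile(ssm)
    inner = profile[1:-1]
    peaks = np.flatnonzero((inner > profile[:-2]) & (inner >= profile[2:])) + 1
    peaks = peaks[peaks >= min_lag]
    if peaks.size == 0:
        return None
    # Prefer the shortest lag among near-equal peaks so that stripes at
    # 2x or 3x the true period are not picked over the fundamental.
    strong = peaks[profile[peaks] >= profile[peaks].max() - tol]
    return float(strong[0]) / fps
```

With embeddings that trace a circle every 8 frames, the profile peaks at lags 8, 16, 24 and so on, and the sketch returns 8/fps seconds.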

Flow-rhythm time series

01_brick_laying_demo flow rhythm
Top: per-frame flow magnitude. Middle: dominant period from a 12-s sliding FFT. Bottom: periodicity strength (peak / total power).
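The three panels correspond to a windowed spectral summary that can be sketched as follows. The 12-s window and the "peak / total power" strength come from the caption; the function names, the 1-s hop and the decision to drop the DC bin are assumptions made for this illustration.

```python
import numpy as np

def window_rhythm(flow_mag, fps):
    """Return (period_s, strength, mean_flow) for one window of flow magnitude."""
    x = np.asarray(flow_mag, dtype=float)
    mean_flow = float(x.mean())
    power = np.abs(np.fft.rfft(x - mean_flow)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fps)
    power, freqs = power[1:], freqs[1:]  # drop the DC bin
    total = power.sum()
    if total <= 0:
        return None, 0.0, mean_flow  # flat window: no dominant period
    peak = int(np.argmax(power))
    # Strength is the share of spectral power in the dominant bin.
    return float(1.0 / freqs[peak]), float(power[peak] / total), mean_flow

def sliding_rhythm(flow_mag, fps, window_s=12.0, hop_s=1.0):
    """Apply window_rhythm over a sliding window; hop_s is an assumed value."""
    x = np.asarray(flow_mag, dtype=float)
    win = int(round(window_s * fps))
    hop = max(1, int(round(hop_s * fps)))
    return [window_rhythm(x[s:s + win], fps)
            for s in range(0, x.size - win + 1, hop)]
```

One design consequence worth noting: the lowest non-DC bin of a 12-s window corresponds to a period of about the window length itself. Periods reported near that limit, such as the 11.6 s values in the cluster summaries, may therefore reflect slow drift more than true repetition; that reading is an inference, not something the page states.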

02_production_masonry

02_production_masonry per-modality timelines
V-JEPA only
k=4 · noise 64% / 191 segs
  • C0: 5.5s seg
  • C1: 4.6s seg
  • C2: 5.2s seg
  • C3: 5.6s seg
OWLv2 tools
k=2 · noise 34% / 191 segs
  • C0: mason trowel
  • C1: safety vest
Flow rhythm
k=7 · noise 12% / 191 segs
  • C0: period 11.6s
  • C1: period 3.9s
  • C2: period 11.6s
  • C3: period 11.6s
Ego-motion
k=2 · noise 15% / 191 segs
  • C0: cam 4.80
  • C1: cam 8.62
Body pose
k=2 · noise 14% / 191 segs
  • C0: bob 0.0s
  • C1: bob 1.8s
Late fusion
k=2 · noise 34% / 191 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 67 / 191 segments (35%). Per-segment table at /experiments/movement_clustering/repnet_consensus/02_production_masonry/rep_table.json.

DINOv2 self-similarity matrix

02_production_masonry SSM

Flow-rhythm time series

02_production_masonry flow rhythm

03_production_masonry

03_production_masonry per-modality timelines
V-JEPA only
k=5 · noise 56% / 187 segs
  • C0: 4.5s seg
  • C1: 8.0s seg
  • C2: 7.2s seg
  • C3: 5.9s seg
OWLv2 tools
k=8 · noise 58% / 187 segs
  • C0: rubber mallet
  • C1: concrete block
  • C2: rubber mallet
  • C3: rebar
Flow rhythm
k=12 · noise 26% / 187 segs
  • C0: period 5.8s
  • C1: period 11.6s
  • C2: period 11.6s
  • C3: period 11.6s
Ego-motion
k=2 · noise 52% / 187 segs
  • C0: cam 2.71
  • C1: cam 7.98
Body pose
k=6 · noise 54% / 187 segs
  • C0: bob 2.2s
  • C1: bob 0.0s
  • C2: bob 0.0s
  • C3: bob 0.0s
Late fusion
k=7 · noise 55% / 187 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 68 / 187 segments (36%). Per-segment table at /experiments/movement_clustering/repnet_consensus/03_production_masonry/rep_table.json.

DINOv2 self-similarity matrix

03_production_masonry SSM

Flow-rhythm time series

03_production_masonry flow rhythm

06_demo

06_demo per-modality timelines
V-JEPA only
k=3 · noise 24% / 55 segs
  • C0: 5.5s seg
  • C1: 6.3s seg
  • C2: 4.7s seg
OWLv2 tools
k=4 · noise 26% / 55 segs
  • C0: mason trowel
  • C1: mortar bucket
  • C2: rebar
  • C3: concrete block
Flow rhythm
k=5 · noise 36% / 55 segs
  • C0: period 11.6s
  • C1: period 11.6s
  • C2: period 11.6s
  • C3: period 3.4s
Ego-motion
k=3 · noise 71% / 55 segs
  • C0: cam 2.17
  • C1: cam 1.81
  • C2: cam 0.99
Body pose
k=2 · noise 56% / 55 segs
  • C0: bob 0.3s
  • C1: bob 0.9s
Late fusion
k=2 · noise 0% / 55 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 24 / 55 segments (44%). Per-segment table at /experiments/movement_clustering/repnet_consensus/06_demo/rep_table.json.

DINOv2 self-similarity matrix

06_demo SSM

Flow-rhythm time series

06_demo flow rhythm

11_prep_demo

11_prep_demo per-modality timelines
V-JEPA only
k=3 · noise 47% / 60 segs
  • C0: 6.2s seg
  • C1: 7.3s seg
  • C2: 7.2s seg
OWLv2 tools
k=2 · noise 62% / 60 segs
  • C0: level
  • C1: bare hand
Flow rhythm
k=2 · noise 10% / 60 segs
  • C0: period 11.6s
  • C1: period 2.9s
Ego-motion
k=2 · noise 50% / 60 segs
  • C0: cam 12.08
  • C1: cam 12.18
Body pose
k=3 · noise 60% / 60 segs
  • C0: bob 0.0s
  • C1: bob 0.0s
  • C2: bob 0.0s
Late fusion
k=2 · noise 35% / 60 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 18 / 60 segments (30%). Per-segment table at /experiments/movement_clustering/repnet_consensus/11_prep_demo/rep_table.json.

DINOv2 self-similarity matrix

11_prep_demo SSM

Flow-rhythm time series

11_prep_demo flow rhythm

Take-aways

  1. The hand is the wrong primitive. MediaPipe fires on at most 21% of source frames, and even the WiLoR-anchored CSRT tracker tops out around 47% coverage. Asking "what's the hand doing?" is the wrong question when the hand isn't there.
  2. OWLv2 tool-presence yields the most interpretable clusters. On 06_demo, the top-tools per cluster are mason trowel / mortar bucket / rebar / concrete block — the actual sub-activities. No keypoint required.
  3. Late fusion of all five modalities wins on the cleanest clip: 0% noise on 06_demo, with 2 confident activity blocks aligned to the visible task transitions. On the harder production-masonry clips it does not beat the best single modality: fusion leaves 34% and 55% of segments as noise on 02 and 03, against 12% and 26% for flow rhythm.
  4. Body-pose covers ~15% of frames — the helmet camera looks down at work, not at the worker's body. Pose helps fusion but doesn't carry it.
  5. DINOv2-SSM × flow agreement gives confident rep counts in 30-44% of segments without any keypoint pipeline. This is the cheapest substitute for RepNet on this footage.