How do you cluster repetitive movements when the hand is barely visible?
MediaPipe HandLandmarker fires on only 3-21% of source frames in this
gloved-fisheye construction footage; even the WiLoR-anchored CSRT
tracker (now wired into the handcluster demo) tops out around 47%.
That means the hand is gone for most of the video, so we need
different signals if we want to cluster the worker's repetitive
tasks.
Hypothesis: each of the modalities listed below encodes something
about the activity that's independent of the hand's pixel location.
The best ones should produce sensible clusters on the 5 featured demo
clips even when the hand is invisible.
Read this page top-to-bottom:
- Modalities below — what signal each one carries.
- Cross-clip consistency — the one rigorous label-free test: which modalities form clusters that span multiple clips (= real activity) vs. clip-private (= visual fingerprinting).
- Per-modality cluster quality matrix — quick eyeball of noise rate × clip × modality.
- Per-clip drilldowns — timeline strips, SSM matrices, flow rhythm, top-tools per cluster.
- Take-aways at the bottom — the five facts that fall out.
V-JEPA only
Kinetics-pretrained temporal video model. Per-segment 1024-D embedding, HDBSCAN on cosine.
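Several modalities on this page are clustered with HDBSCAN on cosine distance. The sketch below shows only the distance step, assuming NumPy; the function name is hypothetical and random vectors stand in for the per-segment 1024-D embeddings. A matrix like this is the kind of input that HDBSCAN implementations (the `hdbscan` package, or scikit-learn's `HDBSCAN`) accept with `metric="precomputed"`.

```python
import numpy as np

def cosine_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sim = np.clip(unit @ unit.T, -1.0, 1.0)           # trim floating-point overshoot
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)                       # a segment is at distance 0 from itself
    return dist

# Hypothetical stand-in for 8 per-segment embeddings of width 1024.
rng = np.random.default_rng(0)
segments = rng.normal(size=(8, 1024))
D = cosine_distance_matrix(segments)
```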
OWLv2 tools
Per-segment open-vocab tool-presence histogram (19 prompts). HDBSCAN on log-normed cosine.
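One possible reading of "log-normed" is sketched below: count detections per prompt, compress the counts with `log1p`, and L2-normalize so that a dot product equals cosine similarity. This is an assumption about the preprocessing, not the page's confirmed recipe; the function name and the integer prompt ids are hypothetical.

```python
import numpy as np

N_PROMPTS = 19  # the page states 19 prompts; their order here is arbitrary

def tool_histogram(prompt_ids, n_prompts=N_PROMPTS):
    """Count detections per prompt, log-compress, then L2-normalize."""
    counts = np.bincount(np.asarray(prompt_ids, dtype=int), minlength=n_prompts)
    compressed = np.log1p(counts.astype(float))   # damp prompts that fire on every frame
    norm = np.linalg.norm(compressed)
    return compressed / norm if norm > 0 else compressed

# Two hypothetical segments: same two tools, very different raw counts.
seg_a = tool_histogram([0] * 40 + [3] * 2)
seg_b = tool_histogram([0] * 4 + [3] * 2)
cos_sim = float(seg_a @ seg_b)   # cosine similarity of the unit-length histograms
```

The log compression keeps a tool that is detected in almost every frame from swamping rarer tools that may be more informative about the sub-activity.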
Flow rhythm
Per-frame flow_mag windowed FFT → (period, periodicity strength, mean flow). HDBSCAN.
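A minimal sketch of the three flow-rhythm features for one window, assuming NumPy. The strength is peak power over total power, matching the definition used in the per-clip plots; the Hann taper, the function name and the synthetic signal are assumptions made for illustration.

```python
import numpy as np

def rhythm_features(flow_mag, fps):
    """Return (period_s, strength, mean_flow) for one window of flow magnitude."""
    x = np.asarray(flow_mag, dtype=float)
    mean_flow = float(x.mean())
    x = (x - mean_flow) * np.hanning(len(x))       # remove DC, taper the window edges
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power, freqs = power[1:], freqs[1:]            # skip the zero-frequency bin
    total = power.sum()
    if total <= 0:
        return float("inf"), 0.0, mean_flow        # flat signal: no rhythm to report
    peak = int(np.argmax(power))
    return float(1.0 / freqs[peak]), float(power[peak] / total), mean_flow

fps = 30.0
t = np.arange(int(12 * fps)) / fps                 # one 12-second window
signal = 2.0 + np.sin(2 * np.pi * t / 3.0)         # synthetic 3-second rhythm
period, strength, mean_flow = rhythm_features(signal, fps)
```

With a 12-second window the frequency resolution is 1/12 Hz, so the recoverable periods are 12 s divided by a whole number (12, 6, 4, 3, ...).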
Ego-motion
RANSAC affine on dense flow → (camera, residual, ratio) summaries. Falls back to flow-derived proxy when the egomotion cache isn't built.
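The sketch below illustrates the idea with a small NumPy-only RANSAC loop. It assumes that "camera" is the mean magnitude of the affine-predicted flow, "residual" is the mean magnitude of what the affine model leaves unexplained, and "ratio" is residual over camera; those definitions, the function names and the thresholds are assumptions. In practice a library routine such as OpenCV's `cv2.estimateAffine2D` with `method=cv2.RANSAC` could replace the hand-written loop.

```python
import numpy as np

def fit_affine(pts, flow):
    """Least-squares affine flow model: flow ~ [x, y, 1] @ A, with A of shape (3, 2)."""
    X = np.hstack([pts, np.ones((len(pts), 1))])
    A, *_ = np.linalg.lstsq(X, flow, rcond=None)
    return A

def ransac_egomotion(pts, flow, iters=200, tol=0.5, seed=0):
    """Return (camera, residual, ratio) from a RANSAC-fitted affine flow field."""
    rng = np.random.default_rng(seed)
    X = np.hstack([pts, np.ones((len(pts), 1))])
    best_inliers = None
    for _ in range(iters):
        sample = rng.choice(len(pts), size=3, replace=False)   # 3 points define an affine map
        A = fit_affine(pts[sample], flow[sample])
        inliers = np.linalg.norm(X @ A - flow, axis=1) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    A = fit_affine(pts[best_inliers], flow[best_inliers])      # refit on the consensus set
    cam_flow = X @ A
    camera = float(np.linalg.norm(cam_flow, axis=1).mean())
    residual = float(np.linalg.norm(flow - cam_flow, axis=1).mean())
    return camera, residual, residual / (camera + 1e-9)

# Synthetic flow on a 20x20 grid: a uniform pan plus one independently moving patch.
ys, xs = np.mgrid[0:20, 0:20]
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
flow = np.tile([3.0, 0.0], (len(pts), 1))
flow[:40] += [0.0, 5.0]
camera, residual, ratio = ransac_egomotion(pts, flow)
```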
Body pose
MediaPipe Pose 33 keypoints at source FPS → 18-D segment summary (shoulder span, elbow angles, body-bob period).
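Two of the per-frame quantities named here, shoulder span and elbow angle, can be sketched from the 33 keypoints as follows. The landmark indices are MediaPipe Pose's standard ones; the function names and the toy keypoints are hypothetical, and how the page aggregates frames into its 18-D segment summary is not shown.

```python
import numpy as np

# MediaPipe Pose landmark indices for the arm joints.
L_SHOULDER, R_SHOULDER = 11, 12
L_ELBOW, R_ELBOW = 13, 14
L_WRIST, R_WRIST = 15, 16

def joint_angle(a, b, c):
    """Angle in degrees at point b, between the segments b->a and b->c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def frame_features(kp):
    """kp: (33, 2) keypoint array -> (shoulder span, left elbow angle, right elbow angle)."""
    span = float(np.linalg.norm(kp[L_SHOULDER] - kp[R_SHOULDER]))
    left = joint_angle(kp[L_SHOULDER], kp[L_ELBOW], kp[L_WRIST])
    right = joint_angle(kp[R_SHOULDER], kp[R_ELBOW], kp[R_WRIST])
    return span, left, right

# Toy frame: left arm bent at a right angle, right arm straight.
kp = np.zeros((33, 2))
kp[L_SHOULDER], kp[L_ELBOW], kp[L_WRIST] = (0.6, 0.3), (0.6, 0.5), (0.8, 0.5)
kp[R_SHOULDER], kp[R_ELBOW], kp[R_WRIST] = (0.4, 0.3), (0.4, 0.5), (0.4, 0.7)
span, left, right = frame_features(kp)
```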
Late fusion
Per-modality z-score, L2-normalize, weighted concat, HDBSCAN on cosine. Combines all five above.
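The fusion recipe can be sketched in a few lines of NumPy. The sketch assumes the z-score is taken per feature across segments and the L2 normalization per segment within each modality; the weights, the function name and the random inputs are hypothetical. The clustering step itself is left out.

```python
import numpy as np

def fuse(modalities, weights):
    """modalities: dict name -> (n_segments, d) array. Returns the weighted concatenation."""
    blocks = []
    for name, feats in modalities.items():
        feats = np.asarray(feats, dtype=float)
        z = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-9)   # z-score each feature
        z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-9)      # unit length per segment
        blocks.append(weights[name] * z)                               # weight sets the modality's pull
    return np.hstack(blocks)

# Hypothetical inputs: 6 segments, a 3-D rhythm summary and an 18-D pose summary.
rng = np.random.default_rng(0)
modalities = {
    "flow_rhythm": rng.normal(size=(6, 3)),
    "body_pose": rng.normal(size=(6, 18)),
}
weights = {"flow_rhythm": 1.0, "body_pose": 0.5}
fused = fuse(modalities, weights)
```

Normalizing each modality to unit length before weighting means a modality's influence on the fused cosine distance is controlled by its weight rather than by its dimensionality.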
DINOv2-SSM × flow agreement
Cross-modal periodicity consensus (the RepNet stand-in). For each
segment we estimate a rep period from (a) DINOv2 self-similarity
off-diagonal autocorrelation and (b) flow-magnitude
autocorrelation, and accept the period only when both modalities
agree within 30%.
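The agreement rule can be sketched as follows, assuming NumPy. Two 1-D curves stand in for the real inputs (the SSM off-diagonal profile and the flow-magnitude series), and "within 30%" is interpreted as a relative difference measured against the larger estimate; that interpretation, the function names and the choice to return the mean of the two periods are assumptions.

```python
import numpy as np

def autocorr_period(signal, fps, min_lag=2):
    """Period in seconds at the strongest autocorrelation peak, or None if there is none."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]    # lags 0 .. n-1
    if ac[0] <= 0:
        return None                                      # constant signal
    ac = ac / ac[0]
    peaks = [k for k in range(max(min_lag, 1), len(ac) - 1)
             if ac[k] > ac[k - 1] and ac[k] >= ac[k + 1]]  # local maxima
    if not peaks:
        return None
    return max(peaks, key=lambda k: ac[k]) / fps

def consensus_period(p_ssm, p_flow, tol=0.30):
    """Accept a period only if both estimates exist and differ by at most tol (relative)."""
    if p_ssm is None or p_flow is None:
        return None
    if abs(p_ssm - p_flow) / max(p_ssm, p_flow) > tol:
        return None
    return 0.5 * (p_ssm + p_flow)

fps = 30.0
t = np.arange(300) / fps
flow_like = np.sin(2 * np.pi * t / 2.0)    # synthetic 2.0 s rhythm
ssm_like = np.sin(2 * np.pi * t / 2.2)     # synthetic 2.2 s rhythm
accepted = consensus_period(autocorr_period(ssm_like, fps), autocorr_period(flow_like, fps))
```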
Cross-clip consistency · the only label-free quality test
Per-clip noise rates and cluster counts can't tell us whether a
modality is doing real activity-clustering or just
visual-fingerprinting each clip's own lighting and scene. To get at
that, we pool all 610 segments from the 5 featured clips into one
matrix per modality and run HDBSCAN once. A modality whose clusters
MIX segments from multiple clips is finding shared activity patterns;
one whose clusters are clip-pure is just fingerprinting visual style.
Cross-clip mixing entropy per cluster = $H(p_{\text{clip}}) / \log(\text{n unique clips})$
— 0 means the cluster sits entirely in one clip, 1 means it splits
uniformly across every clip it touches. The modality score below is
the size-weighted mean across its clusters, plus the fraction of
clusters that contain segments from at least 2 different
clips.
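The two scores can be sketched directly from that definition. The sketch assumes a single-clip cluster is scored 0 (the formula would otherwise divide by log 1 = 0); the function names and the toy clusters are hypothetical.

```python
import math
from collections import Counter

def mixing_entropy(clip_ids):
    """Normalized entropy of the clip distribution inside one cluster, in [0, 1]."""
    counts = Counter(clip_ids)
    if len(counts) < 2:
        return 0.0                      # single-clip cluster: defined as 0
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))    # normalize by the clips the cluster touches

def modality_score(clusters):
    """clusters: one list of clip ids per cluster -> (size-weighted mean entropy, multi-clip fraction)."""
    total = sum(len(c) for c in clusters)
    mean_h = sum(len(c) * mixing_entropy(c) for c in clusters) / total
    multi = sum(len(set(c)) >= 2 for c in clusters) / len(clusters)
    return mean_h, multi

# Toy example: one large, evenly mixed cluster and one small clip-private cluster.
clusters = [["a"] * 10 + ["b"] * 10, ["c"] * 5]
mean_h, multi = modality_score(clusters)
```

Because the mean is size-weighted, one large mixed cluster can keep the entropy high even when most clusters are clip-private, so the two numbers are worth reading together.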
1. Ego-motion
mean entropy 0.927 · multi-clip clusters 100% · k=6 · noise 324/610
Each row is one cross-clip cluster (top 6 by entropy).
Each column is a source clip. The thumbnail is the medoid
segment of that cluster from that clip. Empty cells = the
cluster doesn't contain segments from that clip.
2. Flow rhythm
mean entropy 0.877 · multi-clip clusters 100% · k=21 · noise 201/610
3. Body pose
mean entropy 0.873 · multi-clip clusters 100% · k=3 · noise 38/610
4. V-JEPA only
mean entropy 0.869 · multi-clip clusters 25% · k=4 · noise 19/610
5. OWLv2 tools
mean entropy 0.789 · multi-clip clusters 50% · k=4 · noise 81/610
Read of the ranking: body pose, flow rhythm and ego-motion all hit
100% multi-clip clusters, so every cluster they form contains the
same kind of motion recurring in at least two clips. V-JEPA scores
high entropy (0.87) but only 25% of its clusters are multi-clip; the
rest are clip-private, meaning V-JEPA mostly fingerprints scene
appearance. OWLv2 sits in between on multi-clip share (50%) but has
the lowest mean entropy (0.789): one big shared "construction work"
cluster plus clip-specific tool niches.
Per-modality cluster quality (cells show noise % and k; lower noise % = more confident; k = cluster count)
"Noise" is the fraction of segments HDBSCAN couldn't confidently
assign to any cluster. Very low noise + 2-3 clusters often means the
modality saw the clip as homogeneous; very high noise means the
signal isn't strong enough to separate activities. The sweet spot is
moderate noise with 3-7 well-defined clusters.
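For reference, both numbers in each cell can be read off the label vector that HDBSCAN returns, in which unassigned points carry the label -1. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def cluster_summary(labels):
    """Return (noise fraction, cluster count) from HDBSCAN-style labels; -1 marks noise."""
    labels = list(labels)
    noise = sum(1 for label in labels if label == -1) / len(labels)
    k = len({label for label in labels if label != -1})
    return noise, k

# Eight hypothetical segments: two unassigned, the rest spread over three clusters.
noise, k = cluster_summary([0, 0, 1, -1, 1, -1, 2, 0])
```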
| clip | V-JEPA only | OWLv2 tools | Flow rhythm | Ego-motion | Body pose | Late fusion |
|---|---|---|---|---|---|---|
| 01_brick_laying_demo | 76% · k=2 | 50% · k=2 | 38% · k=6 | 56% · k=4 | 11% · k=2 | 49% · k=2 |
| 02_production_masonry | 64% · k=4 | 34% · k=2 | 12% · k=7 | 15% · k=2 | 14% · k=2 | 34% · k=2 |
| 03_production_masonry | 56% · k=5 | 58% · k=8 | 26% · k=12 | 52% · k=2 | 54% · k=6 | 55% · k=7 |
| 06_demo | 24% · k=3 | 26% · k=4 | 36% · k=5 | 71% · k=3 | 56% · k=2 | 0% · k=2 |
| 11_prep_demo | 47% · k=3 | 62% · k=2 | 10% · k=2 | 50% · k=2 | 60% · k=3 | 35% · k=2 |
01_brick_laying_demo
Each strip below shows how a different modality labelled the
segments of this clip on the same time axis. Black = noise (no
cluster assigned). Same color → same cluster within that
strip (colors are not meaningful across strips).
V-JEPA only
k=2 · noise 76% / 117 segs
OWLv2 tools
k=2 · noise 50% / 117 segs
- C0: rubber mallet
- C1: gloved hand
Flow rhythm
k=6 · noise 38% / 117 segs
- C0: period 3.9s
- C1: period 11.6s
- C2: period 11.6s
- C3: period 5.8s
Ego-motion
k=4 · noise 56% / 117 segs
- C0: cam 7.68
- C1: cam 10.12
- C2: cam 11.04
- C3: cam 10.79
Body pose
k=2 · noise 11% / 117 segs
Late fusion
k=2 · noise 49% / 117 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed
on a rep period in 40 / 117 segments (34%). Per-segment table at
/experiments/movement_clustering/repnet_consensus/01_brick_laying_demo/rep_table.json.
DINOv2 self-similarity matrix
Bright square blocks on the diagonal = distinct
activity periods. Bright off-diagonal stripes at lag k
= repetitive motion at period k/fps.
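How an off-diagonal stripe turns into a period can be sketched with NumPy: build the cosine self-similarity matrix, average each diagonal, and take the lag with the highest mean. The rotating 2-D features below are a synthetic stand-in for per-frame DINOv2 embeddings; the function names, the sampling rate and the lag search range are assumptions.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity matrix of per-frame feature vectors."""
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    return unit @ unit.T

def lag_profile(ssm):
    """Entry k is the mean similarity along the k-th off-diagonal (lag k)."""
    return np.array([np.diagonal(ssm, offset=k).mean() for k in range(len(ssm))])

fps = 2.0                                    # hypothetical feature sampling rate
t = np.arange(60)
feats = np.stack([np.cos(2 * np.pi * t / 10),
                  np.sin(2 * np.pi * t / 10)], axis=1)   # repeats every 10 frames
ssm = self_similarity(feats)
profile = lag_profile(ssm)
best_lag = 1 + int(np.argmax(profile[1:16])) # search lags 1..15 only
period_s = best_lag / fps                    # lag k corresponds to k / fps seconds
```

Restricting the search range matters: a perfectly periodic signal is just as similar to itself at 2x and 3x the true lag, so an unbounded search can land on a multiple of the period.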
Flow-rhythm time series
Top: per-frame flow magnitude. Middle: dominant
period from a 12-s sliding FFT. Bottom: periodicity strength
(peak / total power).
02_production_masonry
V-JEPA only
k=4 · noise 64% / 254 segs
- C0: 5.5s seg
- C1: 4.6s seg
- C2: 5.2s seg
- C3: 5.6s seg
OWLv2 tools
k=2 · noise 34% / 191 segs
- C0: mason trowel
- C1: safety vest
Flow rhythm
k=7 · noise 12% / 191 segs
- C0: period 11.6s
- C1: period 3.9s
- C2: period 11.6s
- C3: period 11.6s
Ego-motion
k=2 · noise 15% / 191 segs
Body pose
k=2 · noise 14% / 191 segs
Late fusion
k=2 · noise 34% / 191 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed
on a rep period in 67 / 191 segments (35%). Per-segment table at
/experiments/movement_clustering/repnet_consensus/02_production_masonry/rep_table.json.
DINOv2 self-similarity matrix
Flow-rhythm time series
03_production_masonry
V-JEPA only
k=5 · noise 56% / 187 segs
- C0: 4.5s seg
- C1: 8.0s seg
- C2: 7.2s seg
- C3: 5.9s seg
OWLv2 tools
k=8 · noise 58% / 187 segs
- C0: rubber mallet
- C1: concrete block
- C2: rubber mallet
- C3: rebar
Flow rhythm
k=12 · noise 26% / 187 segs
- C0: period 5.8s
- C1: period 11.6s
- C2: period 11.6s
- C3: period 11.6s
Ego-motion
k=2 · noise 52% / 187 segs
Body pose
k=6 · noise 54% / 187 segs
- C0: bob 2.2s
- C1: bob 0.0s
- C2: bob 0.0s
- C3: bob 0.0s
Late fusion
k=7 · noise 55% / 187 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed
on a rep period in 68 / 187 segments (36%). Per-segment table at
/experiments/movement_clustering/repnet_consensus/03_production_masonry/rep_table.json.
DINOv2 self-similarity matrix
Flow-rhythm time series
06_demo
V-JEPA only
k=3 · noise 24% / 55 segs
- C0: 5.5s seg
- C1: 6.3s seg
- C2: 4.7s seg
OWLv2 tools
k=4 · noise 26% / 55 segs
- C0: mason trowel
- C1: mortar bucket
- C2: rebar
- C3: concrete block
Flow rhythm
k=5 · noise 36% / 55 segs
- C0: period 11.6s
- C1: period 11.6s
- C2: period 11.6s
- C3: period 3.4s
Ego-motion
k=3 · noise 71% / 55 segs
- C0: cam 2.17
- C1: cam 1.81
- C2: cam 0.99
Body pose
k=2 · noise 56% / 55 segs
Late fusion
k=2 · noise 0% / 55 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed
on a rep period in 24 / 55 segments (44%). Per-segment table at
/experiments/movement_clustering/repnet_consensus/06_demo/rep_table.json.
DINOv2 self-similarity matrix
Flow-rhythm time series
11_prep_demo
V-JEPA only
k=3 · noise 47% / 60 segs
- C0: 6.2s seg
- C1: 7.3s seg
- C2: 7.2s seg
OWLv2 tools
k=2 · noise 62% / 60 segs
Flow rhythm
k=2 · noise 10% / 60 segs
- C0: period 11.6s
- C1: period 2.9s
Ego-motion
k=2 · noise 50% / 60 segs
- C0: cam 12.08
- C1: cam 12.18
Body pose
k=3 · noise 60% / 60 segs
- C0: bob 0.0s
- C1: bob 0.0s
- C2: bob 0.0s
Late fusion
k=2 · noise 35% / 60 segs
Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed
on a rep period in 18 / 60 segments (30%). Per-segment table at
/experiments/movement_clustering/repnet_consensus/11_prep_demo/rep_table.json.
DINOv2 self-similarity matrix
Flow-rhythm time series
Take-aways
- The hand is the wrong primitive. MediaPipe HandLandmarker fires on 3-21% of source frames, and even the WiLoR-anchored CSRT tracker tops out around 47%. Asking "what's the hand doing?" is the wrong question when the hand isn't there.
- OWLv2 tool-presence yields the most interpretable clusters. On 06_demo, the top tools per cluster are mason trowel / mortar bucket / rebar / concrete block, which map onto the actual sub-activities. No keypoint required.
- Late fusion of all five modalities wins on the cleanest clip: 0% noise on 06_demo, with 2 confident activity blocks aligned to the visible task transitions. On the harder production-masonry clips it doesn't beat the best single modality on noise rate (34% and 55% for fusion vs. 12% and 26% for flow rhythm).
- Body-pose covers ~15% of frames — the helmet camera looks down at work, not at the worker's body. Pose helps fusion but doesn't carry it.
- DINOv2-SSM × flow agreement gives confident rep counts in 30-44% of segments without any keypoint pipeline. This is the cheapest substitute for RepNet on this footage.