Movement clustering

How do you cluster repetitive movements when the hand is barely visible?

MediaPipe HandLandmarker fires on only 3-21% of source frames in this gloved-fisheye construction footage; even the WiLoR-anchored CSRT tracker (now wired into the handcluster demo) tops out around 47%. That means the hand is gone for most of the video — we need different signals if we want to cluster the worker's repetitive tasks.

Hypothesis: each of these modalities encodes something about the activity that's independent of the hand's pixel location. The best ones should produce sensible clusters on the 5 featured demo clips even when the hand is invisible.

Read this page top-to-bottom:

Modalities below — what signal each one carries.
Cross-clip consistency — the one rigorous label-free test: which modalities form clusters that span multiple clips (= real activity) vs. clip-private (= visual fingerprinting).
Per-modality cluster quality matrix — quick eyeball of noise rate × clip × modality.
Per-clip drilldowns — timeline strips, SSM matrices, flow rhythm, top-tools per cluster.
Take-aways at the bottom — the five facts that fall out.

V-JEPA only

Kinetics-pretrained temporal video model. Per-segment 1024-D embedding, HDBSCAN on cosine.

OWLv2 tools

Per-segment open-vocab tool-presence histogram (19 prompts). HDBSCAN on log-normed cosine.
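A minimal sketch of the distance this clustering could run on, assuming "log-normed" means a log1p transform of the per-prompt detection counts followed by L2 normalization (the page doesn't spell this out). The function names are illustrative and not taken from the pipeline.

```python
import math

def log_norm(hist):
    """log1p-compress prompt counts, then scale to unit L2 length.

    This is an assumed reading of 'log-normed', not a confirmed one.
    """
    logged = [math.log1p(c) for c in hist]
    norm = math.sqrt(sum(v * v for v in logged))
    return [v / norm for v in logged] if norm > 0 else logged

def cosine_distance(hist_a, hist_b):
    """1 - cosine similarity between two log-normed tool histograms."""
    a, b = log_norm(hist_a), log_norm(hist_b)
    return 1.0 - sum(x * y for x, y in zip(a, b))
```

The log step keeps a tool detected 50 times in a segment from swamping one detected 5 times. A segment with no detections at all ends up at distance 1.0 from everything under this sketch.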

Flow rhythm

Per-frame flow_mag windowed FFT → (period, periodicity strength, mean flow). HDBSCAN.
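A rough sketch of how one window of flow_mag could be turned into the (period, periodicity strength, mean flow) triple. It uses a naive DFT so it stays dependency-free, takes periodicity strength as peak / total power (the definition given in the flow-rhythm caption on this page), and excludes the DC bin, which is an assumption. `rhythm_features` is a hypothetical name.

```python
import cmath
import math

def rhythm_features(flow_mag, fps):
    """Return (period_s, strength, mean_flow) for one window of flow magnitudes."""
    n = len(flow_mag)
    mean_flow = sum(flow_mag) / n
    x = [v - mean_flow for v in flow_mag]  # remove the DC component
    power = []
    for k in range(1, n // 2 + 1):  # positive-frequency bins only
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return None, 0.0, mean_flow  # flat signal: no rhythm to report
    k_peak = max(range(len(power)), key=power.__getitem__) + 1
    period_s = n / (k_peak * fps)        # bin k corresponds to a period of n / k samples
    strength = power[k_peak - 1] / total  # peak / total power
    return period_s, strength, mean_flow
```

With a window of n samples the only periods this can report are n / (k · fps) for integer k, so a fixed window length quantizes the period estimate.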

Ego-motion

RANSAC affine on dense flow → (camera, residual, ratio) summaries. Falls back to flow-derived proxy when the egomotion cache isn't built.

Body pose

MediaPipe Pose 33 keypoints at source FPS → 18-D segment summary (shoulder span, elbow angles, body-bob period).
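Two of the ingredients named here, shoulder span and elbow angle, can be sketched per frame as below. The landmark indices follow MediaPipe Pose's numbering (11/12 shoulders, 13/14 elbows, 15/16 wrists); the layout of the full 18-D summary isn't given on this page, so the function names and the return shape are illustrative only.

```python
import math

# MediaPipe Pose landmark indices (left/right shoulder, elbow, wrist).
L_SHOULDER, R_SHOULDER = 11, 12
L_ELBOW, R_ELBOW = 13, 14
L_WRIST, R_WRIST = 15, 16

def joint_angle_deg(a, b, c):
    """Angle at point b between rays b->a and b->c, in degrees (None if degenerate)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def pose_frame_features(keypoints):
    """keypoints: sequence of 33 (x, y) pairs for one frame."""
    span = math.dist(keypoints[L_SHOULDER], keypoints[R_SHOULDER])
    left = joint_angle_deg(keypoints[L_SHOULDER], keypoints[L_ELBOW], keypoints[L_WRIST])
    right = joint_angle_deg(keypoints[R_SHOULDER], keypoints[R_ELBOW], keypoints[R_WRIST])
    return span, left, right
```

A segment summary would then aggregate these per-frame values (for example mean and spread) over the frames where pose was detected.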

Late fusion

Per-modality z-score, L2-normalize, weighted concat, HDBSCAN on cosine. Combines all five above.
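The fusion recipe above, up to the point where HDBSCAN takes over, can be sketched as follows. This assumes the z-score is taken per feature column across segments and the L2 normalization per segment row. The weights are whatever the caller supplies, since the page doesn't list the ones used, and `late_fuse` is a hypothetical name.

```python
import math

def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A constant column has std 0; fall back to 1.0 so it maps to all zeros.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def l2_normalize(vec):
    """Scale one segment's feature vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def late_fuse(modalities, weights):
    """modalities: dict name -> per-segment feature rows, same segment order in each."""
    blocks = {name: [l2_normalize(r) for r in zscore_columns(rows)]
              for name, rows in modalities.items()}
    n_segments = len(next(iter(blocks.values())))
    fused = []
    for i in range(n_segments):
        row = []
        for name in modalities:  # weighted concat, one block per modality
            row.extend(weights[name] * v for v in blocks[name][i])
        fused.append(row)
    return fused
```

Normalizing each block to unit length before weighting means a 1024-D embedding and a 3-D rhythm summary contribute according to their weights, not their dimensionality.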

DINOv2-SSM × flow agreement

Cross-modal periodicity consensus (the RepNet stand-in). For each segment we estimate a rep period from (a) DINOv2 self-similarity off-diagonal autocorrelation and (b) flow-magnitude autocorrelation, and accept the period only when both modalities agree within 30%.
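A sketch of the agreement gate, under stated assumptions: the period is read off the strongest local peak of a normalized autocorrelation, "within 30%" is measured relative to the larger of the two periods, and the accepted period is their mean. None of these choices is confirmed by the page, and both function names are hypothetical.

```python
def autocorr_period(signal, fps):
    """Period in seconds from the strongest local autocorrelation peak, or None."""
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]
    energy = sum(v * v for v in x)
    if energy == 0:
        return None
    acf = [sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
           for lag in range(n // 2 + 1)]
    # Local maxima only, so the trivial high values next to lag 0 are ignored.
    peaks = [lag for lag in range(1, len(acf) - 1)
             if acf[lag - 1] < acf[lag] >= acf[lag + 1]]
    if not peaks:
        return None
    return max(peaks, key=lambda lag: acf[lag]) / fps

def consensus_period(period_a, period_b, tolerance=0.30):
    """Accept a rep period only when both estimates agree within the tolerance."""
    if period_a is None or period_b is None:
        return None
    if abs(period_a - period_b) <= tolerance * max(period_a, period_b):
        return 0.5 * (period_a + period_b)
    return None
```

Here `autocorr_period` would be applied once to the per-frame flow magnitude and once to a 1-D periodicity profile derived from the DINOv2 self-similarity matrix; a segment counts as "agreed" when `consensus_period` returns a value.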

Cross-clip consistency · the only label-free quality test

Per-clip noise rates and cluster counts can't tell us whether a modality is doing real activity-clustering or just visual-fingerprinting each clip's own lighting and scene. To get at that, we pool all 610 segments from the 5 featured clips into one matrix per modality and run HDBSCAN once. A modality whose clusters MIX segments from multiple clips is finding shared activity patterns; one whose clusters are clip-pure is just fingerprinting visual style.

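Cross-clip mixing entropy per cluster = $H(p_{\text{clip}}) / \log(n_{\text{clips}})$, where $p_{\text{clip}}$ is the distribution of the cluster's segments over source clips and $n_{\text{clips}}$ is the number of distinct clips the cluster touches. A score of 0 means the cluster sits entirely in one clip; 1 means it splits uniformly across every clip it touches. The modality score below is the size-weighted mean across its clusters, reported alongside the fraction of clusters that contain segments from at least 2 different clips.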
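The normalized entropy and the two modality-level numbers can be sketched as below. Single-clip clusters are set to 0 explicitly (the ratio would otherwise be 0 / log 1), matching the "0 means one clip" reading above. Excluding noise segments from the weighting is an assumption, and the function names are illustrative.

```python
import math
from collections import Counter

def mixing_entropy(clip_ids):
    """Normalized clip entropy of one cluster; clip_ids has one entry per segment."""
    counts = Counter(clip_ids)
    n_unique = len(counts)
    if n_unique < 2:
        return 0.0  # cluster sits entirely in one clip
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_unique)

def modality_score(clusters):
    """clusters: list of clip-id lists, one per cluster (noise segments left out)."""
    sizes = [len(c) for c in clusters]
    entropies = [mixing_entropy(c) for c in clusters]
    mean_entropy = sum(s * e for s, e in zip(sizes, entropies)) / sum(sizes)
    multi_clip_fraction = sum(len(set(c)) >= 2 for c in clusters) / len(clusters)
    return mean_entropy, multi_clip_fraction
```

Because the mean is size-weighted, one large well-mixed cluster can produce a high mean entropy even when most clusters are clip-private, which is why the two numbers are worth reading together.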

Ego-motion

mean entropy 0.927 · multi-clip clusters 100% · k=6 · noise 324/610

Each row is one cross-clip cluster (top 6 by entropy). Each column is a source clip. The thumbnail is the medoid segment of that cluster from that clip. Empty cells = the cluster doesn't contain segments from that clip.

Flow rhythm

mean entropy 0.877 · multi-clip clusters 100% · k=21 · noise 201/610

Body pose

mean entropy 0.873 · multi-clip clusters 100% · k=3 · noise 38/610

V-JEPA only

mean entropy 0.869 · multi-clip clusters 25% · k=4 · noise 19/610

OWLv2 tools

mean entropy 0.789 · multi-clip clusters 50% · k=4 · noise 81/610

Read of the ranking: body pose, flow rhythm and ego-motion all hit 100% multi-clip clusters, so every cluster they form draws segments from at least two clips. V-JEPA scores a high mean entropy (0.869), but only 25% of its clusters (1 of 4) are multi-clip; because the mean is size-weighted, that score is carried by the one shared cluster, and the remaining clusters are clip-private, meaning that wherever V-JEPA splits the data it mostly fingerprints scene appearance. OWLv2 sits in the middle on multi-clip fraction (50%) but has the lowest mean entropy (0.789): one big shared "construction work" cluster plus clip-specific tool niches.

Per-modality cluster quality (lower noise % = more confident; k = cluster count)

"Noise" is the fraction of segments HDBSCAN couldn't confidently assign to any cluster. Very low noise + 2-3 clusters often means the modality saw the clip as homogeneous; very high noise means the signal isn't strong enough to separate activities. The sweet spot is moderate noise with 3-7 well-defined clusters.

clip	V-JEPA only	OWLv2 tools	Flow rhythm	Ego-motion	Body pose	Late fusion
01_brick_laying_demo	76% k=2	50% k=2	38% k=6	56% k=4	11% k=2	49% k=2
02_production_masonry	64% k=4	34% k=2	12% k=7	15% k=2	14% k=2	34% k=2
03_production_masonry	56% k=5	58% k=8	26% k=12	52% k=2	54% k=6	55% k=7
06_demo	24% k=3	26% k=4	36% k=5	71% k=3	56% k=2	0% k=2
11_prep_demo	47% k=3	62% k=2	10% k=2	50% k=2	60% k=3	35% k=2
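Each cell in the table above pairs a noise rate with a cluster count. Given one modality's HDBSCAN labels for one clip, both can be derived as in this small sketch, which relies on the HDBSCAN convention that noise points are labelled -1; `cluster_summary` is a hypothetical helper name.

```python
def cluster_summary(labels):
    """Return (k, noise_rate) for a list of HDBSCAN labels, where -1 marks noise."""
    n_noise = sum(1 for label in labels if label == -1)
    k = len({label for label in labels if label != -1})
    return k, n_noise / len(labels)

def format_cell(labels):
    """Render a table cell in the 'NN% k=K' form used above."""
    k, noise_rate = cluster_summary(labels)
    return f"{round(100 * noise_rate)}% k={k}"
```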

01_brick_laying_demo

Each strip below shows how a different modality labelled the segments of this clip on the same time axis. Black = noise (no cluster assigned). Same color → same cluster within that strip (colors are not meaningful across strips).

V-JEPA only

k=2 · noise 76% / 117 segs

C0: 6.0s seg
C1: 5.9s seg

OWLv2 tools

k=2 · noise 50% / 117 segs

C0: rubber mallet
C1: gloved hand

Flow rhythm

k=6 · noise 38% / 117 segs

C0: period 3.9s
C1: period 11.6s
C2: period 11.6s
C3: period 5.8s

Ego-motion

k=4 · noise 56% / 117 segs

C0: cam 7.68
C1: cam 10.12
C2: cam 11.04
C3: cam 10.79

Body pose

k=2 · noise 11% / 117 segs

C0: bob 0.0s
C1: bob 1.9s

Late fusion

k=2 · noise 49% / 117 segs

Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 40 / 117 segments (34%). Per-segment table at /experiments/movement_clustering/repnet_consensus/01_brick_laying_demo/rep_table.json.

DINOv2 self-similarity matrix

Bright square blocks on the diagonal = distinct activity periods. Bright off-diagonal stripes at lag k = repetitive motion at period k/fps.
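To make the stripe reading concrete, here is a minimal sketch that builds a cosine self-similarity matrix from per-frame embeddings and averages each off-diagonal. A peak in the resulting profile at lag k corresponds to a bright stripe at that lag. It assumes plain cosine similarity between frame embeddings; the function names are illustrative.

```python
import math

def cosine_ssm(embeddings):
    """Frame-by-frame cosine self-similarity matrix."""
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(v * v for v in e)) or 1.0
        unit.append([v / norm for v in e])
    return [[sum(x * y for x, y in zip(a, b)) for b in unit] for a in unit]

def lag_profile(ssm):
    """Mean similarity along each off-diagonal; entry k averages ssm[i][i + k]."""
    n = len(ssm)
    return [sum(ssm[i][i + k] for i in range(n - k)) / (n - k) for k in range(n)]
```

For a motion that alternates between two poses every frame, the profile is high at even lags and low at odd ones, so the first non-zero peak sits at lag 2, a period of 2/fps seconds.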

Flow-rhythm time series

Top: per-frame flow magnitude. Middle: dominant period from a 12-s sliding FFT. Bottom: periodicity strength (peak / total power).

02_production_masonry

V-JEPA only

k=4 · noise 64% / 191 segs

C0: 5.5s seg
C1: 4.6s seg
C2: 5.2s seg
C3: 5.6s seg

OWLv2 tools

k=2 · noise 34% / 191 segs

C0: mason trowel
C1: safety vest

Flow rhythm

k=7 · noise 12% / 191 segs

C0: period 11.6s
C1: period 3.9s
C2: period 11.6s
C3: period 11.6s

Ego-motion

k=2 · noise 15% / 191 segs

C0: cam 4.80
C1: cam 8.62

Body pose

k=2 · noise 14% / 191 segs

C0: bob 0.0s
C1: bob 1.8s

Late fusion

k=2 · noise 34% / 191 segs

Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 67 / 191 segments (35%). Per-segment table at /experiments/movement_clustering/repnet_consensus/02_production_masonry/rep_table.json.

DINOv2 self-similarity matrix

Bright square blocks on the diagonal = distinct activity periods. Bright off-diagonal stripes at lag k = repetitive motion at period k/fps.

Flow-rhythm time series

Top: per-frame flow magnitude. Middle: dominant period from a 12-s sliding FFT. Bottom: periodicity strength (peak / total power).

03_production_masonry

V-JEPA only

k=5 · noise 56% / 187 segs

C0: 4.5s seg
C1: 8.0s seg
C2: 7.2s seg
C3: 5.9s seg

OWLv2 tools

k=8 · noise 58% / 187 segs

C0: rubber mallet
C1: concrete block
C2: rubber mallet
C3: rebar

Flow rhythm

k=12 · noise 26% / 187 segs

C0: period 5.8s
C1: period 11.6s
C2: period 11.6s
C3: period 11.6s

Ego-motion

k=2 · noise 52% / 187 segs

C0: cam 2.71
C1: cam 7.98

Body pose

k=6 · noise 54% / 187 segs

C0: bob 2.2s
C1: bob 0.0s
C2: bob 0.0s
C3: bob 0.0s

Late fusion

k=7 · noise 55% / 187 segs

Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 68 / 187 segments (36%). Per-segment table at /experiments/movement_clustering/repnet_consensus/03_production_masonry/rep_table.json.

DINOv2 self-similarity matrix

Bright square blocks on the diagonal = distinct activity periods. Bright off-diagonal stripes at lag k = repetitive motion at period k/fps.

Flow-rhythm time series

Top: per-frame flow magnitude. Middle: dominant period from a 12-s sliding FFT. Bottom: periodicity strength (peak / total power).

06_demo

V-JEPA only

k=3 · noise 24% / 55 segs

C0: 5.5s seg
C1: 6.3s seg
C2: 4.7s seg

OWLv2 tools

k=4 · noise 26% / 55 segs

C0: mason trowel
C1: mortar bucket
C2: rebar
C3: concrete block

Flow rhythm

k=5 · noise 36% / 55 segs

C0: period 11.6s
C1: period 11.6s
C2: period 11.6s
C3: period 3.4s

Ego-motion

k=3 · noise 71% / 55 segs

C0: cam 2.17
C1: cam 1.81
C2: cam 0.99

Body pose

k=2 · noise 56% / 55 segs

C0: bob 0.3s
C1: bob 0.9s

Late fusion

k=2 · noise 0% / 55 segs

Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 24 / 55 segments (44%). Per-segment table at /experiments/movement_clustering/repnet_consensus/06_demo/rep_table.json.

DINOv2 self-similarity matrix

Bright square blocks on the diagonal = distinct activity periods. Bright off-diagonal stripes at lag k = repetitive motion at period k/fps.

Flow-rhythm time series

Top: per-frame flow magnitude. Middle: dominant period from a 12-s sliding FFT. Bottom: periodicity strength (peak / total power).

11_prep_demo

V-JEPA only

k=3 · noise 47% / 60 segs

C0: 6.2s seg
C1: 7.3s seg
C2: 7.2s seg

OWLv2 tools

k=2 · noise 62% / 60 segs

C0: level
C1: bare hand

Flow rhythm

k=2 · noise 10% / 60 segs

C0: period 11.6s
C1: period 2.9s

Ego-motion

k=2 · noise 50% / 60 segs

C0: cam 12.08
C1: cam 12.18

Body pose

k=3 · noise 60% / 60 segs

C0: bob 0.0s
C1: bob 0.0s
C2: bob 0.0s

Late fusion

k=2 · noise 35% / 60 segs

Cross-modal rep counting: DINOv2 SSM and flow autocorrelation agreed on a rep period in 18 / 60 segments (30%). Per-segment table at /experiments/movement_clustering/repnet_consensus/11_prep_demo/rep_table.json.

DINOv2 self-similarity matrix

Bright square blocks on the diagonal = distinct activity periods. Bright off-diagonal stripes at lag k = repetitive motion at period k/fps.

Flow-rhythm time series

Top: per-frame flow magnitude. Middle: dominant period from a 12-s sliding FFT. Bottom: periodicity strength (peak / total power).

Take-aways

The hand is the wrong primitive. MediaPipe + WiLoR-CSRT max out at 47% coverage on the cleanest demo clip and 21% on the worst. Asking "what's the hand doing?" is the wrong question when the hand isn't there.
OWLv2 tool-presence yields the most interpretable clusters. On 06_demo, the top-tools per cluster are mason trowel / mortar bucket / rebar / concrete block — the actual sub-activities. No keypoint required.
Late fusion of all five modalities wins on the cleanest clip. 0% noise on 06_demo with 2 confident activity blocks aligned to the visible task transitions. On the harder production-masonry clips it matches but doesn't beat the best single modality.
Body-pose covers ~15% of frames — the helmet camera looks down at work, not at the worker's body. Pose helps fusion but doesn't carry it.
DINOv2-SSM × flow agreement gives confident rep counts in 30-44% of segments without any keypoint pipeline. This is the cheapest substitute for RepNet on this footage.