ironworld

Agentic 3D Scene Navigation

a tool-using language model that answers questions about a video, in 3-D
Jie Kai Tao
jietao@ufl.edu
University of Florida · April 2026
Upload any handheld indoor video. The system reconstructs the room in 3-D, labels what's in it, and lets you ask the scene free-form questions — "where is the mini fridge?", "where would I put my suitcase?", "show me the dirtiest area." A language model walks the 3-D scene through a small set of geometry tools, picks the source-video frame that best shows its candidate, and grounds its answer with a box drawn back onto the real footage. The whole loop streams to the browser so you watch a small figurine move through the reconstructed room and see the source frame appear inline in chat with the box overlaid.

1 · The problem

Recent feed-forward 3-D models — Pi³ [1], VGGT [2] and the broader DUSt3R family [3] — make dense 3-D reconstruction from casual video almost free. A 60-second handheld pan reconstructs in seconds. But what comes out is a point cloud and a few hundred camera poses; you can't ask it questions. Existing 3-D scene-question-answering datasets like ScanQA [4] and 3D-LLM [5] assume a curated mesh with bounding-box annotations, not a raw cloud you reconstructed five minutes ago.

Two adjacent ideas — multimodal language models that ground language in single images, and tool-using agents in the ReAct style [6] — also can't answer the question alone. A single language-model call can't keep a few hundred frames in context, and it can't project a 2-D box back into 3-D. A pure geometry loop without language can't interpret affordance questions like "where would I put a suitcase?"

The contribution is the scaffold: a deterministic spatial tool layer that the language model plans over. The cloud, the visibility checks, the multi-view box-to-3-D segmentation, the best-frame retrieval — all of these are ordinary code with tests. The language model only decides what to inspect next and when the answer is good enough. Geometry stays geometry.

2 · How it works, in plain words

Five stages run end to end, all behind one upload page.

  1. Reconstruct. Frames are sampled from the upload, fed to a feed-forward 3-D model (Pi³ or VGGT), and turned into a single point cloud plus the camera path through it.
  2. Detect. An open-vocabulary 2-D detector (OWLv2 [7]) runs over the source frames with a list of indoor object words (bed, dresser, mini fridge, and ~80 others). The per-frame detections are cached.
  3. Bake the labels into 3-D. Every detection's box is projected into the cloud and used to vote on the labels of the points it covers. Voting across many frames cancels out one-off false positives (a minimal voting sketch follows this list). The result is a per-cluster label inventory.
  4. Add an open-vocabulary fallback. A vision-language image encoder (SigLIP [8]) tags each cluster's best representative crop with an open embedding. That covers questions the fixed detector vocabulary never sees, like "the bathroom" or "the dirtiest area".
  5. Serve. The browser loads the cloud and renders it interactively. The agent loop streams events live, so you see the figurine move and watch evidence views appear as the model decides them.
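To make stage 3 concrete, here is a minimal sketch of the box-to-cloud label voting, assuming each frame carries a 3×3 intrinsics matrix K, a 4×4 world-to-camera pose T_wc, and a list of (label, x0, y0, x1, y1) detections; the names and the simple majority rule are illustrative, not the exact implementation.

```python
import numpy as np
from collections import defaultdict

def vote_labels(points, frames):
    """Accumulate per-point label votes by projecting each frame's 2-D
    detection boxes onto the shared point cloud, then keep the majority
    label. `points` is an (N, 3) array in world coordinates."""
    votes = [defaultdict(int) for _ in range(len(points))]
    homog = np.hstack([points, np.ones((len(points), 1))])      # (N, 4) homogeneous coords

    for frame in frames:
        cam = (frame["T_wc"] @ homog.T).T                        # world -> camera space
        in_front = cam[:, 2] > 1e-6                              # ignore points behind the camera
        uvw = (frame["K"] @ cam[:, :3].T).T                      # pinhole projection
        u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]

        for label, x0, y0, x1, y1 in frame["detections"]:
            inside = in_front & (u >= x0) & (u <= x1) & (v >= y0) & (v <= y1)
            for idx in np.flatnonzero(inside):
                votes[idx][label] += 1                           # one vote per box per frame

    # Majority voting across many frames is what cancels one-off false positives.
    return [max(v, key=v.get) if v else None for v in votes]
```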

3 · The spatial tools

The agent isn't free-styling — it picks one tool at a time from a small, well-tested catalog. Each tool is an ordinary Python function with a JSON schema. The language model sees the catalog and decides which one to call next.

Tool · What it does
scene_overview · Returns the cloud's bounds, camera path, the inventory of labels in it, and a top-down thumbnail.
semantic_search · Finds clusters whose label matches a phrase, optionally filtered by colour or open-vocabulary similarity.
detections_search · Searches the 2-D detection cache for a phrase and returns the frames it appears in.
look_at_region / render_view · Move the figurine, render what it sees from that viewpoint.
best_frame_for_region · Picks the source-video frame that shows a candidate region most clearly.
ground_bbox_with_vlm · Asks a multimodal language model to draw a box around something inside a chosen frame.
segment_points_from_bbox · Turns a 2-D box back into a 3-D set of points, keeping only what is visibly consistent across nearby frames.
spatial_relation_query · "Between", "near", "left of" — answered by geometry, not by guessing.
next_best_view · Suggests the next viewpoint that would most reduce uncertainty about the current question.
color_picker · Mean colour of a region, plus which named colour it is closest to.
finalize_answer · Commits the answer and writes the source frame with the bounding box overlaid.
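To show the shape of a catalog entry, here is a minimal sketch of how a tool like semantic_search might be declared: a plain Python function paired with the JSON schema the planner is shown. The field names, defaults, and docstrings below are hypothetical, not the project's actual schema.

```python
# Minimal sketch of one catalog entry; names and fields are illustrative only.
def semantic_search(query: str, color: str | None = None, top_k: int = 5) -> list[dict]:
    """Return clusters whose baked-in label or open-vocabulary embedding matches `query`."""
    ...  # deterministic lookup over the labelled clusters goes here

SEMANTIC_SEARCH_SCHEMA = {
    "name": "semantic_search",
    "description": "Find clusters whose label matches a phrase.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Phrase to search for, e.g. 'mini fridge'."},
            "color": {"type": "string", "description": "Optional colour filter."},
            "top_k": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}
```

The planner only ever emits a tool name and arguments matching one of these schemas; the geometry itself runs in the deterministic Python function.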

4 · The three pieces that needed real engineering

Picking the right source frame

A 3-D candidate region needs to be shown to the user as a real photo, not a synthesised cloud render. The system scores every source frame against the candidate on four things: how much of the candidate is actually visible from that frame (so we don't pick a frame where it's occluded), how big it appears, how head-on the angle is, and whether the frame itself is sharp. The top-scoring frame becomes the evidence image. Each of these signals appears in prior work such as OpenScene [9]; what this loop needed was combining them into a single objective that picks the one frame to show as the grounded answer.
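As a rough illustration, the four signals might be folded into one score like the sketch below; the candidate fields and the weights are assumptions for illustration, not the system's tuned values.

```python
def best_frame_for_region(candidates, weights=(0.4, 0.25, 0.2, 0.15)):
    """Pick the source frame that shows the candidate region most clearly.
    Each candidate is a dict of four signals pre-normalised to [0, 1]:
      visible_frac  - fraction of the region's points unoccluded in this frame
      apparent_area - how large the region appears in the image
      frontal       - how head-on the viewing angle is
      sharpness     - a blur measure for the frame (e.g. variance of the Laplacian)
    The weights here are placeholders, not tuned values."""
    def score(c):
        return (weights[0] * c["visible_frac"] + weights[1] * c["apparent_area"]
                + weights[2] * c["frontal"] + weights[3] * c["sharpness"])
    return max(candidates, key=score)
```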

Turning a 2-D box into a 3-D segment

A naïve back-projection of a 2-D box leaks labels onto the wall behind it. Our pipeline only keeps cloud points whose depth lines up with the depth that the 3-D model predicts at that pixel, and it cross-checks neighbouring frames so that the candidate has to look consistent from more than one angle. The resulting region gets a trust score; if it's too low the agent is told to try another view rather than committing.
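A rough sketch of that check, assuming each frame provides the 3-D model's predicted depth map alongside its intrinsics and pose; the tolerance, the minimum view count, and the trust definition are placeholders rather than the exact values used.

```python
import numpy as np

def project(points, K, T_wc):
    """World points -> pixel coordinates (u, v) and camera-space depth z."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_wc @ homog.T).T[:, :3]
    uvw = (K @ cam.T).T
    return uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2], cam[:, 2]

def depth_consistent(points, frame, rel_tol=0.05):
    """True where a point's depth agrees with the frame's predicted depth map."""
    u, v, z = project(points, frame["K"], frame["T_wc"])
    h, w = frame["depth"].shape
    ui = np.clip(u, 0, w - 1).astype(int)
    vi = np.clip(v, 0, h - 1).astype(int)
    pred = frame["depth"][vi, ui]
    return (z > 0) & (np.abs(z - pred) <= rel_tol * pred)

def segment_points_from_bbox(points, frame, bbox, neighbours, min_views=2):
    """Keep points inside the 2-D box whose depth matches the reconstruction
    in this frame and in at least `min_views` neighbouring frames.
    Returns (kept point indices, trust), trust being the surviving fraction."""
    x0, y0, x1, y1 = bbox
    u, v, _ = project(points, frame["K"], frame["T_wc"])
    candidate = ((u >= x0) & (u <= x1) & (v >= y0) & (v <= y1)
                 & depth_consistent(points, frame))

    support = sum(depth_consistent(points, n) for n in neighbours)   # per-point agreement count
    kept = candidate & (support >= min_views)
    trust = kept.sum() / max(candidate.sum(), 1)
    return np.flatnonzero(kept), float(trust)
```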

Choosing where to look next

Rather than optimise for reconstruction coverage like classical next-best-view methods, we optimise for the question being asked: pick the viewpoint that is most likely to resolve the current candidate hypotheses. Concretely, the system favours views that see several remaining candidates at once and that are close to the current figurine pose, so the user isn't watching it teleport across the room.
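A minimal sketch of that trade-off, assuming a precomputed visibility table that says which candidate viewpoints can see which of the remaining hypotheses; the two weights are illustrative placeholders.

```python
import numpy as np

def next_best_view(viewpoints, current_pos, visible, w_info=1.0, w_move=0.5):
    """Pick the viewpoint that resolves the most open hypotheses per unit of travel.
    `viewpoints` is a list of (x, y, z) positions and `visible[i]` is a boolean list
    saying which remaining candidates viewpoint i can see."""
    scores = []
    for vp, vis in zip(viewpoints, visible):
        info = sum(vis)                                             # hypotheses seen at once
        travel = np.linalg.norm(np.asarray(vp) - np.asarray(current_pos))
        scores.append(w_info * info - w_move * travel)              # reward evidence, penalise teleporting
    return viewpoints[int(np.argmax(scores))]
```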

5 · How well it works

Hotel-room pan

On a 60-second handheld pan of a hotel room with a 20-question battery, a deterministic test planner resolved 19 of 20 queries in an average of about 8 seconds. The language-model planner (Gemma 4 [10]), run on a balanced object / colour / relation / affordance subset, resolved 14 of 14 with natural-language answers — including reasoning like "You can put your suitcase on the dresser" and "The dirtiest area is the trash can."

Pi³ vs VGGT on the same input

3-D backbone · Pixels kept · Cluster yield
VGGT · 22 % · 139 clusters
Pi³ · 54 % · 14 clusters / scene at default settings

Pi³ produces a denser per-frame cloud and is much faster, but its default pixel cap holds resolution down — fine for the model itself, a little tight for the open-vocabulary detector when the target object is small. With more frames per scene the cluster count scales: an 80-frame scene from Pi³'s public examples gives 153 clusters.

6 · What this is, and what it isn't

It is a working scaffold for asking arbitrary questions about casually-captured indoor scenes, with answers grounded back to a real source frame. Round-trip latency on an object-localisation question is around 6-15 seconds with the deterministic planner and 20-35 seconds with the language-model planner.

It isn't a replacement for a curated mesh-based 3-D scene benchmark. The cloud is still a thin per-frame surface shell; occlusion reasoning is approximate; the agent can be wrong if the detector missed an object in the source frames. Two design choices keep that honest: the final evidence is always a real source frame, not a synthesised novel view, and a low trust score forces the agent back into another viewpoint instead of committing.

7 · What this doesn't do yet

References

  1. Y. Wang et al., "π³: Scalable Permutation-Equivariant Visual Geometry Learning," in Proc. Int. Conf. Learn. Representations (ICLR), 2026.
  2. J. Wang et al., "VGGT: Visual Geometry Grounded Transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.
  3. S. Wang et al., "DUSt3R: Geometric 3D Vision Made Easy," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024.
  4. D. Azuma et al., "ScanQA: 3D Question Answering for Spatial Scene Understanding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022.
  5. Y. Hong et al., "3D-LLM: Injecting the 3D World into Large Language Models," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023.
  6. S. Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," in Proc. Int. Conf. Learn. Representations (ICLR), 2023.
  7. M. Minderer, A. Gritsenko, and N. Houlsby, "Scaling Open-Vocabulary Object Detection," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023.
  8. X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, "Sigmoid Loss for Language Image Pre-Training," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023.
  9. S. Peng et al., "OpenScene: 3D Scene Understanding with Open Vocabularies," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023.
  10. Gemma Team, Google DeepMind, "Gemma: Open Models Based on Gemini Research and Technology," Tech. Rep., 2024.
  11. B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3D Gaussian Splatting for Real-Time Radiance Field Rendering," ACM Trans. Graph., vol. 42, no. 4, 2023.