TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen1     Yue Chen1     Yuliang Xiu1     Andreas Geiger2     Anpei Chen1    
1Westlake University      2University of Tübingen, Tübingen AI Center
TL;DR: A simple state update rule to enhance length generalization for CUT3R.

Abstract

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their design as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images.

Comparison

Compared to CUT3R, TTT3R improves length generalization, mitigates forgetting, and enables online loop closure.

Motivation

Real-world applications often require handling an arbitrary number of images.

[Figure: inference GPU memory, runtime, and camera pose error across methods]

Recent feed-forward methods (e.g., VGGT, Point3R) suffer from high memory consumption. Notably, only CUT3R achieves constant memory usage, thanks to its RNN design. However, as illustrated above, CUT3R fails to generalize to long sequences because it was trained mostly on sequences of at most 64 frames.

How TTT3R works

Reconstruction

[Interactive reconstruction comparison: CUT3R vs. TTT3R]

To address forgetting, we retain the cross-attention formulation but introduce a per-token learning rate $\beta_t$, derived from the alignment confidence between the memory state and incoming observations. This acts as a soft gate, improving long-context extrapolation.
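Concretely, one can sketch this soft gate as a per-token convex combination (our notation; the symbol $\tilde{\mathbf{S}}_t$ for the plain cross-attention update and the exact convex-combination form are our illustrative assumptions, not notation from the paper):

$$\mathbf{S}_t = (1-\beta_t)\odot \mathbf{S}_{t-1} + \beta_t \odot \tilde{\mathbf{S}}_t, \qquad \tilde{\mathbf{S}}_t = \operatorname{softmax}\!\big(\mathbf{Q}_{\mathbf{S}_{t-1}}\mathbf{K}^{\top}_{\mathbf{X}_t}\big)\,\mathbf{V}_{\mathbf{X}_t},$$

where $\beta_t \to 1$ overwrites a state token with the new observation, while $\beta_t \to 0$ retains its history.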

[Figure: per-token learning rate visualization]

Instead of updating all states uniformly, we turn the state-to-image attention scores (i.e., $\mathbf{Q}_{\mathbf{S}_{t-1}} {\mathbf{K}^{\top}_{\mathbf{X}_t}}\in\mathbb{R}^{n\times(h\times w)}$), aggregated over the $h \times w$ image tokens, into per-token learning rates $\beta_t\in \mathbb{R}^{n \times 1}$. High-confidence matches receive larger updates, while low-confidence ones are suppressed, which improves long-sequence performance.
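For readers who prefer code, below is a minimal PyTorch sketch of such a gated update. The shapes, the max-over-image-tokens aggregation used to collapse the $n\times(h\times w)$ scores to $\beta_t\in\mathbb{R}^{n\times 1}$, and the name ttt3r_style_update are illustrative assumptions on our part; the paper derives $\beta_t$ in closed form from these same attention scores.

    # Minimal sketch of a per-token gated state update (illustrative, not the
    # official TTT3R code). Assumes single-head attention and batch size 1.
    import torch

    def ttt3r_style_update(S_prev, Q_S, K_X, V_X):
        """S_prev: (n, d) state tokens; Q_S: (n, d) state queries;
        K_X, V_X: (h*w, d) image keys/values. Returns updated state (n, d)."""
        d = Q_S.shape[-1]
        attn = (Q_S @ K_X.T) / d ** 0.5   # (n, h*w) state-to-image alignment
        weights = attn.softmax(dim=-1)    # cross-attention weights
        delta = weights @ V_X             # (n, d) candidate state update
        # Per-token learning rate: confidence of the best-aligned image token
        # (one plausible aggregation of the n x (h*w) scores to n x 1).
        beta = weights.max(dim=-1, keepdim=True).values
        # Soft gate: confident tokens adapt to the new frame, the rest keep history.
        return (1.0 - beta) * S_prev + beta * delta

    # Toy usage: 256 state tokens, 24x24 image tokens, feature dim 128.
    n, hw, d = 256, 576, 128
    S = torch.randn(n, d)
    S_new = ttt3r_style_update(S, torch.randn(n, d),
                               torch.randn(hw, d), torch.randn(hw, d))

Note that the update remains training-free: it reuses the attention scores the model already computes and only changes how strongly each state token is overwritten.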

Acknowledgements

We thank the members of Inception3D and Endless AI Labs for their help. Xingyu Chen and Yue Chen are funded by the Westlake Education Foundation. Xingyu Chen is also supported by the Natural Science Foundation of Zhejiang province, China (No. QKWL25F0301). Andreas Geiger is supported by the ERC Starting Grant LEGO-3D (850533) and DFG EXC number 2064/1 - project number 390727645.

BibTeX


    @article{chen2025ttt3r,
        title={TTT3R: 3D Reconstruction as Test-Time Training},
        author={Chen, Xingyu and Chen, Yue and Xiu, Yuliang and Geiger, Andreas and Chen, Anpei},
        journal={arXiv preprint arXiv:2509.26645},
        year={2025}
    }