DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

¹Seoul National University, ²University of Maryland, College Park, ³Georgia Institute of Technology
equal advising

Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image–language–3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space—a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
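
To make the geometric idea concrete, the sketch below computes a volume for three embeddings and shows why volume alone is ambiguous. It is an illustration under stated assumptions, not the DynaFLIP loss: it assumes the volume is measured through the Gram determinant of the three unit-normalized vectors (the simplex formed with the origin has one sixth of this volume), and it stands in for the cosine regularizer with a mean pairwise cosine distance. The function names are hypothetical, and the contrastive term is omitted.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_volume(image, flow, text):
    """Volume of the parallelotope spanned by three unit vectors.

    Assumption: volume = sqrt(det(G)), where G is the 3x3 Gram matrix of
    pairwise cosine similarities. The determinant is expanded in closed form.
    """
    a, b, c = normalize(image), normalize(flow), normalize(text)
    ab, ac, bc = dot(a, b), dot(a, c), dot(b, c)
    det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    return math.sqrt(max(det, 0.0))  # clamp tiny negative rounding errors


def mean_pairwise_cosine_distance(image, flow, text):
    """Hypothetical stand-in for a cosine regularizer: 0 only if all three agree."""
    a, b, c = normalize(image), normalize(flow), normalize(text)
    return 1.0 - (dot(a, b) + dot(a, c) + dot(b, c)) / 3.0


# Mutually orthogonal embeddings: maximal volume, poor alignment.
orthogonal = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])

# Two modalities agree, the third is orthogonal: the volume is already zero
# even though the triplet is not aligned, so volume alone cannot tell.
degenerate = ([1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0])

vol_orthogonal = gram_volume(*orthogonal)
vol_degenerate = gram_volume(*degenerate)
cos_degenerate = mean_pairwise_cosine_distance(*degenerate)
```

In the degenerate case the volume vanishes while the cosine distance stays positive, which is one way to read the motivation for pairing volume minimization with a cosine term.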

Overall Framework of DynaFLIP

DynaFLIP architecture

Dataset Viewer

Each sample in the dataset is an (image, 3D flow, language) triplet.

Experiments

DynaFLIP learns dynamics-aware representations that preserve control-relevant information

Higher control-relevant score S_m (x-axis) correlates with higher manipulation success rate (y-axis). DynaFLIP sits at the top-right of both plots: its dynamics-aware representations preserve control-relevant information and improve manipulation performance.

Control-relevant metric figure

DynaFLIP produces more spatially coherent and object-aware feature structures than the baselines.

Figure: Feature visualization with PCA. Panels: Original, DINOv2, SigLIP, DynaFLIP.

DynaFLIP attends to control-relevant regions (i.e., manipulated objects and interaction regions), whereas baselines distribute attention over less relevant areas such as the background or irrelevant objects.

Figure: Grad-CAM heatmaps over action prediction. Panels: Original, DINOv2, SigLIP, DynaFLIP.

DynaFLIP serves as a visual backbone for diverse downstream policies (VLA, diffusion policy, MLP)

DynaFLIP outperforms baseline image encoders on real-world manipulation when used with a Vision-Language-Action model (π0.5).

Real-World Manipulation

| Image Encoder | Pick <object> into Sink | Pour almonds into <object> | Unfold Towel | Mean |
|---|---|---|---|---|
| DINOv2 | 75 | 65 | 40 | 60.0 |
| SigLIP | 55 | 60 | 20 | 45.0 |
| DynaFLIP | 90 | 70 | 50 | 70.0 |

On the LIBERO benchmark with Diffusion Policy, DynaFLIP achieves the highest mean success rate in both the frozen and LoRA fine-tuned settings, although results on individual suites vary.

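Frozen

| Image Encoder | Language Encoder | 90 | Goal | Object | Spatial | Long | Mean |
|---|---|---|---|---|---|---|---|
| DINOv2 | CLIP | 14.4 | 75.0 | 33.5 | 42.5 | 20.5 | 37.2 |
| SigLIP | SigLIP | 24.3 | 54.5 | 13.0 | 52.0 | 8.5 | 30.5 |
| DynaFLIP | DynaFLIP | 31.7 | 70.5 | 37.5 | 51.5 | 16.5 | 41.5 |

LoRA Fine-tuned

| Image Encoder | Language Encoder | 90 | Goal | Object | Spatial | Long | Mean |
|---|---|---|---|---|---|---|---|
| DINOv2 | CLIP | 83.6 | 77.5 | 82.0 | 81.0 | 67.5 | 78.3 |
| SigLIP | SigLIP | 82.6 | 80.5 | 82.0 | 74.0 | 76.5 | 79.1 |
| DynaFLIP | DynaFLIP | 78.1 | 84.5 | 83.5 | 78.5 | 80.5 | 81.0 |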

A lightweight three-layer MLP policy ensures that downstream performance reflects representation quality rather than policy capacity. DynaFLIP achieves the highest mean success rates on both MetaWorld and RLBench.

MetaWorld

| Algorithm | Easy (7) | Medium (5) | Hard & Very Hard (3) | Mean |
|---|---|---|---|---|
| DINOv2 | 77.7 | 77.6 | 64.0 | 74.9 |
| SigLIP | 74.3 | 72.8 | 56.7 | 70.4 |
| DynaFLIP | 81.1 | 81.6 | 69.3 | 78.9 |

RLBench

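| Algorithm | close box | put rubbish in bin | close laptop lid | water plants | unplug charger | toilet seat down | Mean |
|---|---|---|---|---|---|---|---|
| DINOv2 | 84 | 12 | 76 | 4 | 24 | 84 | 47.3 |
| SigLIP | 80 | 4 | 52 | 0 | 12 | 76 | 37.3 |
| DynaFLIP | 88 | 8 | 76 | 20 | 36 | 96 | 54.0 |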
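
The three-layer MLP policy used for the MetaWorld and RLBench evaluations can be pictured as a small head on top of encoder features. The sketch below is only an illustration of that setup: the feature, hidden, and action dimensions, the ReLU activations, and the initialization are all assumptions not stated on this page, and the function names are hypothetical.

```python
import random


def init_layer(n_in, n_out, rng):
    """One fully connected layer: weight rows of length n_in, zero biases."""
    scale = (2.0 / n_in) ** 0.5  # assumed He-style initialization
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def make_policy(feat_dim, hidden_dim, action_dim, seed=0):
    """Three linear layers: features -> hidden -> hidden -> action."""
    rng = random.Random(seed)
    return [
        init_layer(feat_dim, hidden_dim, rng),
        init_layer(hidden_dim, hidden_dim, rng),
        init_layer(hidden_dim, action_dim, rng),
    ]


def policy_forward(params, features):
    """Map encoder features to an action vector.

    ReLU is applied after every layer except the last (an assumption).
    """
    h = features
    for index, (weights, biases) in enumerate(params):
        h = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        if index < len(params) - 1:
            h = [max(0.0, value) for value in h]
    return h


# Hypothetical sizes: an 8-dim feature vector from the image encoder
# and a 4-dim action (for example, end-effector delta plus gripper).
policy = make_policy(feat_dim=8, hidden_dim=16, action_dim=4)
action = policy_forward(policy, [0.1] * 8)
```

Because the head is this small, differences in success rate between encoders are more plausibly attributed to the features it receives than to the policy itself.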

DynaFLIP maintains strong performance under out-of-distribution perturbations

| Image Encoder | Visual, spatial perturbations | Semantic perturbations | Mean |
|---|---|---|---|
| DINOv2 | 17.5 | 27.5 | 22.5 |
| SigLIP | 25.0 | 30.0 | 27.5 |
| DynaFLIP | 40.0 | 75.0 | 57.5 |

DynaFLIP's focus on control-relevant regions enables it to remain robust to changes in object layout and the presence of distractors.

DynaFLIP incorporates language as one of its pre-training modalities and learns to align visual changes with task-relevant instructions, yielding representations that remain robust under unseen objects and instructions.

BibTeX

@inproceedings{TODO_bibkey,
  author    = {TODO and Lee, Jusuk and TODO},
  title     = {TODO: Paper Title},
  booktitle = {TODO: Venue},
  year      = {YYYY},
}