Reflective VLA: In-Context Action Consequences Make VLAs Generalize

Qing Lian1, Kent Yu1,2,3, and Lei Zhang1,2,3

1Futian Laboratory
2International Digital Economy Academy (IDEA)
3Visincept

Contact: lianqing1997@gmail.com

Reflective VLA teaser

Reflective VLA moves from reactive single-frame control to reflective control. Past observation-action-consequence triplets expose deployment-specific factors such as camera geometry and calibration bias, enabling the policy to adapt online without test-time fine-tuning.

Abstract

Most vision-language-action (VLA) models are reactive: they predict the next action from the current instruction and observation, implicitly assuming that the current observation fully specifies the action-relevant state. In embodied control, however, embodiment-specific factors such as camera-to-robot geometry, robot calibration, or systematic actuation bias are often hard to identify from a single observation. We propose Reflective VLA, which conditions each decision on a context of observation-action-consequence triplets. Each triplet records not only what the robot observed and executed, but also how the scene changed afterward, exposing the deployment-specific mapping from actions to observed effects. Across LIBERO, SimplerEnv-Bridge, LIBERO-Plus, LIBERO-Plus-Hard, and real-world cross-camera deployment, Reflective VLA improves generalization while preserving strong in-distribution performance.

Method

Reflective VLA method overview

Reflective VLA packs historical triplets (O, A, O') and the current observation into one multimodal prompt. The action expert attends to the full shared VLM token sequence and predicts the next continuous action chunk.
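The packing order described above can be pictured as a flat list of labeled segments. The following is a minimal sketch under stated assumptions: the segment labels and the `pack_prompt` helper are hypothetical, the language instruction is omitted, and the page does not specify the actual token format.

```python
from typing import List, Tuple

# Hypothetical segment label: (kind, step index). Not the authors' token format.
Segment = Tuple[str, int]


def pack_prompt(num_history: int) -> List[Segment]:
    """Lay out segments for `num_history` past triplets plus the current observation."""
    segments: List[Segment] = []
    for i in range(num_history):
        segments.append(("obs", i))        # O: pre-action observation
        segments.append(("action", i))     # A: executed action chunk
        segments.append(("next_obs", i))   # O': post-action observation
    # The current observation comes last; its action chunk is what gets predicted.
    segments.append(("obs", num_history))
    return segments


if __name__ == "__main__":
    for segment in pack_prompt(2):
        print(segment)
```

With an empty history the layout reduces to a single observation segment, which corresponds to the reactive single-frame setting.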

Block-causal attention mask

The block-causal mask supervises all sampled targets in one forward pass while preventing future leakage.

Causal Triplets

Each context unit binds the pre-action observation, executed action chunk, and aligned post-action observation.
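One way to read "aligned" is that the post-action observation is the frame captured right after the last action of the chunk has executed. The sketch below slices a recorded trajectory that way. The `Triplet` class, the `build_triplets` helper, and the non-overlapping chunking are illustrative assumptions, not details taken from the page.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple


@dataclass(frozen=True)
class Triplet:
    obs: Any                    # O: observation before the chunk
    actions: Tuple[Any, ...]    # A: executed action chunk
    next_obs: Any               # O': observation after the chunk finishes


def build_triplets(
    observations: Sequence[Any], actions: Sequence[Any], chunk: int
) -> List[Triplet]:
    """Slice a trajectory into non-overlapping triplets.

    Assumes observations[t] is seen before actions[t] executes, so a
    trajectory with T actions has T + 1 observations.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    triplets: List[Triplet] = []
    # Trailing actions that do not fill a whole chunk are dropped.
    for start in range(0, len(actions) - chunk + 1, chunk):
        end = start + chunk
        triplets.append(
            Triplet(observations[start], tuple(actions[start:end]), observations[end])
        )
    return triplets
```

In this sketch the post-action observation of one triplet is also the pre-action observation of the next, so consecutive triplets share a frame.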

Block-Causal Training

A block-causal mask supervises multiple context positions in one forward pass without future leakage.
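A common form of block-causal masking lets every token attend to all tokens in its own block and in earlier blocks, but never to later blocks. The sketch below builds such a mask from per-token block ids. How Reflective VLA partitions tokens into blocks is not stated on this page, so the `block_ids` example is purely illustrative.

```python
from typing import List, Sequence


def block_causal_mask(block_ids: Sequence[int]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    A token sees every token whose block id is less than or equal to its own.
    """
    return [[k_block <= q_block for k_block in block_ids] for q_block in block_ids]


if __name__ == "__main__":
    # Illustrative layout: three blocks holding 2, 1, and 2 tokens.
    mask = block_causal_mask([0, 0, 1, 2, 2])
    for row in mask:
        print("".join("1" if allowed else "." for allowed in row))
    # Prints:
    # 11...
    # 11...
    # 111..
    # 11111
    # 11111
```

Because earlier blocks never see later ones, a prediction target placed at any block boundary is conditioned only on its own past, which is what allows several targets to be supervised in a single forward pass.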

Cached Inference

A rolling context buffer reuses completed triplets and supports real-time online deployment.
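A rolling buffer of this kind can be sketched with a bounded deque: a step is opened when an action chunk is issued and closed once its consequence has been observed, and the oldest triplet is evicted when the buffer is full. The `RollingContext` class and its method names are hypothetical, and the stored items could equally be cached encodings rather than raw observations.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple

TripletT = Tuple[Any, Any, Any]  # (obs, action_chunk, next_obs)


class RollingContext:
    """Keeps the most recent completed triplets for online inference."""

    def __init__(self, max_triplets: int) -> None:
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._done: Deque[TripletT] = deque(maxlen=max_triplets)
        self._pending: Optional[Tuple[Any, Any]] = None

    def start_step(self, obs: Any, action_chunk: Any) -> None:
        """Record the observation and the action chunk about to be executed."""
        self._pending = (obs, action_chunk)

    def finish_step(self, next_obs: Any) -> None:
        """Complete the pending step with its observed consequence."""
        if self._pending is None:
            raise RuntimeError("finish_step called without start_step")
        obs, action_chunk = self._pending
        self._done.append((obs, action_chunk, next_obs))
        self._pending = None

    def context(self) -> List[TripletT]:
        """Completed triplets, oldest first."""
        return list(self._done)
```

Only completed triplets enter the context, so a step whose consequence has not yet been observed never conditions the policy.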

Main Results

| Method | SimplerEnv Avg | LIBERO-Plus Avg | LIBERO-Plus-Hard Avg |
|---|---|---|---|
| Reactive baseline pi_0.5 | 72.9 | 82.3 | 64.6 |
| Reflective VLA (ours) | 78.2 | 87.7 | 68.8 |
| Gain | +5.3 | +5.4 | +4.2 |

Success rates are reported in percent. Reflective VLA preserves nominal benchmark performance and improves robustness under deployment shifts that change sensing or embodiment-specific factors.

Detailed LIBERO Results

| Method | Spatial | Object | Goal | Long |
|---|---|---|---|---|
| Reactive baseline pi_0.5 | 97.5 | 98.2 | 97.8 | 94.0 |
| Reflective VLA (ours) | 98.4 | 99.0 | 98.2 | 94.6 |
| Gain | +0.9 | +0.8 | +0.4 | +0.6 |

LIBERO-Plus / Hard Perturbations

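| Method | Camera | Robot | Language | Lighting | Background | Noise | Layout | Camera* | Rob. Calib* |
|---|---|---|---|---|---|---|---|---|---|
| Reactive baseline pi_0.5 | 90.0 | 50.0 | 92.9 | 92.0 | 85.8 | 90.2 | 75.0 | 74.0 | 55.2 |
| Reflective VLA (ours) | 95.7 | 72.9 | 92.2 | 92.1 | 92.7 | 95.4 | 73.0 | 76.3 | 61.3 |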

Camera* and Rob. Calib* denote the two harder LIBERO-Plus-Hard shifts.
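The LIBERO-Plus and LIBERO-Plus-Hard averages in Main Results appear consistent with unweighted means of the per-perturbation numbers above: the first seven columns for LIBERO-Plus and the two starred columns for LIBERO-Plus-Hard. This short check reproduces them. The unweighted-mean interpretation is an inference from the reported numbers, not something the page states.

```python
from statistics import mean

# Per-perturbation success rates (percent), copied from the table above.
LIBERO_PLUS = {
    "reactive": [90.0, 50.0, 92.9, 92.0, 85.8, 90.2, 75.0],
    "reflective": [95.7, 72.9, 92.2, 92.1, 92.7, 95.4, 73.0],
}
LIBERO_PLUS_HARD = {
    "reactive": [74.0, 55.2],
    "reflective": [76.3, 61.3],
}


def avg(values):
    """Unweighted mean rounded to one decimal, as in the results tables."""
    return round(mean(values), 1)


if __name__ == "__main__":
    for name, table in [("LIBERO-Plus", LIBERO_PLUS), ("LIBERO-Plus-Hard", LIBERO_PLUS_HARD)]:
        base, ours = avg(table["reactive"]), avg(table["reflective"])
        print(f"{name}: {base} -> {ours} (gain {ours - base:+.1f})")
```

The largest single-column improvement is on the Robot perturbation (50.0 to 72.9), while Language and Layout are slightly lower than the baseline.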

Videos

LIBERO-Plus-Hard Visualizations

Representative successful Reflective VLA rollouts from the LIBERO-Plus-Hard camera-perturbation evaluation.

Real-Robot Deployment

Real-world robot deployment clips.