Reflective VLA: In-Context Action Consequences Make VLAs Generalize

Qing Lian¹, Kent Yu¹,²,³, and Lei Zhang¹,²,³

¹Futian Laboratory
²International Digital Economy Academy (IDEA)
³Visincept

Contact: lianqing1997@gmail.com

Code: coming soon

Reflective VLA moves from reactive single-frame control to reflective control. Past observation-action-consequence triplets expose deployment-specific factors such as camera geometry and calibration bias, enabling the policy to adapt online without test-time fine-tuning.

Abstract

Most vision-language-action (VLA) models are reactive: they predict the next action from the current instruction and observation, implicitly assuming that the current observation fully specifies the action-relevant state. In embodied control, however, embodiment-specific factors such as camera-to-robot geometry, robot calibration, or systematic actuation bias are often hard to identify from a single observation. We propose Reflective VLA, which conditions each decision on a context of observation-action-consequence triplets. Each triplet records not only what the robot observed and executed, but also how the scene changed afterward, which exposes the deployment-specific mapping from actions to observed effects. Across LIBERO, SimplerEnv-Bridge, LIBERO-Plus, LIBERO-Plus-Hard, and real-world cross-camera deployment, Reflective VLA improves generalization while preserving strong in-distribution performance.

Method

Reflective VLA packs historical triplets (O, A, O') and the current observation into one multimodal prompt. The action expert attends to the full shared VLM token sequence and predicts the next continuous action chunk.
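The packing order described above can be illustrated with a minimal sketch. It only labels token positions; the per-segment token counts, the segment names, and the function `pack_prompt` are assumptions made for illustration, and the language instruction and the real tokenizers are left out.

```python
from typing import List, Tuple

# Hypothetical segment labels; the actual model uses VLM tokens, not strings.
Segment = Tuple[str, int]


def pack_prompt(num_triplets: int, obs_tokens: int, action_tokens: int) -> List[Segment]:
    """Label every prompt position with (segment_name, step_index).

    Layout: O_0, A_0, O'_0, O_1, A_1, O'_1, ..., then the current observation.
    """
    layout: List[Segment] = []
    for k in range(num_triplets):
        layout += [("O", k)] * obs_tokens          # pre-action observation
        layout += [("A", k)] * action_tokens       # executed action chunk
        layout += [("O'", k)] * obs_tokens         # post-action observation
    # The current observation comes last; the next action chunk is predicted from it.
    layout += [("O_current", num_triplets)] * obs_tokens
    return layout


if __name__ == "__main__":
    layout = pack_prompt(num_triplets=2, obs_tokens=3, action_tokens=2)
    print(len(layout), layout[0], layout[-1])
```

With two triplets, three observation tokens, and two action tokens, the sequence has 2 × (3 + 2 + 3) + 3 = 19 positions, and the final positions belong to the current observation.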

The block-causal mask supervises all sampled targets in one forward pass while preventing future leakage.

Causal Triplets

Each context unit binds the pre-action observation, executed action chunk, and aligned post-action observation.
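One way to picture such a context unit is a small record type built from a logged rollout. This is a sketch under stated assumptions: the class `Triplet`, the helper `build_triplets`, and the use of plain float lists in place of images and robot actions are hypothetical, and it assumes one observation is logged before and after every executed chunk.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Triplet:
    obs_before: Sequence[float]              # O: observation before acting
    action_chunk: Sequence[Sequence[float]]  # A: the chunk that was actually executed
    obs_after: Sequence[float]               # O': observation once the chunk finished


def build_triplets(observations: List[Sequence[float]],
                   action_chunks: List[Sequence[Sequence[float]]]) -> List[Triplet]:
    """Pair each executed chunk with the observations on either side of it."""
    if len(observations) != len(action_chunks) + 1:
        raise ValueError("expected exactly one more observation than action chunks")
    return [
        Triplet(observations[i], action_chunks[i], observations[i + 1])
        for i in range(len(action_chunks))
    ]


if __name__ == "__main__":
    obs = [[0.0], [1.0], [2.0]]
    chunks = [[[0.1]], [[0.2]]]
    for t in build_triplets(obs, chunks):
        print(t)
```

In this layout the post-action observation of one triplet is the pre-action observation of the next, so the alignment between an action and its consequence is explicit in the data.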

Block-Causal Training

A block-causal mask supervises multiple context positions in one forward pass without future leakage.
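A generic block-causal mask can be sketched as follows. The rule used here (a token may attend to every token in its own block and in earlier blocks, never to later blocks) is the common textbook form and is an assumption; the exact mask used by Reflective VLA, including how blocks map to triplets and targets, may differ. The function name `block_causal_mask` is hypothetical.

```python
from typing import List


def block_causal_mask(block_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    Tokens see their own block and all earlier blocks, so a target placed
    in block b cannot read information from blocks after b.
    """
    n = len(block_ids)
    return [[block_ids[j] <= block_ids[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Five tokens grouped into three blocks (for example, three context positions).
    mask = block_causal_mask([0, 0, 1, 1, 2])
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

Because every block only sees the past, each block can carry its own prediction target, and all targets can be trained in the same forward pass.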

Cached Inference

A rolling context buffer reuses completed triplets and supports real-time online deployment.
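The buffer logic can be sketched with a fixed-length deque. This is an illustration under assumptions: the class `RollingContext`, its method names, and the `encode` callback are hypothetical, and the sketch caches one encoded item per completed triplet rather than modeling any transformer key-value cache.

```python
from collections import deque
from typing import Any, Callable, List, Optional, Tuple


class RollingContext:
    """Keep the most recent completed triplets, encoding each one only once."""

    def __init__(self, max_triplets: int, encode: Callable[[Any, Any, Any], Any]):
        self._encoded = deque(maxlen=max_triplets)  # oldest entry is dropped automatically
        self._encode = encode
        self._pending: Optional[Tuple[Any, Any]] = None

    def start_step(self, obs: Any, action_chunk: Any) -> None:
        """Remember what was observed and executed; the consequence is not known yet."""
        self._pending = (obs, action_chunk)

    def finish_step(self, next_obs: Any) -> None:
        """Close the pending triplet with its post-action observation and cache it."""
        if self._pending is None:
            raise RuntimeError("finish_step called without a pending step")
        obs, action_chunk = self._pending
        self._encoded.append(self._encode(obs, action_chunk, next_obs))
        self._pending = None

    def context(self) -> List[Any]:
        """Return cached encodings, oldest first, without re-encoding anything."""
        return list(self._encoded)


if __name__ == "__main__":
    ctx = RollingContext(max_triplets=2, encode=lambda o, a, n: (o, a, n))
    for t in range(3):
        ctx.start_step(obs=t, action_chunk=f"a{t}")
        ctx.finish_step(next_obs=t + 1)
    print(ctx.context())
```

Only completed triplets enter the buffer, so the policy never conditions on an action whose consequence has not been observed yet.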

Main Results

Method	SimplerEnv Avg	LIBERO-Plus Avg	LIBERO-Plus-Hard Avg
Reactive baseline pi_0.5	72.9	82.3	64.6
Reflective VLA (ours)	78.2	87.7	68.8
Gain	+5.3	+5.4	+4.2

Success rates are reported in percent. Reflective VLA preserves nominal benchmark performance and improves robustness under deployment shifts that change sensing or embodiment-specific factors.

Detailed LIBERO Results

Method	Spatial	Object	Goal	Long
Reactive baseline pi_0.5	97.5	98.2	97.8	94.0
Reflective VLA (ours)	98.4	99.0	98.2	94.6
Gain	+0.9	+0.8	+0.4	+0.6

LIBERO-Plus / Hard Perturbations

Method	Camera	Robot	Lang	Light	Bg	Noise	Layout	Camera*	Rob. Calib*
Reactive baseline pi_0.5	90.0	50.0	92.9	92.0	85.8	90.2	75.0	74.0	55.2
Reflective VLA (ours)	95.7	72.9	92.2	92.1	92.7	95.4	73.0	76.3	61.3

Camera* and Rob. Calib* denote the two harder LIBERO-Plus-Hard shifts.
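The averages in the Main Results table appear to be unweighted means of the per-perturbation success rates above. The short check below works under that assumption (the page does not state the averaging rule): it averages the seven LIBERO-Plus columns and the two starred LIBERO-Plus-Hard columns. The function and variable names are hypothetical.

```python
from typing import Dict, List

# Per-perturbation success rates (%) copied from the table above.
LIBERO_PLUS: Dict[str, List[float]] = {
    "baseline": [90.0, 50.0, 92.9, 92.0, 85.8, 90.2, 75.0],
    "reflective": [95.7, 72.9, 92.2, 92.1, 92.7, 95.4, 73.0],
}
LIBERO_PLUS_HARD: Dict[str, List[float]] = {
    "baseline": [74.0, 55.2],
    "reflective": [76.3, 61.3],
}


def mean_success(rates: List[float]) -> float:
    """Unweighted mean, rounded to one decimal place as in the results table."""
    return round(sum(rates) / len(rates), 1)


def summarize(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Average both methods and report the gain of Reflective VLA over the baseline."""
    base = mean_success(table["baseline"])
    ours = mean_success(table["reflective"])
    return {"baseline": base, "reflective": ours, "gain": round(ours - base, 1)}


if __name__ == "__main__":
    print("LIBERO-Plus:", summarize(LIBERO_PLUS))
    print("LIBERO-Plus-Hard:", summarize(LIBERO_PLUS_HARD))
```

Under this assumption the check reproduces the reported averages (82.3 and 87.7 for LIBERO-Plus, 64.6 and 68.8 for LIBERO-Plus-Hard) and the gains of +5.4 and +4.2.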

Videos

LIBERO-Plus-Hard Visualizations

Representative successful Reflective VLA rollouts from the LIBERO-Plus-Hard camera-perturbation evaluation.

Open the middle drawer

Place the yellow dairy product in the basket

Move the black bowl onto the plate

Turn on the stove and place the moka pot

Real-Robot Deployment

Video clips of real-world robot deployment.