Original Paper : VLA-0: Building State-of-the-Art VLAs with Zero Modification
Open-source repo : https://vla0.github.io/

Introduction

1. Discrete-Token VLAs (RT-2 / OpenVLA)

Key architecture : Robot actions, originally continuous, are discretized into bins; each bin is then assigned a token from the VLM vocabulary, using either new or infrequent tokens

This line of work, pioneered largely at Google with RT-2, has two limitations (a discretization sketch follows the list below).

  • Restricting actions to discrete bins limits fine-grained control (i.e., action resolution).
  • It compromises the pretrained language understanding of the VLM by repurposing its vocabulary for actions.
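
For intuition, a minimal sketch of the binning idea (the bin count, action range, and token mapping are illustrative, not the actual RT-2/OpenVLA implementation):

```python
import numpy as np

def discretize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin id."""
    clipped = np.clip(action, low, high)
    # Scale [low, high] -> [0, n_bins - 1]; each bin id is then assigned
    # a (new or rarely used) token from the VLM's vocabulary.
    return ((clipped - low) / (high - low) * (n_bins - 1)).round().astype(int)

# A 7-D delta-pose + gripper action becomes 7 token ids:
print(discretize_action(np.array([0.02, -0.10, 0.05, 0.0, 0.0, 0.1, 1.0])))
```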

2. Generative Action Head VLAs

Key architecture : The VLM is fine-tuned to predict a latent vector, which is then decoded into actions using a generative model such as a diffusion process or flow matching.

The primary drawback: this often leads to a decline in the language understanding and grounding capabilities of the underlying VLM.

3. Custom Architecture VLAs

These methods introduce architectural modifications or custom tokenizers tailored to action prediction.

The drawback: they require significant architectural changes, additional parameters, or custom training pipelines.


The key to VLA-0's success lies in a carefully designed training and inference recipe, including action-token masking and prediction ensembling, a critical component not explored in LLARVA (a two-stage, text-based action prediction method).


3. Methods

VLM (Vision-Language Model)

  1. Pre-trained vision encoder -> projects image patches into the LLM’s embedding space.
  2. LLM tokenizer -> embeds the input text.

Backbone : Qwen2.5-VL-3B
Chosen for reproducibility, compute efficiency, and competitive performance at its model size.
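
Since VLA-0 leaves the backbone untouched, loading it is just the standard Hugging Face API; a sketch (the exact checkpoint id below is an assumption):

```python
# Requires a recent transformers release with Qwen2.5-VL support.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# No new tokens, no vocabulary edits, no extra layers: the stock VLM is
# fine-tuned as-is to emit action strings.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
```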

VLA-0

VLA-0 preserves the integrity of the underlying VLM : it does not introduce new tokens, alter the existing vocabulary, or add any new neural network layers.

However, achieving this performance relies on a careful recipe.

Inputs

System Prompt, Images, and a Task Instruction.

  • System Prompt (instantiated in the sketch after this list):
Analyze the input image and predict robot actions for the next H timesteps. Each action has D dimensions. Output a single sequence of H × D integers (0 - B each), representing the H timesteps sequentially. Provide only space-separated numbers. Nothing else.
  • Images :


Simulation : third-person view + wrist view (as in the base model setup)
Real-world test : right and left camera views

  • Task Instruction :

e.g. “put the banana on the plate.”
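
A sketch of how the system prompt could be instantiated; the H, D, and B values below are illustrative placeholders, not necessarily the paper's settings:

```python
H, D, B = 8, 7, 1000  # horizon, action dims, max integer value (illustrative)

system_prompt = (
    f"Analyze the input image and predict robot actions for the next {H} "
    f"timesteps. Each action has {D} dimensions. Output a single sequence "
    f"of {H * D} integers (0 - {B} each), representing the {H} timesteps "
    "sequentially. Provide only space-separated numbers. Nothing else."
)
task_instruction = "put the banana on the plate."
```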

Action Decoding

VLA-0 returns actions as generated text, so the output string must be parsed back into numbers.

To simplify this task, the VLM is asked to output actions as integers, which are then de-normalized into continuous actions.
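
A minimal decoding sketch, assuming each action dimension was normalized to integers in [0, B] during training (the per-dimension ranges `low`/`high` are assumptions):

```python
import numpy as np

def decode_actions(text, H, D, low, high, B=1000):
    """Parse the VLM's space-separated integer string into H continuous actions."""
    ints = np.array(text.split(), dtype=np.float64)
    assert ints.size == H * D, "the model must emit exactly H * D integers"
    # Undo the quantization: [0, B] -> [low, high] per dimension.
    return ints.reshape(H, D) / B * (high - low) + low

# e.g. decode_actions("512 430 999 ...", H=8, D=7,
#                     low=np.full(7, -1.0), high=np.full(7, 1.0))
```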

Ensemble Prediction

At each inference step, the VLM predicts a sequence of n future actions.

We average these n overlapping predictions to produce the final, more stable action at time step t (ACT-style averaging over an n-step window).
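
This is temporal ensembling in the style of ACT: successive inference steps produce overlapping action chunks, and all available predictions for the current timestep are averaged. A minimal sketch (the buffer logic is an illustration, not the paper's exact implementation):

```python
from collections import deque
import numpy as np

class ActionEnsembler:
    """Average overlapping action-chunk predictions for the current timestep."""

    def __init__(self, horizon):
        self.history = deque(maxlen=horizon)  # most recent chunks, newest last

    def step(self, chunk):
        """chunk: (H, D) actions predicted for timesteps t, t+1, ..., t+H-1."""
        self.history.append(chunk)
        # A chunk predicted k steps ago holds its action for "now" at index k.
        preds = [c[k] for k, c in enumerate(reversed(self.history))]
        return np.mean(preds, axis=0)  # ensembled action for timestep t
```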

Masked Action Augmentation (training augmentation)

During training, we randomly mask out characters in the target action string.
This procedure forces the VLM to reason about the action based on the visual observation and instruction, rather than simply relying on auto-completing a numerical sequence it has started to generate.
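
A sketch of this augmentation; the mask character and masking probability are assumptions:

```python
import random

def mask_action_string(action_str, p=0.15, mask_char="*"):
    """Randomly mask digits in the target action string during training.

    Masking breaks trivial numeric auto-completion, so the model must ground
    each digit in the image and instruction instead of in the digits it has
    already generated.
    """
    return "".join(
        mask_char if ch.isdigit() and random.random() < p else ch
        for ch in action_str
    )

# e.g. "512 430 999" -> "5*2 43* 9*9" (stochastic)
```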


4. Experiments

Setup

Tasks (Real World)

  • reorienting a block
  • pushing an apple
  • picking and placing a banana
  • picking and placing a cupcake

For each task, we collect 100 demonstrations for training.

Tasks (Simulation)

LIBERO consists of four suites: Spatial, Object, Goal, and Long.

Each suite contains 10 tasks, and each task is tested over 50 episodes.

Baselines

See the table below.

Tables

LIBERO success rates (%) per suite, with average and average rank:

| Model | Large-scale pre-train | VLA Type | Spatial | Object | Goal | Long | Avg. | Avg. rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion Policy [4], [11] | ✗ | N/A | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 | 6.5 |
| π0-FAST (Paligemma) [2], [19] | ✗ | Custom | 87.0 | 63.0 | 89.0 | 48.0 | 71.8 | 6.0 |
| SmolVLA (0.24B) [19] | ✗ | Gen Head | 87.0 | 93.0 | 88.0 | 63.0 | 82.8 | 5.3 |
| SmolVLA (2.25B) [19] | ✗ | Gen Head | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 | 4.0 |
| OpenVLA-OFT [10] | ✗ | Custom | 94.3 | 95.2 | 91.7 | 86.5 | 91.9 | 2.8 |
| π0.5-KI [5] | ✗ | Gen Head | 96.6 | 97.2 | 94.6 | 85.8 | 93.3 | 2.3 |
| VLA-0 (Ours) | ✗ | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 1.0 |
| Octo [21] | ✓ | Gen Head | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | 8.8 |
| OpenVLA [11] | ✓ | Dis. Tok. | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 8.0 |
| π0-FAST [16] | ✓ | Custom | 90.0 | 86.0 | 95.0 | 73.0 | 86.0 | 6.5 |
| MolmoAct [12] | ✓ | Dis. Tok. | 87.0 | 95.4 | 87.6 | 77.2 | 86.8 | 6.5 |
| GR00T-N1 [1] | ✓ | Gen Head | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4.5 |
| π0 [2] | ✓ | Gen Head | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 3.3 |
| π0.5-KI [5] | ✓ | Gen Head | 98.0 | 97.8 | 95.6 | 85.8 | 94.3 | 3.0 |
| OpenVLA-OFT [10] | ✓ | Custom | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | 1.5 |
| VLA-0 (Ours) | ✗ | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 2.8 |

The two blocks are ranked separately: the upper block contains methods without large-scale robotics pre-training; the lower block ranks VLA-0 against methods that have it (VLA-0 itself uses none in both cases).

Results

In simulation (LIBERO):

  • Without large-scale pre-training: VLA-0 outperforms the second-best method by 1.4 points on average.
  • Against methods with large-scale pre-training: VLA-0 still achieves the second-best average rank (2.8), trailing only OpenVLA-OFT [10] (average rank 1.5), a custom VLA model.

In the real world,

  • We compare with SmolVLA [19], a strong baseline that was specifically trained on the large-scale SO-100 dataset and has been shown to outperform popular methods like π0 [2] and ACT [23] on this platform.

Ablation studies

Row 0 is the full recipe; each subsequent row changes a single component.

| Row ID | Ensemble Act. | Masked Act. Aug. | Tiled Img. | Act. Res. | Avg. Succ. (%) | Δ perf. |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | ✓ | ✓ | ✓ | 1000 | 94.7 | 0.0 |
| 1 | ✗ | ✓ | ✓ | 1000 | 92.0 | -2.0 |
| 2 | ✓ | ✗ | ✓ | 1000 | 93.5 | -1.2 |
| 3 | ✓ | ✓ | ✓ | 4000 | 94.2 | -0.5 |
| 4 | ✓ | ✓ | ✓ | 250 | 93.2 | -1.5 |
| 5 | ✓ | ✓ | ✗ | 1000 | 94.5 | -0.2 |

Every ablation hurts; removing prediction ensembling costs the most.


5. Conclusion

The key strength of this work is its zero-modification use of the VLM.

Without adding extra action tokens or encoders, a carefully designed training and inference recipe achieves state-of-the-art results among VLAs that likewise lack large-scale pre-training, and remains highly competitive with those that have it.


Learned

Action Vectors

  • (dx,dy,dz)
  • (droll, dpitch, dyaw)
  • (gripper)

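For concreteness, a tiny sketch of this 7-D action layout (values illustrative):

```python
import numpy as np

# One action step: [dx, dy, dz, droll, dpitch, dyaw, gripper]
action = np.array([0.01, -0.02, 0.00, 0.0, 0.0, 0.05, 1.0])
translation, rotation, gripper = action[:3], action[3:6], action[6]
```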

Is VLA-0 a generalist task model?

YES?

Personal Thoughts

Ultimately, all VLAs exploit the same property of current VLMs: the ability to take image-based scene understanding and text input at the same time.

The reason to adopt a VLM in the first place is that a robot's situational judgment amounts to understanding the physical world, and this is exactly where a VLM pre-trained on large-scale internet data provides its advantage.