DreamZero vs. π0.5: head-to-head evaluation on DROID in Isaac Sim.

Zero-Shot Policy Comparison of DreamZero and π0.5

A Task-by-Task Evaluation of NVIDIA’s latest World Action Model in a DROID simulation environment

NVIDIA recently released DreamZero, a 14-billion-parameter world action model that promises to generalize across unseen environments and tasks, significantly outperforming state-of-the-art vision-language-action (VLA) models like π₀.₅. Given these claims, I wanted to test DreamZero firsthand across a variety of tasks. I therefore evaluate it on 12 tasks in a simulated DROID environment and compare it against π₀.₅.

The tasks are organized into three tiers: basic manipulation, manipulation with semantically more challenging instructions, and non-manipulation / adversarial tasks. The last category is included mostly out of curiosity, to see how DreamZero behaves on non-standard instructions.


Policy Overview

DreamZero introduces a new class of robot foundation model, the World Action Model, that departs from the VLA paradigm in multiple ways, including architecture and training data. Before moving to the experiments, it is worth highlighting some of these differences.

Architecture

π₀.₅ inherits its backbone from the Vision-Language Model, PaliGemma, pretrained on static image-text pairs. DreamZero initializes from Wan2.1-I2V-14B, a 14-billion-parameter image-to-video diffusion model pretrained on web-scale video data. The practical consequence is that DreamZero enters robot training already knowing how objects and scenes evolve.

Figure 1: DreamZero architecture. During training (left), the model jointly denoises video and action latents conditioned on clean video context, language, and proprioceptive state. During inference (right), the model autoregressively generates future frames and action chunks, replacing predicted frames with ground-truth observations after each chunk executes.

As shown in Figure 1, DreamZero feeds three inputs into a shared Diffusion Transformer (DiT): visual context encoded via a Variational Autoencoder; language instructions via a text encoder; and proprioceptive state via a state encoder. The DiT jointly denoises both future video frames and actions through separate output heads, motivated by the hypothesis that a model forced to predict plausible visual futures internalizes physical dynamics and thus makes better action predictions.
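To make the joint-denoising objective concrete, here is a toy numpy sketch. All dimensions, the noising schedule, and the `toy_denoiser` stub are illustrative assumptions, not DreamZero's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for encoder outputs; all dimensions are illustrative.
video_latent = rng.normal(size=(4, 8))    # latents for 4 future frames
action_chunk = rng.normal(size=(24, 7))   # 24 actions for a 7-DoF arm

def add_noise(x, t, noise):
    # Interpolate toward pure noise at diffusion time t in [0, 1].
    return (1.0 - t) * x + t * noise

def toy_denoiser(noisy, t):
    # Placeholder for the shared DiT; the real model also conditions on
    # clean video context, language, and proprioceptive state.
    return noisy * (1.0 - t)

t = rng.uniform()
noisy_video = add_noise(video_latent, t, rng.normal(size=video_latent.shape))
noisy_actions = add_noise(action_chunk, t, rng.normal(size=action_chunk.shape))

# Separate output heads -> separate reconstruction losses, optimized jointly.
video_loss = np.mean((toy_denoiser(noisy_video, t) - video_latent) ** 2)
action_loss = np.mean((toy_denoiser(noisy_actions, t) - action_chunk) ** 2)
joint_loss = video_loss + action_loss
```

The point of the shared backbone is that gradients from the video loss shape the same representation the action head reads from, which is the paper's stated mechanism for transferring physical dynamics into action prediction.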

Training Data

π₀.₅ is built on a PaliGemma backbone and co-trained with heterogeneous data sources, such as the Open X-Embodiment (OXE) dataset, custom-collected robot data from mobile and non-mobile robots across diverse home environments, and non-robotic data such as CapsFusion and COCO.

DreamZero, on the other hand, uses no non-robotic data. Instead, it trains only on robot data specific to its two experimental platforms: an AgiBot G1 bimanual robot and a Franka single-arm robot. For the AgiBot G1, the authors collected approximately 500 hours of teleoperation data across 22 real-world environments (homes, restaurants, supermarkets, and offices). For the Franka arm, they trained on the publicly available DROID dataset, which was collected exclusively with the Franka arm.

To compare π₀.₅ against DreamZero, I use the DreamZero DROID checkpoint and the finetuned checkpoint publicly released by Physical Intelligence, which adapts the base π₀.₅ model to the DROID environment.

Experimental Setup

To evaluate the policies, I run both publicly released DROID checkpoints in sim-eval’s DROID environment, which is built on Isaac Sim/Lab. I specifically use the first scene of sim-eval, which places a Franka arm at a fixed table with a Rubik’s Cube and a red plastic bowl (see Figure 2). During inference, I record the camera observations that are fed to the policies. DreamZero uses three cameras: a right external camera, a wrist camera, and a left external camera, each recorded at 448 × 448 pixels. π0.5 uses two cameras: an external camera and a wrist camera; these are rendered from the raw simulator output at 720 × 1280 RGB rather than at the model’s 224 × 224 input resolution.
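For reference, downsampling a raw simulator frame to a policy's input resolution can be sketched with a simple nearest-neighbor resize; the interpolation method each pipeline actually uses is an assumption here:

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    # Nearest-neighbor resize via integer index maps; a stand-in for
    # whatever interpolation each policy's preprocessing actually uses.
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

raw_frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # raw simulator output
pi05_input = resize_nearest(raw_frame, 224, 224)      # π0.5 input resolution
```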

Figure 2: Scene 1 of sim-eval’s DROID environment.

Tasks

Each policy receives the same 12 task instructions and has 30 seconds to complete each task. Because the policies are stochastic, each task is repeated three times. The 12 tasks are:

Basic manipulation, Tasks 1–4:

  1. “Pick up the cube and place it in the bowl”
  2. “Pick up the bowl and place it on the Rubik’s Cube”
  3. “Move the cube to the left side of the table”
  4. “Move the bowl to the right side of the table”

Advanced Semantics, Tasks 5–8:

  5. “Push the cube to the left with the outside of the gripper”
  6. “Place the smaller object inside the larger object”
  7. “Place the Rubik’s Cube on the side so the red center square is facing up”
  8. “Place the cube at the corner of the table closest to the sofa”

Adversarial / Non-Manipulation, Tasks 9–12:

  9. “Lift the table”
  10. “Throw the bowl across the room”
  11. “Look out the window”
  12. “Slam your gripper into the table”

Inference Backends

To apply the policies in sim-eval, I send observations and receive action chunks over a server-client connection. The DreamZero policy is hosted by NVIDIA and accessed via a WebSocket API. For π0.5, I use a similar setup, except the policy is hosted locally with openpi.
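Schematically, the loop looks like the sketch below. `FakePolicyClient` and all names are placeholders of my own, not the real NVIDIA or openpi client APIs:

```python
class FakePolicyClient:
    """Stand-in for a remote policy server: returns a zero action chunk."""
    def __init__(self, horizon, action_dim):
        self.horizon = horizon
        self.action_dim = action_dim

    def infer(self, observation):
        # A real client would serialize `observation` over a WebSocket
        # and deserialize the returned action chunk.
        return [[0.0] * self.action_dim for _ in range(self.horizon)]

def run_episode(client, env_step, get_obs, max_steps=450):
    """Closed loop: query a chunk, execute it open-loop, then re-observe.

    450 steps at ~15 Hz corresponds to the 30-second task budget."""
    executed = 0
    while executed < max_steps:
        chunk = client.infer(get_obs())
        for action in chunk:  # execute the whole chunk before re-querying
            env_step(action)
            executed += 1
            if executed >= max_steps:
                break
    return executed

steps = run_episode(FakePolicyClient(horizon=8, action_dim=8),
                    env_step=lambda action: None,
                    get_obs=lambda: {"images": None, "state": None})
```

The key design point is that the environment only receives fresh predictions once per chunk; everything inside a chunk runs open-loop.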

Simulation Parameters

The IsaacLab DROID environment runs at 120 Hz with decimation = 8, meaning each action is held for 8 physics ticks. This gives a control period of $\Delta t_{\text{action}} = 8 \times (1/120) \approx 0.067\,\text{s}$, i.e. a control frequency of $120 / 8 = 15\,\text{Hz}$. The two policies differ in how many steps they execute per action chunk:

  • DreamZero uses open_loop_horizon = 24, so each chunk spans $24 / 15 \approx 1.6\,\text{s}$ of control. The model receives one wrist camera and two external cameras, all at 180 × 320 pixels.
  • π0.5 uses open_loop_horizon = 8, so each chunk spans $8 / 15 \approx 0.53\,\text{s}$ of control.

For the experiments, I use the default parameters provided by the authors’ evaluation codebases.
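The timing arithmetic above can be checked in a few lines (constants transcribed from the text):

```python
# Timing arithmetic for the sim setup described above.
PHYSICS_HZ = 120
DECIMATION = 8

control_hz = PHYSICS_HZ / DECIMATION  # 15 Hz control frequency
control_dt = 1.0 / control_hz         # ~0.067 s per action

def chunk_duration(open_loop_horizon):
    # Seconds of control covered by one action chunk.
    return open_loop_horizon * control_dt

dreamzero_span = chunk_duration(24)  # DreamZero: 1.6 s per chunk
pi05_span = chunk_duration(8)        # π0.5: ~0.53 s per chunk
```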

Evaluation

Each task consists of 3 independent episodes, resulting in 36 episodes total. Each episode is scored on a pass/fail basis using a qualitative criterion: if a policy’s behavior broadly corresponds to the task instruction, regardless of execution precision, the episode is counted as a success. Partial attempts that do not meaningfully satisfy the instruction are scored as failures. When both policies succeed in the same episode, the verdict is recorded as a Tie; when both fail, it is recorded as a Fail.
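The per-episode scoring rule can be written down directly (the function name is mine, not from either codebase):

```python
def episode_verdict(dz_success, pi05_success):
    # Pass/fail scoring rule: Tie if both succeed, Fail if both fail,
    # otherwise the sole succeeding policy wins the episode.
    if dz_success and pi05_success:
        return "Tie"
    if not dz_success and not pi05_success:
        return "Fail"
    return "DZ" if dz_success else "π0.5"
```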

Category 1: Basic Manipulation

The first category evaluates both policies on basic manipulation tasks. The videos show the observations that the policies receive during inference.

Task 1: “Pick up the cube and place it in the bowl”

Video 1: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Succeeds; picks it up again later | Succeeds | Tie |
| 2 | Succeeds; tries picking up again | Succeeds | Tie |
| 3 | Fails twice, tries to recover | Succeeds | π0.5 |

DreamZero mostly succeeds but occasionally seems hesitant. π0.5 executes this task without problems.

Task 2: “Pick up the bowl and place it on the Rubik’s Cube”

Video 2: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Picks up bowl and drops it on cube; bowl falls to the ground | Picks up bowl and drops it on cube; bowl falls to one side and leans | Fail |
| 2 | Picks up bowl, drops it on cube, misses, then tries to recover | Similar to the previous episode | Fail |
| 3 | Picks up bowl and drops it on cube; bowl ends up halfway on cube | Similar to the previous episode | Fail |

Both models struggle with secure placement due to the small surface area of the Rubik’s Cube.

Task 3: “Move the cube to the left side of the table”

Video 3: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Succeeds | Barely moves cube to the left side of the table | Tie |
| 2 | Succeeds | Similar to the previous episode | Tie |
| 3 | Succeeds | Succeeds | Tie |

DreamZero moves the Rubik’s Cube to the left side of the table in all three episodes. π0.5 also moves the cube leftward in all episodes, though in the first two it barely crosses to the left. Since the cube ends up more on the left side of the table than the right in every episode, I score all three episodes as successes.

Task 4: “Move the bowl to the right side of the table”

Video 4: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Succeeds; keeps interacting with the bowl afterwards | Moves bowl but not to the right | DZ |
| 2 | Similar to the previous episode | Moves bowl to the right, less far but correct side | Tie |
| 3 | Similar to the previous episode | Moves bowl to the right and drops it | Tie |

DreamZero succeeds in all episodes. π0.5 achieves the correct direction in episodes 2 and 3.

Category 2: Advanced Semantics

These tasks test whether the models can ground abstract semantic descriptions such as relative size, color, and spatial landmarks into correct actions. Unlike canonical pick-and-place examples, these tasks require reasoning about object properties or following constraints that might be outside of the original training distribution.

Task 5: “Push the cube to the left with the outside of the gripper”

Video 5: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Slightly pushes cube with outside of gripper; motion is to the left in the wrist camera view but to the right in the external camera view | Tries to grab the cube, then touches it with outside of gripper; no pushing motion | Fail |
| 2 | Tries to push cube but fails | Grabs the cube; does not use outside of gripper | Fail |
| 3 | Tries to push cube but fails | Grabs cube again and moves it | Fail |

DreamZero shows a genuine attempt at the “outside of the gripper” constraint in all episodes. Interestingly, in episode 1, DreamZero does slightly push the cube; however, the push goes to the left from the wrist camera’s viewpoint and to the right from the external cameras’ viewpoints. Since DreamZero used the external cameras as its frame of reference in the previous tasks, I score episode 1 as a failure. That said, a human might struggle with this task as well, since the instruction leaves the reference viewpoint ambiguous.

π0.5 reverts to its default grasp-and-move behavior, ignoring the gripper constraint entirely.

Task 6: “Place the smaller object inside the larger object”

Video 6: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Succeeds | Succeeds | Tie |
| 2 | Succeeds | Succeeds | Tie |
| 3 | Succeeds | Succeeds | Tie |

Both models correctly place the cube in the bowl. Whether this reflects genuine relative-size reasoning or a common pick-and-place pattern from DROID training is hard to tell from this task alone.

Task 7: “Place the Rubik’s Cube on the side so the red center square is facing up”

Video 7: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Fails to turn cube | Fails to manipulate cube | Fail |
| 2 | Turns the cube to the correct side with one swift movement | Fails to manipulate cube | DZ |
| 3 | Turns cube, but to the wrong side | Fails to manipulate cube | Fail |

The color-conditioned orientation task requires both fine manipulation and pose reasoning. π0.5 does not engage the cube at all. DreamZero tries to turn the cube each episode and succeeds in turning it to the correct side in episode 2.

Task 8: “Place the cube at the corner of the table closest to the sofa”

Video 8: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Doesn’t place cube close to sofa | Does not pick up cube, freezes | Fail |
| 2 | Doesn’t place cube close to sofa | Freezes | Fail |
| 3 | Doesn’t place cube close to sofa | Freezes | Fail |

Both models fail. DreamZero picks up the cube but does not place it at the correct location. π0.5 does not pick up the cube at all.

Category 3: Adversarial / Non-Manipulation

This category includes tasks that are either physically impossible given the environment (Task 9, since the table is fixed) or require behaviors well outside the manipulation domain the policies were trained on. Though it is unlikely a robot would ever be asked to perform these tasks in the real world, I was curious to see how the policies would behave, given that a child would intuitively understand what to do.

Task 9: “Lift the table”

Video 9: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Makes an effort to go to the side of the table and attempts to lift | Focuses on the cube | DZ |
| 2 | Makes another effort, tries gripping the end of the table | Focuses on the cube | DZ |
| 3 | Tries grabbing the far edge of the table | Focuses on the bowl | DZ |

Since the table is fixed in the environment, this task tests whether the policy makes a reasonable effort. DreamZero clearly tries to manipulate the table. π0.5 falls back to manipulating the objects on the table.

Task 10: “Throw the bowl across the room”

Video 10: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Picks up bowl and drops it; no throwing | Picks up bowl, moves it across the table, then drops it; no throwing | Fail |
| 2 | Picks up bowl and drops it; no throwing | Similar to the previous episode | Fail |
| 3 | Picks up bowl and drops it; no throwing | Similar to the previous episode | Fail |

Neither model generates a throwing motion.

Task 11: “Look out the window”

Video 11: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Looks out the window | Moves toward window, but gets distracted by the table | DZ |
| 2 | Looks out the window | Briefly tries to look toward window, then returns to table | DZ |
| 3 | Looks out the window | Looks around the room, then back at the table | DZ |

DreamZero succeeds in all three episodes, an indication of its video-grounded world understanding. π0.5 tries occasionally but gets distracted by the table.

Task 12: “Slam your gripper into the table”

Video 12: Top: DreamZero (ext. cam 1, wrist, ext. cam 2); Bottom: π0.5 (ext. cam, wrist).
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|----------------|------|---------|
| 1 | Doesn’t touch the table or make any fast movement | Doesn’t touch the table or make any fast movement | Fail |
| 2 | Similar to the previous episode | Similar to the previous episode | Fail |
| 3 | Similar to the previous episode | Similar to the previous episode | Fail |

This task tests how the policies react to adversarial instructions. Neither policy touches the table or makes any movement that resembles slamming the gripper.

Conclusion

The table below summarizes results across all 36 episodes.

| # | Task | DZ Wins | π0.5 Wins | Ties | Both Fail |
|---|------|---------|-----------|------|-----------|
| 1 | Pick up cube and place in bowl | 0 | 1 | 2 | 0 |
| 2 | Pick up bowl and place on Rubik’s Cube | 0 | 0 | 0 | 3 |
| 3 | Move the cube to the left side of table | 0 | 0 | 3 | 0 |
| 4 | Move the bowl to the right side of the table | 1 | 0 | 2 | 0 |
| 5 | Push cube left with outside of gripper | 0 | 0 | 0 | 3 |
| 6 | Place the smaller object inside the larger object | 0 | 0 | 3 | 0 |
| 7 | Orient cube so red center square faces up | 1 | 0 | 0 | 2 |
| 8 | Place cube at corner closest to sofa | 0 | 0 | 0 | 3 |
| 9 | Lift the table | 3 | 0 | 0 | 0 |
| 10 | Throw the bowl across the room | 0 | 0 | 0 | 3 |
| 11 | Look out the window | 3 | 0 | 0 | 0 |
| 12 | Slam your gripper into the table | 0 | 0 | 0 | 3 |
| **Total** | | 8 | 1 | 10 | 17 |
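As a sanity check on the bookkeeping, the column totals can be recomputed from the per-task rows:

```python
# Per-task tallies transcribed from the summary table:
# (DZ wins, π0.5 wins, ties, both fail), three episodes per task.
rows = [
    (0, 1, 2, 0),  # 1  cube in bowl
    (0, 0, 0, 3),  # 2  bowl on cube
    (0, 0, 3, 0),  # 3  cube to the left
    (1, 0, 2, 0),  # 4  bowl to the right
    (0, 0, 0, 3),  # 5  push with outside of gripper
    (0, 0, 3, 0),  # 6  smaller object in larger object
    (1, 0, 0, 2),  # 7  red center square up
    (0, 0, 0, 3),  # 8  corner closest to sofa
    (3, 0, 0, 0),  # 9  lift the table
    (0, 0, 0, 3),  # 10 throw the bowl
    (3, 0, 0, 0),  # 11 look out the window
    (0, 0, 0, 3),  # 12 slam gripper into table
]
totals = tuple(sum(col) for col in zip(*rows))
assert totals == (8, 1, 10, 17)   # matches the Total row
assert sum(totals) == 12 * 3      # every episode accounted for
```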

From the qualitative tests, several observations stand out.

The two policies perform similarly on basic manipulation, with results largely consistent across Tasks 1, 3, and 4. Manipulation requiring precision, as in Task 2, causes both to fail. Both models also show some spatial understanding, as reflected in Tasks 3 and 4, though it is hard to assess for π₀.₅ in Task 3, given the qualitative nature of the tests. DreamZero further shows confusion about which direction it treats as left and right, as seen in Task 5.

Both models show some semantic understanding, as seen in Task 6, with DreamZero having a slight edge in Task 7. Most semantic tasks, however, prove too challenging for either policy: Tasks 5, 7, and 8 all end in failure.

On the adversarial and non-manipulation tasks, π₀.₅ fails completely. DreamZero, however, succeeds on Tasks 9 and 11, giving the strongest indication that the Wan2.1-I2V-14B backbone transfers to tasks DreamZero has never seen before. What is surprising is that DreamZero comes up with reasonable actions in Tasks 9 and 11 but completely fails in Tasks 10 and 12, even though neither appears to be semantically harder.

Key Takeaways

  1. DreamZero and π₀.₅ show comparable overall manipulation skills, but both seem to struggle on high-precision tasks, a limitation the DreamZero authors also noted.

  2. DreamZero can adapt better to unseen and non-manipulation tasks. π₀.₅ is clearly limited and often reverts to pick-and-place behavior regardless of instruction, an observation the DreamZero authors also made.

  3. Both models show some spatial and semantic understanding, but also feature clear limitations. Both perform poorly on more complex, semantically challenging tasks, such as “placing the cube at the corner of the table that is closest to the sofa”.

  4. DreamZero succeeds at some non-manipulation tasks. However, in other tasks like “throw the bowl across the room”, DreamZero is unable to produce any throwing motion, which seems surprising given its image-to-video diffusion backbone has likely seen throwing motions in training.

Since the tests were qualitative and limited in sample size, none of these observations or takeaways should be treated as definitive claims. More analysis is needed, which I leave for another day. If you have comments or questions, reach out to me on LinkedIn.
