Zero-Shot Policy Comparison of DreamZero and π0.5
A Task-by-Task Evaluation of NVIDIA’s latest World Action Model in a DROID simulation environment
NVIDIA recently released DreamZero, a 14-billion-parameter world action model that reportedly generalizes across unseen environments and tasks, significantly outperforming state-of-the-art vision-language-action (VLA) models like π₀.₅. Given this promise, I wanted to see firsthand how well it performs. To that end, I evaluate DreamZero on 12 tasks in a simulated DROID environment and compare it against π₀.₅.
The tasks are organized into three tiers: basic manipulation, manipulation with semantically more challenging instructions, and non-manipulation / adversarial tasks. The last category is included mostly out of curiosity, to see how DreamZero behaves on tasks outside standard manipulation.
Policy Overview
DreamZero introduces a new class of robot foundation model, the World Action Model, that departs from the VLA paradigm in multiple ways, including architecture and training data. Before moving to the experiments, it is worth highlighting some of these differences.
Architecture
π₀.₅ inherits its backbone from the Vision-Language Model, PaliGemma, pretrained on static image-text pairs. DreamZero initializes from Wan2.1-I2V-14B, a 14-billion-parameter image-to-video diffusion model pretrained on web-scale video data. The practical consequence is that DreamZero enters robot training already knowing how objects and scenes evolve.

As shown in Figure 1, DreamZero feeds three inputs into a shared Diffusion Transformer (DiT): visual context encoded via a Variational Autoencoder; language instructions via a text encoder; and proprioceptive state via a state encoder. The DiT jointly denoises both future video frames and actions through separate output heads, motivated by the hypothesis that a model forced to predict plausible visual futures internalizes physical dynamics and thus makes better action predictions.
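The structural point — that video and action latents are denoised jointly by one shared model with two output heads — can be illustrated with a toy sketch. Everything below (shapes, the number of steps, the update rule) is invented for illustration and bears no resemblance to the actual DiT:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only
N_STEPS = 4              # denoising steps
VIDEO_SHAPE = (8, 16)    # (latent frames, latent dim)
ACTION_SHAPE = (24, 7)   # (chunk length, action dim) for a 7-DoF arm

def shared_dit(video_latents, action_latents, context):
    """Stand-in for the shared Diffusion Transformer: here just a
    context-conditioned shrink toward zero, with two output heads."""
    return 0.5 * video_latents + context, 0.5 * action_latents + context

def sample(context=0.0):
    # Both modalities start from pure noise and are denoised jointly,
    # so the action head is conditioned on the evolving video latents.
    video = rng.standard_normal(VIDEO_SHAPE)
    actions = rng.standard_normal(ACTION_SHAPE)
    for _ in range(N_STEPS):
        video, actions = shared_dit(video, actions, context)
    return actions  # only the action chunk is executed on the robot

actions = sample()
print(actions.shape)  # (24, 7): one 24-step chunk of 7-DoF actions
```

The design choice worth noting is that the video head is only a training-time pressure: at deployment, the action chunk is what reaches the robot.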
Training Data
π₀.₅ is built on a PaliGemma backbone and co-trained with heterogeneous data sources, such as the Open X-Embodiment (OXE) dataset, custom-collected robot data from mobile and non-mobile robots across diverse home environments, and non-robotic data such as CapsFusion and COCO.
DreamZero, on the other hand, does not use any non-robotic data. Instead, DreamZero only trains on robot data specific to the experimental platforms used for testing, an AgiBot G1 bimanual robot and a Franka single-arm robot. For the AgiBot G1, the authors collected approximately 500 hours of teleoperation data across 22 real-world environments (homes, restaurants, supermarkets, and offices). For the Franka arm, the authors trained on the publicly available DROID dataset, which was collected exclusively with the Franka arm.
To compare π₀.₅ against DreamZero, I use the DreamZero DROID checkpoint and the finetuned checkpoint publicly released by Physical Intelligence, which adapts the base π₀.₅ model to the DROID environment.
Experimental Setup
To evaluate the policies, I run both publicly released DROID checkpoints in sim-eval’s DROID environment, which is built on Isaac Sim/Lab. I specifically use the first scene of sim-eval, which places a Franka arm at a fixed table with a Rubik’s Cube and a red plastic bowl (see Figure 2). During inference, I record the camera observations that the policies receive as input. DreamZero uses three cameras: a right external camera, a wrist camera, and a left external camera, each recorded at 448 × 448 pixels. π0.5 uses two cameras: an external camera and a wrist camera; these are recorded from the raw simulator output at 720 × 1280 RGB rather than at the model’s 224 × 224 input resolution.

Tasks
Each policy receives the same 12 task instructions and has 30 seconds to complete each one. Because the policies are stochastic, each task is repeated three times. The 12 tasks are:
Basic manipulation, Tasks 1–4:
- “Pick up the cube and place it in the bowl”
- “Pick up the bowl and place it on the Rubik’s Cube”
- “Move the cube to the left side of the table”
- “Move the bowl to the right side of the table”
Advanced Semantics, Tasks 5–8:
- “Push the cube to the left with the outside of the gripper”
- “Place the smaller object inside the larger object”
- “Place the Rubik’s Cube on the side so the red center square is facing up”
- “Place the cube at the corner of the table closest to the sofa”
Adversarial / Non-Manipulation, Tasks 9–12:
- “Lift the table”
- “Throw the bowl across the room”
- “Look out the window”
- “Slam your gripper into the table”
Inference Backends
To apply the policies in sim-eval, I send observations and receive action chunks via a server-client connection. In the DreamZero case, the policy is hosted by NVIDIA and receives observations and sends action chunks via a WebSocket API. For π0.5, I use a similar setup, except π0.5 is hosted locally with openpi.
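A minimal sketch of what such a client-side exchange might look like, assuming a hypothetical JSON message schema (the real DreamZero and openpi APIs define their own formats, which I do not reproduce here):

```python
import base64
import json

import numpy as np

def pack_observation(images, proprio, instruction):
    """Serialize one observation into a JSON message for the policy server.
    The field names here are made up for illustration."""
    return json.dumps({
        "instruction": instruction,
        "state": proprio.tolist(),
        "images": {name: base64.b64encode(img.tobytes()).decode("ascii")
                   for name, img in images.items()},
    })

def unpack_action_chunk(message):
    """Deserialize the server's reply into a (horizon, dof) array."""
    reply = json.loads(message)
    return np.asarray(reply["actions"], dtype=np.float32)

# Round-trip with a fake server reply instead of a live WebSocket
obs = pack_observation(
    {"ext_cam": np.zeros((4, 4, 3), np.uint8)}, np.zeros(8), "pick up the cube")
fake_reply = json.dumps({"actions": [[0.0] * 7] * 8})
chunk = unpack_action_chunk(fake_reply)
print(chunk.shape)  # (8, 7): one action chunk of 8 steps
```

In the actual experiments these messages travel over a WebSocket connection; the sketch only shows the serialize/deserialize cycle that happens once per action chunk.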
Simulation Parameters
The IsaacLab DROID environment runs at 120 Hz with decimation = 8, meaning each action is held for 8 physics ticks. This gives a control period of $\Delta t_{\text{action}} = 8 \times (1/120) \approx 0.067\,\text{s}$, i.e. a control frequency of roughly 15 Hz. The two policies differ in how many steps they execute per action chunk:
- DreamZero uses open_loop_horizon = 24, so each chunk spans $24 / 15 = 1.6\,\text{s}$ of control. The model receives one wrist camera and two external cameras, all at 180 × 320 pixels.
- π0.5 uses open_loop_horizon = 8, so each chunk spans $8 / 15 \approx 0.53\,\text{s}$ of control.
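The timing numbers above can be checked with a few lines of arithmetic:

```python
# Control timing for the IsaacLab DROID environment
PHYSICS_HZ = 120   # physics simulation rate
DECIMATION = 8     # physics ticks per action step

control_dt = DECIMATION / PHYSICS_HZ   # seconds per action step (~0.067 s)
control_hz = 1 / control_dt            # control frequency: 15 Hz

# Chunk durations at the two open-loop horizons used in the experiments
dreamzero_chunk_s = 24 * control_dt    # 1.6 s per DreamZero chunk
pi05_chunk_s = 8 * control_dt          # ~0.53 s per π0.5 chunk

print(round(control_hz, 2), round(dreamzero_chunk_s, 2), round(pi05_chunk_s, 2))
```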
For the experiments, I use the default parameters provided by the authors’ evaluation codebases.
Evaluation
Each task consists of 3 independent episodes, resulting in 36 episodes total. Each episode is scored on a pass/fail basis using a qualitative criterion: if a policy’s behavior broadly corresponds to the task instruction, regardless of execution precision, the episode is counted as a success. Partial attempts that do not meaningfully satisfy the instruction are scored as failures. When both policies succeed in the same episode, the verdict is recorded as a Tie; when both fail, it is recorded as a Fail.
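The per-episode verdict logic can be written down compactly (a sketch; `episode_verdict` is my own helper, not part of either codebase):

```python
def episode_verdict(dz_success: bool, pi_success: bool) -> str:
    """Per-episode verdict under the pass/fail scheme described above."""
    if dz_success and pi_success:
        return "Tie"
    if not dz_success and not pi_success:
        return "Fail"
    return "DZ" if dz_success else "π0.5"

# Example: Task 1 episode outcomes as (DreamZero, π0.5) pairs
task1 = [(True, True), (True, True), (False, True)]
print([episode_verdict(dz, pi) for dz, pi in task1])  # ['Tie', 'Tie', 'π0.5']
```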
Category 1: Basic Manipulation
The first category evaluates both policies on basic manipulation tasks. The videos show the observations that the policies receive during inference.
Task 1: “Pick up the cube and place it in the bowl”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Succeeds. Picks it up again later | Succeeds | Tie |
| 2 | Succeeds. Tries picking up again | Succeeds | Tie |
| 3 | Fails twice, tries to recover | Succeeds | π0.5 |
DreamZero mostly succeeds but occasionally seems hesitant. π0.5 executes this task without problems.
Task 2: “Pick up the bowl and place it on the Rubik’s Cube”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Picks up bowl and drops it on the cube; bowl falls to the ground | Picks up bowl and drops it on the cube; bowl falls to one side and leans | Fail |
| 2 | Picks up bowl, drops it on the cube, misses, then tries to recover | Similar to the previous episode | Fail |
| 3 | Picks up bowl and drops it on the cube; bowl ends up halfway on the cube | Similar to the previous episode | Fail |
Both models struggle with secure placement due to the small surface area of the Rubik’s Cube.
Task 3: “Move the cube to the left side of the table”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Succeeds | Barely moves cube to the left side of the table | Tie |
| 2 | Succeeds | Similar to the previous episode | Tie |
| 3 | Succeeds | Succeeds | Tie |
DreamZero moves the Rubik’s Cube to the left side of the table in all three episodes. π0.5 also moves the cube leftward in every episode, though in the first two it barely crosses to the left. Since in all episodes the cube ends up more on the left side of the table than the right, I score every episode as a success.
Task 4: “Move the bowl to the right side of the table”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Succeeds; keeps interacting with the bowl afterwards | Moves bowl but not to the right | DZ |
| 2 | Similar to the previous episode | Moves bowl to the right, less far but correct side | Tie |
| 3 | Similar to the previous episode | Moves bowl to the right and drops it | Tie |
DreamZero succeeds in all episodes. π0.5 achieves the correct direction in episodes 2 and 3.
Category 2: Advanced Semantics
These tasks test whether the models can ground abstract semantic descriptions such as relative size, color, and spatial landmarks into correct actions. Unlike canonical pick-and-place examples, these tasks require reasoning about object properties or following constraints that might be outside of the original training distribution.
Task 5: “Push the cube to the left with the outside of the gripper”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Slightly pushes cube with outside of gripper; motion appears to the left in the wrist camera view but to the right in the external camera view | Tries to grab the cube, then touches it with outside of gripper; no pushing motion | Fail |
| 2 | Tries to push cube but fails | Grabs the cube; not using outside of gripper | Fail |
| 3 | Tries to push cube but fails | Grabs cube again and moves it | Fail |
DreamZero attempts to honor the “outside of the gripper” constraint in all episodes. Interestingly, in episode 1 DreamZero slightly pushes the cube, but the motion appears to the left from the wrist camera’s viewpoint and to the right from the external cameras’ viewpoints. Since DreamZero used the external cameras as its reference in the previous tasks, I score episode 1 as a failure. That said, a human might struggle with this task as well, given the viewpoint ambiguity in the instruction.
π0.5 reverts to its default grasp-and-move behavior, ignoring the gripper constraint entirely.
Task 6: “Place the smaller object inside the larger object”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Succeeds | Succeeds | Tie |
| 2 | Succeeds | Succeeds | Tie |
| 3 | Succeeds | Succeeds | Tie |
Both models correctly place the cube in the bowl. Whether this reflects genuine relative-size reasoning or a common pick-and-place pattern from DROID training is hard to tell from this task alone.
Task 7: “Place the Rubik’s Cube on the side so the red center square is facing up”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Fails to turn cube | Fails to manipulate cube | Fail |
| 2 | Moves the cube with one swift movement | Fails to manipulate cube | DZ |
| 3 | Turns cube but wrong side | Fails to manipulate cube | Fail |
The color-conditioned orientation task requires both fine manipulation and pose reasoning. π0.5 does not engage the cube at all. DreamZero tries to turn the cube each episode and succeeds in turning it to the correct side in episode 2.
Task 8: “Place the cube at the corner of the table closest to the sofa”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Doesn’t place cube close to sofa | Does not pick up cube, freezes | Fail |
| 2 | Doesn’t place cube close to sofa | Freezes | Fail |
| 3 | Doesn’t place cube close to sofa | Freezes | Fail |
Both models fail. DreamZero picks up the cube but does not place it at the correct location. π0.5 does not pick up the cube at all.
Category 3: Adversarial / Non-Manipulation
This category includes tasks that are either physically impossible given the environment (Task 9, since the table is fixed) or require behaviors well outside the manipulation domain the policies were trained on. Though it is unlikely a robot would ever be asked to perform these tasks in the real world, I was curious to see how the policies would behave, given that a child would intuitively understand what to do.
Task 9: “Lift the table”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Makes an effort to go to the side of the table and attempts to lift | Focuses on the cube | DZ |
| 2 | Makes another effort, tries gripping the end of the table | Focuses on the cube | DZ |
| 3 | Tries grabbing the far edge of the table | Focuses on the bowl | DZ |
Since the table is fixed in the environment, this task tests whether the policy makes a reasonable effort. DreamZero clearly tries to manipulate the table. π0.5 falls back to manipulating the objects on the table.
Task 10: “Throw the bowl across the room”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Picks up bowl and drops it. No throwing | Picks up bowl, moves it across the table, then drops it. No throwing | Fail |
| 2 | Picks up bowl and drops it. No throwing | Similar to the previous episode | Fail |
| 3 | Picks up bowl and drops it. No throwing | Similar to the previous episode | Fail |
Neither model generates a throwing motion.
Task 11: “Look out the window”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Looks out the window | Moves toward window, but gets distracted by the table | DZ |
| 2 | Looks out the window | Briefly tries to look toward window, then returns to table | DZ |
| 3 | Looks out the window | Looks around the room, then back at the table | DZ |
DreamZero succeeds in all three episodes, an indication of its video-grounded world understanding. π0.5 tries occasionally but gets distracted by the table.
Task 12: “Slam your gripper into the table”
| # | DreamZero (DZ) | π0.5 | Verdict |
|---|---|---|---|
| 1 | Does not touch the table or make any fast movement | Does not touch the table or make any fast movement | Fail |
| 2 | Similar to the previous episode | Similar to the previous episode | Fail |
| 3 | Similar to the previous episode | Similar to the previous episode | Fail |
This task tests how the policies react to adversarial instructions. Neither policy touches the table or makes any movement that resembles slamming the gripper.
Conclusion
The table below summarizes results across all 36 episodes.
| # | Task | DZ Wins | π0.5 Wins | Ties | Both Fail |
|---|---|---|---|---|---|
| 1 | Pick up cube and place in bowl | 0 | 1 | 2 | 0 |
| 2 | Pick up bowl and place on Rubik’s Cube | 0 | 0 | 0 | 3 |
| 3 | Move the cube to the left side of the table | 0 | 0 | 3 | 0 |
| 4 | Move the bowl to the right side of the table | 1 | 0 | 2 | 0 |
| 5 | Push cube left with outside of gripper | 0 | 0 | 0 | 3 |
| 6 | Place the smaller object inside the larger object | 0 | 0 | 3 | 0 |
| 7 | Orient cube so red center square faces up | 1 | 0 | 0 | 2 |
| 8 | Place cube at corner closest to sofa | 0 | 0 | 0 | 3 |
| 9 | Lift the table | 3 | 0 | 0 | 0 |
| 10 | Throw the bowl across the room | 0 | 0 | 0 | 3 |
| 11 | Look out the window | 3 | 0 | 0 | 0 |
| 12 | Slam your gripper into the table | 0 | 0 | 0 | 3 |
| | Total | 8 | 1 | 10 | 17 |
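As a sanity check, summing the per-task counts (transcribed from the per-task episode tables) reproduces the totals:

```python
# (DZ wins, π0.5 wins, Ties, Both fail) per task, Tasks 1-12
per_task = [
    (0, 1, 2, 0), (0, 0, 0, 3), (0, 0, 3, 0), (1, 0, 2, 0),
    (0, 0, 0, 3), (0, 0, 3, 0), (1, 0, 0, 2), (0, 0, 0, 3),
    (3, 0, 0, 0), (0, 0, 0, 3), (3, 0, 0, 0), (0, 0, 0, 3),
]
totals = tuple(sum(col) for col in zip(*per_task))
print(totals, sum(totals))  # (8, 1, 10, 17) 36
```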
From the qualitative tests, several observations stand out.
The two policies perform similarly on basic manipulation, with results largely consistent across Tasks 1, 3, and 4. Manipulation requiring precision, as in Task 2, causes both to fail. Both models also show some spatial understanding, as reflected in Tasks 3 and 4, though this is hard to assess for π₀.₅ in Task 3, given the qualitative nature of the tests. DreamZero further shows some confusion about which viewpoint defines left and right, as seen in Task 5.
Both models show some semantic understanding, as seen in Task 6, with DreamZero having a slight edge in Task 7. Most semantic tasks, however, prove too challenging for either policy: Tasks 5, 7, and 8 all end in failure.
On the adversarial and non-manipulation tasks, π₀.₅ fails completely. DreamZero, however, succeeds on Tasks 9 and 11, giving the strongest indication that the Wan2.1-I2V-14B backbone transfers to tasks DreamZero has never seen before. What is surprising is that DreamZero comes up with reasonable actions in Tasks 9 and 11 but completely fails in Tasks 10 and 12, even though neither appears to be semantically harder.
Key Takeaways
DreamZero and π₀.₅ show comparable overall manipulation skills, but both seem to struggle on high-precision tasks, a limitation the DreamZero authors also noted.
DreamZero can adapt better to unseen and non-manipulation tasks. π₀.₅ is clearly limited and often reverts to pick-and-place behavior regardless of instruction, an observation the DreamZero authors also made.
Both models show some spatial and semantic understanding, but also feature clear limitations. Both perform poorly on more complex, semantically challenging tasks, such as “placing the cube at the corner of the table that is closest to the sofa”.
DreamZero succeeds at some non-manipulation tasks. However, in other tasks like “throw the bowl across the room”, DreamZero is unable to produce any throwing motion, which seems surprising given its image-to-video diffusion backbone has likely seen throwing motions in training.
Since the tests were qualitative and limited in sample size, none of these observations or takeaways should be treated as definitive claims. More analysis is needed, which I leave for another day. If you have comments or questions, reach out to me on LinkedIn.
References
- DreamZero: World action models are zero-shot policies — S. Ye et al., arXiv:2602.15922, 2026
- π₀.₅: A vision-language-action model with open-world generalization — Physical Intelligence et al., arXiv:2504.16054, 2025
- DROID: A large-scale in-the-wild robot manipulation dataset — A. Khazatsky et al., arXiv:2403.12945, 2024
- PaliGemma: A versatile 3B VLM for transfer — L. Beyer et al., arXiv:2407.07726, 2024
- Isaac Sim — NVIDIA
- Wan: Open and advanced large-scale video generative models — Team Wan et al., arXiv:2503.20314, 2025
- Open X-Embodiment: Robotic learning datasets and RT-X models — Open X-Embodiment Collaboration, arXiv:2310.08864, 2023
- CapsFusion: Rethinking image-text data at scale — Q. Yu et al., CVPR 2024
- Microsoft COCO Captions: Data collection and evaluation server — X. Chen et al., arXiv:1504.00325, 2015
- sim-evals — A. Jain et al., GitHub
- openpi — Physical Intelligence, GitHub