Two-Legged Spot: Teaching a Quadruped to Only Use Its Hind Legs for Locomotion
Spot moving on its hind legs after 38,000 training iterations.

Spot is a quadruped robot built by Boston Dynamics. Its four-legged morphology is well-suited for stable locomotion across uneven terrain. However, we wanted to find out whether Spot could join the humanoid wave and move using only its hind legs.
To that end, we trained a control policy entirely in simulation: a neural network that maps sensor observations to joint commands, optimized over tens of thousands of iterations against a shaped reward function. The trained policy was then transferred to physical hardware. Most of the engineering work was not in the RL algorithm itself, but in constructing a reward function that produced the behavior we actually wanted.
The following sections first show how we trained a normal quadruped walking policy and then gradually pushed Spot onto its hind legs through reward shaping. From there, we cover the iterative engineering decisions we made to elicit actual standing behavior. Finally, we show how we deployed the trained policy in the real world to assess how well it transferred from simulation.
The Simulation Stack: Isaac Lab and RSL-RL
Training a locomotion policy on real hardware is impractical: a single policy update requires thousands of environment interactions, each of which takes wall-clock time and risks mechanical damage. The standard approach is to train entirely in simulation and transfer the resulting policy to the real robot, a process known as sim-to-real transfer.
We used NVIDIA Isaac Lab, a GPU-accelerated robotics simulation framework built on top of PhysX. Because the entire physics simulation runs on the GPU, Isaac Lab can run thousands of independent environments simultaneously. In our setup, we ran 8,192 parallel copies of Spot during training — each instance collecting experience independently, all contributing to the same policy update (Figure 1).

For the RL algorithm, we used Proximal Policy Optimization (PPO) from the RSL-RL library, developed at ETH Zürich’s Robotic Systems Lab for legged locomotion research. PPO is an on-policy actor-critic method: the policy (actor) proposes actions, a value function (critic) estimates future returns, and both are updated together using a clipped surrogate objective that prevents destabilizingly large parameter updates. Each training iteration collects a fresh batch of experience across all 8,192 environments. With num_steps_per_env=24, that is roughly 196,000 transitions per update step. The algorithm then runs 5 optimization epochs over that batch before discarding it.
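As a sanity check, the batch arithmetic works out as follows (plain Python, not the RSL-RL API; the variable names loosely follow its config conventions):

```python
# One PPO update's worth of experience in our setup.
num_envs = 8192            # parallel Isaac Lab environments
num_steps_per_env = 24     # rollout length per environment per iteration
transitions_per_update = num_envs * num_steps_per_env
print(transitions_per_update)  # 196608 — the "roughly 196,000" per update
num_learning_epochs = 5    # each batch is reused for 5 epochs, then discarded
```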
All training was performed on a single workstation with an Intel Core i9-14900HX @ 2.2 GHz, 64 GB RAM, and an NVIDIA GeForce RTX 4090 (Laptop).
Getting Spot to Walk
Before attempting bipedal locomotion, we established a working baseline: train Spot to walk normally on all four legs following velocity commands. The baseline run produced a checkpoint for subsequent warm-starting and confirmed that the simulation setup was correct.
Reward Design for Quadruped Locomotion
In Isaac Lab’s locomotion framework, behavior is specified entirely by a reward function — a weighted sum of roughly 15 scalar terms evaluated at every environment step. The policy never receives task instructions; it only sees observations and experiences scalar consequences. Getting the reward right is the central engineering task.
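In code, that reward is nothing more than a weighted sum evaluated each step. A minimal sketch, with illustrative term names and values (not our actual weights):

```python
def total_reward(term_values, weights):
    """Per-step reward: a weighted sum of scalar terms, r = sum_i w_i * r_i."""
    return sum(weights[name] * r for name, r in term_values.items())

# Illustrative step: two task rewards and one regularization penalty.
weights = {"base_linear_velocity": 5.0, "air_time": 5.0, "joint_torques": -1e-3}
terms = {"base_linear_velocity": 0.9, "air_time": 0.4, "joint_torques": 120.0}
print(total_reward(terms, weights))  # ≈ 6.38
```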
For quadruped locomotion, the reward structure divides into two categories. Task rewards incentivize the desired behavior: base_linear_velocity and base_angular_velocity reward tracking the commanded velocity using exponential kernels ($r = e^{-\text{err}/\sigma}$, where $\text{err}$ is the velocity tracking error and $\sigma$ is a softness parameter), air_time shapes gait timing by rewarding each foot for spending close to 0.3 seconds in the air or in contact per stride, and foot_clearance encourages proper swing height. Regularization penalties constrain the motion: action_smoothness penalizes jerky control, joint_torques and joint_vel discourage energetically wasteful motion, foot_slip penalizes sliding contacts, and base_motion discourages unnecessary vertical bouncing.
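A sketch of the exponential tracking kernel, assuming the error is a squared L2 velocity error (a common choice) and an illustrative σ of 0.25:

```python
import math

def track_lin_vel_exp(cmd_vel, actual_vel, sigma=0.25):
    """Exponential-kernel tracking reward: r = exp(-err / sigma).

    err is taken here as the squared L2 velocity tracking error; sigma
    controls how sharply reward falls off. Both choices are illustrative.
    """
    err = sum((c - a) ** 2 for c, a in zip(cmd_vel, actual_vel))
    return math.exp(-err / sigma)

# Perfect tracking yields the maximum reward of 1.0; errors decay smoothly.
print(track_lin_vel_exp([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(track_lin_vel_exp([1.0, 0.0], [0.5, 0.0]))  # exp(-1) ≈ 0.37
```

The smooth decay is the point: unlike a hard error threshold, the kernel gives the optimizer a usable gradient even when tracking is poor.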
Training the Baseline Locomotion Policy
With this reward structure in place, we ran training from a random initialization. The progression is typical of locomotion RL: at initialization the policy produces random joint commands and the robot immediately falls (Figure 2); within a few hundred iterations it finds that hopping is locally rewarding but has not yet discovered an efficient gait (Figure 3); by iteration 6,000 Spot walks cleanly and follows velocity commands in both forward and backward directions (Figure 4). The reward curve (Figure 5) shows steady convergence, reaching a mean episode reward near 380.




Nudging Spot Off Its Front Legs
With a quadruped locomotion policy as a starting point, the next step was to introduce pressure toward bipedal behavior.
The Primary Shaping Signal: Front Feet Contact Penalty
We added a single new reward term — front_feet_contact_penalty with weight −5.0 — that penalizes the robot for every timestep either front foot is in contact with the ground. Contact is detected by checking whether the net force magnitude on the front feet exceeds a 1 N threshold using the contact force history buffer.
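A minimal sketch of this term (the 1 N threshold and −5.0 weight are the values we used; the function shape is our simplification of Isaac Lab's contact-sensor machinery):

```python
def front_feet_contact_penalty(front_feet_forces_n, threshold_n=1.0, weight=-5.0):
    """Penalty on every timestep where either front foot is in contact.

    front_feet_forces_n: net contact-force magnitudes (N) for the two front
    feet, as read from the contact-force history buffer.
    """
    in_contact = any(f > threshold_n for f in front_feet_forces_n)
    return weight if in_contact else 0.0

print(front_feet_contact_penalty([0.0, 0.5]))  # 0.0 — both feet airborne
print(front_feet_contact_penalty([2.0, 0.0]))  # -5.0 — one foot touching
```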
A weight of −5.0 is large relative to the existing reward function. The base_linear_velocity tracking reward tops out around +5.0; the new penalty is large enough that keeping both front feet on the ground costs more than the entire velocity tracking reward. Figures 6 and 7 show the early response to this signal: Spot begins attempting to push itself upright but cannot sustain the posture.


Adjusting the Command Distribution
Alongside the penalty, we modified the training command distribution. In standard quadruped training, environments are sampled with a mix of velocity commands and standing commands. For hind legs training, we increased the standing fraction to 80% (rel_standing_envs=0.8) and tightened the velocity range to ±0.15 m/s lateral and ±0.15 rad/s yaw. Learning to balance on two legs from a near-stationary position is a prerequisite for locomotion in that posture. We also disabled the GaitReward term (weight set to 0.0), which enforces quadruped diagonal-pair timing. The diagonal-pair constraint is incompatible with bipedal locomotion.
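The resulting command distribution can be sketched as a standalone sampler (illustration only; Isaac Lab's command manager implements this internally with its own API):

```python
import random

def sample_hind_legs_command(rng):
    """Sample one (lateral m/s, yaw rad/s) command: 80% standing commands,
    otherwise uniform in the tightened hind-legs ranges."""
    if rng.random() < 0.8:            # rel_standing_envs = 0.8
        return (0.0, 0.0)             # standing command
    lat = rng.uniform(-0.15, 0.15)    # lateral velocity
    yaw = rng.uniform(-0.15, 0.15)    # yaw rate
    return (lat, yaw)

rng = random.Random(0)
commands = [sample_hind_legs_command(rng) for _ in range(10_000)]
standing_fraction = sum(c == (0.0, 0.0) for c in commands) / len(commands)
print(standing_fraction)  # close to 0.8
```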
Figure 8 compares the reward curves of the quadruped run and the first hind legs run. The mean reward for the hind legs run is substantially lower than the quadruped baseline, which reflects two things: the reward function was significantly changed (the GaitReward disabled, several weights reduced or zeroed), and with 80% of environments commanding Spot to stand rather than move, the velocity tracking rewards fire far less often.

The Reward Engineering Gauntlet
With the front feet penalty in place, Spot had learned to push itself briefly off the ground, but it could not convert that into a stable bipedal stance. It would push up, wobble, and fall back down, making no lasting progress toward standing on two legs. The challenge was understanding what was preventing the transition from brief upright moments to sustained balance, and adjusting the reward function accordingly.
Training Instability in Early Runs
The first major failure mode was not a clever exploit but straightforward instability. Rather than converging toward an upright posture, early runs produced a policy that repeatedly faceplanted: Spot would attempt to push its front feet off the ground, lose balance, and fall forward without making any lasting progress toward standing (Figure 9).

We do not have a definitive explanation for why this failure mode emerged. The most plausible interpretation is that the reward landscape, under the combination of heavy front feet penalty and various conflicting regularization terms, became sufficiently unstable that faceplanting minimized several simultaneous penalties more effectively than any posture involving extended upright behavior. The instability suggested two directions: add explicit safety constraints to penalize catastrophic falls, and audit the remaining reward terms for conflicts with the target posture.
Rewards That Were Fighting the Goal
Several of the existing regularization terms were actively penalizing the posture we were trying to achieve. Standing upright on two legs requires Spot to pitch its body backward by 50–80°, shift its center of mass over the hind feet, and hold the front legs elevated — all deviations from the default quadruped stance that the original reward function penalized.
The most significant conflicts:
base_orientation (weight=−3.0) penalized the base frame leaning away from gravity using the projected gravity vector. Pitching back to stand upright is exactly this motion. We relaxed this term to penalize only roll (lateral tipping), leaving pitch unconstrained.
base_motion (weight=−2.0) penalized vertical velocity and roll/pitch rates. Any dynamic behavior involving pushing off the ground — necessary to get upright — triggers this term. Weight reduced from −2.0 to −0.5.
air_time (weight=+5.0) rewarded quadruped gait timing using all four feet. With front feet permanently airborne, this term produced a misleading gradient. Weight set to 0.0.
air_time_variance (weight=−1.0) penalized asymmetric air time across feet. With front feet permanently airborne, this penalty was inescapable. Weight reduced to −0.5.
joint_pos (weight=−0.7) penalized deviation from the default quadruped joint configuration, with an additional 5× multiplier during standing. The front shoulder and elbow joints must deviate substantially from their defaults to keep the front feet elevated. Weight reduced to limit the influence of this term.
A reward function designed for quadruped locomotion encodes implicit assumptions about posture that are incompatible with bipedal locomotion. Each term had to be audited against the target behavior.
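For the terms whose change is a single scalar, the audit reduces to a small weight diff (base_orientation and joint_pos changed in shape and scale in ways a single number does not capture, so they are omitted here):

```python
# Quadruped baseline weight -> hind-legs weight, from the audit above.
weight_changes = {
    "base_motion":       (-2.0, -0.5),  # allow dynamic push-off
    "air_time":          (+5.0,  0.0),  # quadruped gait timing disabled
    "air_time_variance": (-1.0, -0.5),  # front feet are permanently airborne
}
for name, (before, after) in weight_changes.items():
    print(f"{name}: {before} -> {after}")
```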
Building the Pitch Upright Reward
Our original concern with using a strong front feet penalty alone was that it might cause Spot to avoid its front legs entirely and catastrophically faceplant — avoiding ground contact by pitching forward rather than by standing upright. To provide a positive gradient toward the desired posture, we experimented with a base_pitch_upright_reward: a piecewise function of body pitch angle θ that ramps from 0 to 1 between 0° and 50°, plateaus through 80°, then decays back to 0 and into a penalty beyond 90°. A height gate restricted the bonus to body heights between 0.5 m and 0.8 m to prevent exploits from crouched or tilted positions.
In the end, this reward was not needed. Once the safety constraints (described below) were in place, the front feet penalty alone was sufficient to drive the policy toward the target posture without requiring an explicit pitch signal. We reverted to the simpler design. The key insight was that the front feet penalty could safely be made aggressive as long as the safety penalties were strong enough to eliminate the catastrophic failure modes the pitch reward was originally meant to prevent.
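Although it was ultimately reverted, the piecewise shape is easy to reconstruct from the description above; the exact ramp slopes and the penalty slope beyond 90° are our assumptions:

```python
def base_pitch_upright_reward(pitch_deg, base_height_m):
    """Piecewise pitch-upright bonus (reverted in the final design).

    Ramps 0 -> 1 over 0-50 deg, plateaus through 80 deg, decays back to 0
    at 90 deg, and goes negative past vertical. Slopes are assumptions.
    """
    if not (0.5 <= base_height_m <= 0.8):
        return 0.0  # height gate: no bonus from crouched or tilted exploits
    if pitch_deg <= 0.0:
        return 0.0
    if pitch_deg < 50.0:
        return pitch_deg / 50.0                  # linear ramp up
    if pitch_deg <= 80.0:
        return 1.0                               # plateau
    if pitch_deg <= 90.0:
        return 1.0 - (pitch_deg - 80.0) / 10.0   # decay back to 0
    return -(pitch_deg - 90.0) / 10.0            # penalty past vertical

print(base_pitch_upright_reward(65.0, 0.65))  # 1.0 — in the plateau
print(base_pitch_upright_reward(65.0, 0.30))  # 0.0 — height gate withholds bonus
```

Note that the height gate withholds the bonus rather than penalizing; the penalty role is left to the safety terms.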
Adding Safety: Height and Pitch-Over Penalties
The safety terms described above are worth stating precisely, as they play a structural role distinct from the reward-shaping terms:
- A fall-over penalty triggers when body pitch exceeds 90°, applying a large negative reward. Without it, the optimizer can harvest pitch-reward transiently by tumbling through vertical.
- A body height penalty triggers when the base drops below 0.4 m. The height constraint prevents equilibria where Spot achieves the correct pitch angle from a collapsed or crouched position rather than a true standing one.
The safety terms do not shape the gait; they define a feasibility boundary that excludes failure states from the reward landscape.
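A sketch of the two safety terms with the thresholds from the text; the penalty magnitudes are placeholder values (the text specifies only that they are large):

```python
def safety_penalty(pitch_deg, base_height_m,
                   fall_weight=-100.0, height_weight=-50.0):
    """Feasibility-boundary penalties. Thresholds (90 deg, 0.4 m) are the
    values we used; the weights shown here are illustrative placeholders."""
    penalty = 0.0
    if pitch_deg > 90.0:        # fall-over: tumbled past vertical
        penalty += fall_weight
    if base_height_m < 0.4:     # collapsed or crouched posture
        penalty += height_weight
    return penalty

print(safety_penalty(95.0, 0.6))  # fall-over triggered
print(safety_penalty(45.0, 0.6))  # 0.0 — inside the feasible region
```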
Tuning Exploration to Escape Local Optima
After several runs where the policy converged to low-quality equilibria — standing briefly then collapsing without making further progress — we increased the exploration parameters of the PPO policy:
- entropy_coef from 0.0025 → 0.005: the entropy bonus discourages premature convergence of the policy distribution.
- init_noise_std from 1.0 → 2.0: the initial standard deviation of the Gaussian action distribution. Higher values produce more diverse early actions, increasing the probability of discovering useful behaviors.
The exploration parameter changes were applied via warm-starting from the best available checkpoint of the preceding run. Warm-starting preserved existing knowledge of how to push upright; increased exploration provided a better chance of discovering how to sustain and locomote in that posture.
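The two knobs interact: init_noise_std widens the initial Gaussian action distribution, while entropy_coef weights an entropy bonus that discourages it from collapsing. For a one-dimensional Gaussian head, the effect of the std change is easy to quantify:

```python
import math

def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

# Doubling init_noise_std from 1.0 to 2.0 adds ln(2) ≈ 0.69 nats of entropy
# per action dimension — a measurably broader early search.
delta = gaussian_entropy(2.0) - gaussian_entropy(1.0)
print(delta)  # ≈ 0.693
```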
Standing Behavior Emerges
With the full set of changes in place — safety penalties bounding the failure states, the conflicting regularization terms relaxed or removed, and increased exploration via warm-starting from the best available checkpoint — Spot finally began learning to stand and move on its hind legs. The reward engineering work from the previous section had eliminated most of the obstacles: the policy was no longer penalized for pitching backward, the front feet penalty was high enough to consistently drive lift, and catastrophic falls now incurred a strong negative signal that prevented the optimizer from dwelling in failure modes.
What 38,000 PPO Iterations Look Like
Figures 10–12 show the progression from the first sustained standing behavior through the final policy.



The final policy was produced across four training runs: the initial quadruped run (~6,000 iterations), two hind legs runs (~5,000 and ~23,000 iterations), and a final warm-started run that brought the cumulative total to 38,000 iterations. Each run was warm-started from a checkpoint of the previous run where the training had stabilized. For the third run, we reduced the standing fraction to 50% (rel_standing_envs=0.5) and increased the velocity ranges to ±0.5 m/s lateral and ±0.2 rad/s yaw. For the fourth run, we kept 50% standing but expanded the velocity range further to ±1.0 m/s and ±0.5 rad/s yaw to encourage more aggressive bipedal locomotion.
The goal was to have Spot stand and move on its hind legs with the front legs elevated — and the policy achieves exactly that. Spot bounces on both hind legs as it locomotes, with compensatory movements from the front legs to maintain balance.
Reading the Reward Curves
Figures 13 and 14 show the reward signals over the full training history.


The front_feet_contact penalty started near −1.5 per step at the beginning of hind legs training and converged to approximately −0.16, indicating the policy keeps the front feet off the ground most of the time with only occasional brief contacts. The action_smoothness penalty improved from −0.9 to −0.45, reflecting convergence toward smoother joint commands. The pitch-over graph (Figure 14) is particularly informative: the monotonic decrease in fall-over penalty signal indicates the policy is learning to recover from near-falls rather than simply avoiding the posture.
Deploying on the Real Spot
We deployed the 38,000-iteration checkpoint on physical Spot hardware. The video below shows three consecutive trials, each initiated by a velocity command issued through a PlayStation controller. Each trial demonstrates the bipedal behavior: Spot rises onto its hind legs and hops in the commanded direction, maintaining the upright posture throughout.
Transferring a policy trained in simulation to real hardware is non-trivial. The sim-to-real gap — the performance degradation that arises when a policy meets real hardware — comes from sources the simulator cannot faithfully replicate: sensor noise, actuator delay, joint backlash, and contact dynamics. Policies that look good in simulation fail on hardware when they have implicitly relied on simulator-specific behaviors.
The Isaac Lab training setup targets the sim-to-real gap through two main mechanisms. Domain randomization applies stochastic variation to physical parameters at each episode reset, so the policy must generalize across a distribution of worlds rather than memorize one fixed simulation. The specific parameters randomized during training include terrain roughness, rigid-body surface material properties, base and body mass, and externally applied force/torque disturbances on the base. Randomized reset states ensure the policy never depends on a canonical starting configuration.
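A per-reset randomization draw can be sketched as follows; the parameter list mirrors the text, but the ranges are illustrative placeholders rather than our training values:

```python
import random

def sample_domain_params(rng):
    """One episode-reset draw of randomized physical parameters.

    Ranges are placeholders for illustration; Isaac Lab's event manager
    applies the actual randomization inside the simulator.
    """
    return {
        "friction": rng.uniform(0.4, 1.2),       # surface material property
        "mass_scale": rng.uniform(0.9, 1.1),     # base/body mass multiplier
        "push_force_n": rng.uniform(0.0, 50.0),  # external disturbance on base
    }

rng = random.Random(0)
print(sample_domain_params(rng))  # a fresh "world" for each episode
```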
The transfer was successful (see Video 1). Spot maintained bipedal posture on real hardware and demonstrated the same hind-leg locomotion seen in simulation. The real-world behavior was stable overall, with two minor differences from the simulated counterpart: the hind legs could not lift the robot quite as high as in simulation, and Spot would occasionally touch down with its front legs before recovering to the upright posture. Neither difference limited the core behavior.
References
- NVIDIA Isaac Lab — GPU-accelerated robotics simulation framework used for training
- RSL-RL — PPO library from ETH Zürich’s Robotic Systems Lab for legged locomotion
- YouTube — Real-World Demo — Three deployment trials on physical Spot hardware