Grievous: A General-Purpose Household Robot

Low-Cost Bimanual Mobile Manipulation with Vision-Language-Action Models

Grievous — a modified version of the low-cost household robot XLeRobot.

We recently built Grievous, a low-cost mobile manipulation platform: a modified XLeRobot with onboard SO-101 leader arms for fast, in-place teleoperation. The key idea, inspired by Mobile ALOHA, is a practical deployment loop: the robot attempts tasks autonomously using a finetuned Vision-Language-Action (VLA) model, and when it fails or needs guidance, the user can immediately take over with the leader arms and provide a corrective demonstration. Over time, the robot accumulates in-distribution experience in its own home environment and should require progressively less intervention. This post details our current progress.

Platform Overview

Grievous is built on the XLeRobot hardware stack, which pairs two 5-DOF SO-101 follower arms (each with a 1-DOF gripper) with a wheeled mobile base. The software stack runs on Hugging Face’s LeRobot library, which provides pretrained VLA models, standardized dataset tooling, and training and deployment pipelines for real-world manipulation.

Grievous modifies XLeRobot in the following ways:

  • The follower arms are moved to the front of the cart, giving them better reach into the workspace directly in front of the robot.
  • Two SO-101 leader arms are added onboard, mounted in front of the follower arms in a mirrored configuration. The operator faces the robot and controls it directly, which requires a short adjustment period but proved intuitive in practice. Mounting the leader arms onboard removes the need to locate a VR headset or set up a remote laptop session: when the robot needs a demonstration, the user can intervene immediately. Evidence from prior work suggests that, in the contexts studied, physical leader-arm teleoperation tends to yield higher-quality demonstrations than remote interfaces (Bi et al., 2025).
  • The Intel RealSense head camera is moved to the front, though we later found that mounting it at the back gives a better view of the workspace.
  • Two additional SO-101 arms are mounted at the bottom of the cart (not pictured). Since the cart cannot adjust its torso height, the lower arms allow Grievous to reach objects on the ground and at lower heights.

On the software side, the new embodiment required custom changes to the LeRobot-based control stack, including a grievous_host.py process running on the onboard Raspberry Pi 5. We use a 30 Hz recording frequency and a ZMQ-based inference pipeline that offloads VLA policy inference to a remote laptop.
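The observation-to-action round-trip of this pipeline can be sketched with ZMQ's REQ/REP pattern. This is a minimal illustration, not the actual grievous_host.py implementation: the transport address, the pickle payload format, and the stand-in policy are all assumptions, and a real deployment would stream camera frames and handle timeouts.

```python
import pickle
import threading

import zmq


def serve_one(sock, policy):
    """Laptop side: receive one observation, reply with an action chunk."""
    obs = pickle.loads(sock.recv())
    sock.send(pickle.dumps(policy(obs)))


# Stand-in for the finetuned VLA policy: a chunk containing one 12-D action
# (six joint values per upper arm); the real policy runs a SmolVLA forward pass.
dummy_policy = lambda obs: [[0.0] * 12]

ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind("inproc://policy")  # tcp://<laptop-ip>:<port> in a real deployment
server = threading.Thread(target=serve_one, args=(rep, dummy_policy))
server.start()

# Robot side: ship the latest observation, block until the actions arrive.
req = ctx.socket(zmq.REQ)
req.connect("inproc://policy")
req.send(pickle.dumps({"joints": [0.0] * 12}))
actions = pickle.loads(req.recv())
server.join()
req.close()
rep.close()
ctx.term()
```

In practice the host would run this request loop at the control rate, executing each returned action chunk while the next observation is in flight.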

SmolVLA Finetuning

Teleoperation and Data Collection

Demonstrations are collected using the onboard leader arms in a leader-follower setup. The operator faces the robot and directly controls the follower arms, with each leader arm mirroring the corresponding follower. Video 1 shows the teleoperation in action.

Video 1: Teleoperation using the onboard leader arms.
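At the control level, this leader-follower mirroring is a position-copy loop running at the recording rate. The sketch below assumes hypothetical read/log callables standing in for LeRobot's motor-bus reads and dataset writer, and an illustrative sign convention for the mirrored mount:

```python
import time

CONTROL_HZ = 30  # matches the 30 Hz recording frequency


def mirror(leader_q, signs=(1, 1, 1, 1, 1, 1)):
    """Map one leader arm's readings (5 joints + gripper) to follower
    targets; signs would encode any flips from the mirrored mounting."""
    return [s * q for s, q in zip(signs, leader_q)]


def teleop_loop(read_leader, write_follower, log_frame, steps):
    """Copy leader positions to the follower and log each frame."""
    period = 1.0 / CONTROL_HZ
    for _ in range(steps):
        t0 = time.perf_counter()
        targets = mirror(read_leader())  # each arm is handled the same way
        write_follower(targets)
        log_frame(targets)  # one frame appended to the demonstration dataset
        time.sleep(max(0.0, period - (time.perf_counter() - t0)))


# Toy run with fake I/O to show the data flow through the loop.
frames = []
teleop_loop(lambda: [0.1] * 6, lambda q: None, frames.append, steps=3)
```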

For the pick-and-place task — picking up a small object and placing it in a box — we initially collected approximately 50 episodes. We later expanded the dataset to around 200 to improve coverage.

Training

SmolVLA is a compact Vision-Language-Action model from Hugging Face, pretrained exclusively on data collected from SO-101 arms and shared on the Hugging Face Hub, making it an ideal VLA base for this platform. Crucially, at 450M parameters, it is small enough to finetune in a few hours on a laptop with an RTX 4090, making it well-suited for fast iteration on simple household tasks. We finetune SmolVLA on the pick-and-place demonstration dataset.
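Under LeRobot's SmolVLA finetuning recipe, the training invocation looks roughly like the following; the dataset repo id, step count, and output path are placeholders, and exact flag names vary across LeRobot versions:

```shell
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=${HF_USER}/grievous-pick-place \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_pick_place \
  --policy.device=cuda
```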

Current Progress

Results

After finetuning, Grievous can successfully pick up a purple hexagonal prism and place it in a box. Video 2 shows three consecutive trials.

Video 2: Pick-and-place trials with SmolVLA.

SmolVLA can also perform other short-horizon tasks after finetuning. Video 3 shows the robot picking up a duster and sweeping the table, a task for which we collected approximately 100 demonstration episodes.

Video 3: SmolVLA sweeps the table with a duster.

Challenges

Despite these successes, the finetuned SmolVLA policy also has limitations:

  • SmolVLA was mostly trained on single-arm task demonstrations. When deployed in a bimanual setup, it often prefers the arm whose joint-angle commands occupy the first six elements of the policy’s action vector, and it frequently performs significantly worse with the secondary arm.
  • SmolVLA often ignores parts of the task instruction, such as the color or shape of a specific object. For example, in Video 4, SmolVLA is tasked with picking up the green object but instead picks up the purple object directly in front of its right arm. This limitation is likely due to SmolVLA’s relatively small model size, which may not be expressive enough compared to state-of-the-art policies such as π₀.₅. Both challenges, arm preference and instruction following, will likely improve with a larger model trained on bimanual demonstration data.
Video 4: Policy picks up the purple object when instructed to pick up the green one.
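The arm-preference issue above comes down to how the flat action vector is laid out. Assuming a 12-D convention of six joint values per arm with the favored arm first (this layout is our assumption for illustration, not a documented SmolVLA contract), the per-arm split looks like:

```python
import numpy as np


def split_bimanual(action, dof_per_arm=6):
    """Split a flat policy action into (primary, secondary) arm commands.
    The first block of joint values belongs to the arm the policy favors."""
    action = np.asarray(action)
    assert action.shape[-1] == 2 * dof_per_arm, "unexpected action width"
    return action[..., :dof_per_arm], action[..., dof_per_arm:]


# Example: a 12-D action splits into two 6-DOF commands.
primary, secondary = split_bimanual(np.arange(12.0))
```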

Next Steps

The next step is to push Grievous into a real household environment and iteratively test simple cleaning tasks, such as the dusting shown above, on various surfaces across the apartment. The goal is to reduce the number of user interventions day by day, with the user only stepping in via the leader arms when the robot and external verifiers cannot recover on their own, and the resulting demonstrations feeding back into the next training iteration.
