With Xingxing Wang Among the Authors, Unitree's Robots Evolve After the Spring Festival Gala: A Single Policy Learns a Variety of Extreme Motions
Editor|Panda
At the Spring Festival Gala, the martial arts performance "Wu BOT" by Unitree's robots left a deep impression. During the performance, the humanoid robots G1 and H2, while running at high speed, completed interweaving formation changes and martial arts movements, demonstrating high-dynamic, high-coordination, fully autonomous swarm control technology.
Now, a new study by institutions including Beijing Institute for General Artificial Intelligence (BIGAI), Unitree, Shanghai Jiao Tong University, and the University of Science and Technology of China has advanced further in this direction, proposing OmniXtreme: the first generalist policy that can perform a variety of extreme motions, including consecutive flips, extreme balancing, and even breakdancing through rapid contact switching.
The process of achieving this capability involves first pretraining a flow-based generative control policy, followed by post-training with actuation-aware residual reinforcement learning for complex physical dynamics. This post-training step is crucial for successful real-world transfer.
Siyuan Huang, a corresponding author of the project and a research scientist at BIGAI, stated on X: "We spent a full year digging into the barriers between generalizable tracking and extreme physical behaviors. After testing dozens of G1 robots, we finally identified the bottlenecks in both learning and physical execution."
It is worth noting that Xingxing Wang, co-founder and CEO of Unitree, is also among the authors of the paper. The co-first authors are Yunshen Wang and Shaohang Zhu.
Paper Link: https://arxiv.org/abs/2602.23843
Project Page: https://extreme-humanoid.github.io
Code Repository: https://github.com/Perkins729/OmniXtreme
Method: Breaking the Generality Barrier in High-Dynamic Control
In the field of humanoid robot motion control, researchers have long faced a dilemma known as the "generality barrier." As the scale and diversity of motion libraries increase, traditional unified reinforcement learning policies often suffer performance collapse, which is particularly evident in the physical deployment of high-dynamic motions. This collapse stems from two compounding bottlenecks: the learning bottleneck in simulation (gradient interference in multi-motion optimization) and the physical execution bottleneck (complex actuation constraints in the real world).
To fundamentally address this issue, the research team proposed the OmniXtreme framework. This framework cleverly decouples motion skill learning from physics-driven fine-tuning, dividing it into two core phases: "flow-based scalable pretraining" and "actuation-aware residual post-training."
Stage One: Flow-Based Scalable Pretraining
In the first stage, the research team aimed to endow the model with extremely high representational capacity, enabling it to master a large number of heterogeneous extreme motions while avoiding the common conservative averaging tendency in traditional multi-motion reinforcement learning.
Researchers first integrated multiple high-quality motion datasets such as LAFAN1, AMASS, and MimicKit, and retargeted them onto the Unitree G1 humanoid robot. For these reference motions, the team trained a series of expert policies using the PPO algorithm. Subsequently, OmniXtreme employed Dataset Aggregation (DAgger)-based knowledge distillation to unify the behaviors of these expert policies into a single flow-matching-based generative policy.
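The expert-to-generalist distillation above can be sketched as a DAgger loop: roll out the student, relabel the states it visits with the expert, aggregate the data, and refit. The linear expert, linear student, and toy dynamics below are illustrative stand-ins only, not the paper's PPO experts or flow-matching policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(obs):
    # Stand-in for a trained PPO expert (hypothetical linear map).
    return 0.5 * obs

class Student:
    # Toy stand-in for the flow-matching student; here a linear regressor.
    def __init__(self, dim):
        self.W = np.zeros((dim, dim))
    def act(self, obs):
        return obs @ self.W
    def fit(self, X, Y):
        # Least-squares refit on the aggregated dataset.
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def dagger(env_reset, env_step, student, iters=5, horizon=50):
    X, Y = [], []
    for _ in range(iters):
        obs = env_reset()
        for _ in range(horizon):
            X.append(obs)                      # states visited by the student
            Y.append(expert_action(obs))       # relabeled by the expert
            obs = env_step(obs, student.act(obs))
        student.fit(np.array(X), np.array(Y))  # retrain on the aggregate
    return student

# Toy dynamics: the next observation drifts toward the action taken.
env_reset = lambda: rng.normal(size=3)
env_step = lambda obs, act: 0.9 * obs + 0.1 * act + 0.01 * rng.normal(size=3)

student = dagger(env_reset, env_step, Student(3))
```

Because the expert here is exactly linear, the student recovers it; in the paper the same loop instead supervises a generative flow policy on states the student itself reaches, which avoids compounding distribution shift.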
Mathematically, the flow-based model learns to recover expert actions from pure noise by optimizing the following objective function:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,\epsilon,\,a_{\mathrm{expert}}}\left[\left\| v_\theta(a_t, t, o) - (\epsilon - a_{\mathrm{expert}}) \right\|^2\right]$$

In the above formula, a_t represents the interpolated action between the expert action a_{expert} and random noise ε at flow time step t. This objective function allows the model to learn a velocity field v_θ, enabling the generation of high-precision continuous control actions through forward Euler integration during inference. To ensure physical stability, the team introduced only moderate noise and domain randomization at this stage, ensuring the policy accurately captures the underlying physical dynamics.
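A minimal NumPy sketch of this objective and the Euler-integration sampler follows. It assumes the standard linear interpolation schedule a_t = (1 − t)·a_expert + t·ε, under which the target velocity is exactly ε − a_expert; the paper's precise schedule and network are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_theta, obs, a_expert):
    # Sample a flow time t and noise eps, interpolate the action, and
    # regress the velocity field onto the (eps - a_expert) target.
    t = rng.uniform(size=(a_expert.shape[0], 1))
    eps = rng.normal(size=a_expert.shape)
    a_t = (1.0 - t) * a_expert + t * eps        # interpolated action (assumed schedule)
    target = eps - a_expert
    pred = v_theta(a_t, t, obs)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

def euler_sample(v_theta, obs, act_dim, steps=10):
    # Inference: integrate backward from pure noise (t=1) to an
    # expert-like action (t=0) along the learned velocity field.
    a = rng.normal(size=(obs.shape[0], act_dim))
    dt = 1.0 / steps
    for k in range(steps, 0, -1):
        t = np.full((obs.shape[0], 1), k * dt)
        a = a - dt * v_theta(a, t, obs)
    return a
```

With the linear schedule, the ideal velocity field for a fixed target action a* is (a − a*)/t, and the backward Euler loop recovers a* exactly from any noise sample.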
Stage Two: Actuation-Aware Post-Training
Although the pretrained flow-matching policy shows impressive tracking accuracy in simulation, the nonlinear characteristics of motors in the real world often cause this high-dynamic performance to fall short. To achieve smooth "sim-to-real" transfer, the team froze the pretrained base policy and trained a lightweight MLP residual policy on top of it. This residual policy does not need to relearn motion tracking; its primary function is to output corrective actions to counteract real hardware constraints.
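The frozen-base-plus-residual structure can be sketched as below. Both callables and the residual scale are hypothetical stand-ins; in the paper the base is the flow-matching policy and the residual is a small MLP trained with RL, updated while the base stays frozen.

```python
import numpy as np

class ResidualController:
    """Frozen pretrained base policy plus a small trainable residual.

    `base_policy` and `residual_mlp` are illustrative callables standing in
    for the flow-matching policy and the post-trained correction network.
    """
    def __init__(self, base_policy, residual_mlp, residual_scale=0.1):
        self.base = base_policy          # frozen: never updated in post-training
        self.residual = residual_mlp     # trained against hardware effects
        self.scale = residual_scale      # keeps corrections small and safe
    def __call__(self, obs):
        a_base = self.base(obs)
        # The residual sees both the observation and the base action
        # it is correcting, and only outputs a scaled delta.
        return a_base + self.scale * self.residual(np.concatenate([obs, a_base]))
```

Keeping the correction additive and small means the residual never needs to relearn the motion itself, only to cancel the gap between idealized simulation and real actuation.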
To make the residual policy truly understand the harshness of the physical world, the team introduced three levels of deep modeling in the training environment:
Aggressive Domain Randomization: Researchers significantly widened the ranges of common domain randomization parameters, such as initial-posture noise, external-push magnitude, and angular-velocity noise, by up to 50%. More critically, they relaxed termination thresholds by a factor of 1.5 (e.g., widening the torso orientation error tolerance from 0.8 radians to 1.2 radians). This design gives the residual policy ample exploration room, allowing it to learn extreme recovery from large deviations and greatly enhancing the system's robustness.
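The widening scheme can be sketched as a transformation over nominal ranges. The 1.5x termination relaxation and the 0.8 → 1.2 rad torso tolerance come from the text; the nominal randomization values themselves are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nominal randomization ranges (values are illustrative, not from the paper).
BASE_RANGES = {
    "init_pose_noise": (0.0, 0.05),   # rad
    "push_force": (0.0, 50.0),        # N
    "ang_vel_noise": (0.0, 0.2),      # rad/s
}
BASE_TERMINATION = {"torso_orientation_err": 0.8}  # rad, per the paper

def widen(ranges, terminations, range_gain=1.5, term_gain=1.5):
    # Widen each randomization upper bound by up to 50% and relax
    # termination thresholds by 1.5x, as in the post-training stage.
    wide_ranges = {k: (lo, hi * range_gain) for k, (lo, hi) in ranges.items()}
    wide_terms = {k: v * term_gain for k, v in terminations.items()}
    return wide_ranges, wide_terms

def sample_episode_params(ranges):
    # Draw one set of physics parameters per training episode.
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
```

Relaxing termination rather than tightening it is the key choice: episodes survive larger deviations, so the residual policy actually observes, and learns to recover from, near-failure states.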
Power-Safety Actuation Regularization: When performing high-dynamic motions like backflips, robots generate enormous transient braking loads. Conventional reinforcement learning pipelines often lack constraints on such loads, making them prone to triggering overcurrent protection or thermal stress shutdowns on real hardware. OmniXtreme innovatively introduces a penalty mechanism targeting mechanical power, with its core being the calculation of the product of joint torque and angular velocity, i.e., instantaneous mechanical power P = τ ・ ω. For high-magnitude negative power (regenerative braking) exceeding a safe dead zone, the team applies a strict quadratic penalty function:
$$\mathcal{L}_{\mathrm{neg\text{-}power}} = \sum_{j \in \mathcal{J}} \Big( K \max(-P_j - P_{\mathrm{db}},\, 0) \Big)^2$$

In practice, this penalty is primarily applied to the knee joints, as they are most susceptible to destructive braking loads during impact and recovery phases.
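This penalty is straightforward to compute per joint. The dead-zone and gain values below are illustrative assumptions, not the paper's tuned constants.

```python
import numpy as np

def neg_power_penalty(tau, omega, p_db=150.0, k=0.01, joints=None):
    """Quadratic penalty on regenerative-braking power beyond a dead zone.

    tau, omega : per-joint torque (N*m) and angular velocity (rad/s).
    p_db, k    : dead-zone width and gain (illustrative, not from the paper).
    joints     : indices to penalize (e.g. the knees); None = all joints.
    """
    p = tau * omega                          # instantaneous mechanical power P = tau*omega
    if joints is not None:
        p = p[joints]
    # Only large NEGATIVE power (braking) beyond the dead zone is penalized;
    # positive (motoring) power passes through unpunished.
    excess = np.maximum(-p - p_db, 0.0)
    return np.sum((k * excess) ** 2)
```

For example, a joint braking at P = −300 W with a 150 W dead zone incurs a penalty of (0.01 × 150)² = 2.25, while a joint delivering +100 W incurs none.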
Actuation-Aware Torque and Speed Constraints: Simple torque clipping often overlooks speed-dependent physical limits caused by back-EMF. The team integrated the real motor operating envelope directly into the simulator, defining an allowable torque function that monotonically decreases with the magnitude of joint speed. Furthermore, the system models internal losses at the actuator level through a nonlinear friction term:
$$\tau_{\mathrm{applied}} = \tau_{\mathrm{clipped}} - \left( \mu_s \tanh\!\left(\frac{v}{v_{\mathrm{act}}}\right) + \mu_d\, v \right)$$

This formula accurately captures the smooth transition from static to dynamic friction and adds speed-dependent dissipative damping.
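A sketch of both actuation-aware steps follows: a torque envelope that shrinks with joint speed (a simple linear back-EMF model is assumed here), then the tanh friction term. All motor parameters are illustrative placeholders, not the G1's real constants.

```python
import numpy as np

def torque_envelope(tau_cmd, omega, tau_max=90.0, omega_max=30.0):
    # Allowable torque decreases monotonically with |omega|, approximating
    # the back-EMF limit; a linear roll-off is assumed for illustration.
    limit = tau_max * np.clip(1.0 - np.abs(omega) / omega_max, 0.0, 1.0)
    return np.clip(tau_cmd, -limit, limit)

def apply_friction(tau_clipped, v, mu_s=1.5, mu_d=0.05, v_act=0.1):
    # Smooth static-to-dynamic friction (tanh) plus viscous damping,
    # following the loss model in the text (parameter values assumed).
    return tau_clipped - (mu_s * np.tanh(v / v_act) + mu_d * v)
```

At standstill the full torque budget is available and friction subtracts nothing; at half the speed limit only half the torque can be commanded, which is exactly the regime naive torque clipping gets wrong.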
Purely Onboard Real-time Deployment
In terms of hardware deployment, OmniXtreme demonstrates exceptional engineering completeness. The entire inference pipeline (including forward kinematics-based state estimation, the flow-matching base policy, and the residual policy) is heavily optimized with TensorRT. On the onboard NVIDIA Jetson Orin NX platform of the Unitree G1 humanoid robot, the system achieves an end-to-end inference latency of about 10 ms, comfortably supporting 50 Hz closed-loop control.
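A 50 Hz loop with ~10 ms inference leaves roughly half the 20 ms period as slack. The fixed-rate scheduling logic can be sketched as below; the timing structure is a generic pattern, not Unitree's actual deployment code.

```python
import time

def control_loop(policy_step, rate_hz=50.0, duration_s=0.1):
    """Fixed-rate closed-loop control sketch (scheduling logic assumed).

    policy_step must finish well under the 20 ms period; the reported
    ~10 ms end-to-end inference leaves comfortable headroom.
    """
    period = 1.0 / rate_hz
    deadline_misses = 0
    t_next = time.perf_counter()
    for _ in range(int(duration_s * rate_hz)):
        policy_step()                    # state estimation + policy inference
        t_next += period                 # absolute schedule resists drift
        slack = t_next - time.perf_counter()
        if slack > 0:
            time.sleep(slack)            # wait out the rest of the period
        else:
            deadline_misses += 1         # inference overran the control period
    return deadline_misses
```

Tracking an absolute schedule (`t_next += period`) rather than sleeping a fixed amount each cycle keeps the average rate locked at 50 Hz even when individual inferences jitter.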
Experimental Performance: Comprehensive Extreme Testing
To comprehensively evaluate the scalability and robustness of OmniXtreme, the research team not only used the standard LAFAN1 motion library but also meticulously selected about 60 highly challenging motions to construct the XtremeMotion evaluation set. These motions include extremely high angular velocities, frequent contact switching, and stringent timing constraints.
Scalable High-Fidelity Tracking Capability
In simulation, OmniXtreme was directly compared with the traditional "from-scratch multi-motion RL" baseline and the "specialist-to-unified MLP distillation" baseline. The data shows that OmniXtreme outperforms on all metrics. Faced with the significantly more difficult XtremeMotion dataset, the tracking error of traditional methods increased notably, while OmniXtreme maintained very low kinematic errors and high success rates.
On the real-world Unitree G1 robot, the team selected 24 different high-dynamic motions from XtremeMotion for 157 physical tests. The tests covered multiple categories including backflips, acrobatics, breakdancing, and martial arts.
Finally, OmniXtreme achieved an overall average success rate of 91.08%. Among these, flip-type motions reached a success rate of 96.36%, martial arts motions 93.33%, and breakdancing motions a still-high 86.36%. This shows that the high-fidelity performance in simulation successfully crossed the reality gap. Demonstrated examples include the Thomas flare, helicopter, forward crawl, backflip, breakdancing, and martial arts motions.
Breaking the Fidelity-Scalability Trade-off
To verify whether the system broke the generality barrier, the team designed a progressive stress test. They gradually expanded the training motion set from 10 to 20, and finally to 50 motions, and evaluated all policies on a fixed set of the first 10 motions.
The experimental results revealed significant differences. As motion diversity increased, the traditional from-scratch reinforcement learning baseline suffered severe performance degradation, its success rate falling from 100% to 83.3% and finally to 73.9%. In contrast, OmniXtreme demonstrated remarkable resilience: even with a training set of 50 motions, its tracking success rate on the core motions remained at 93.3%. This challenges the conventional assumption that fidelity must collapse as motion diversity grows.
Scaling Law of Model Size
In the development of artificial intelligence, increasing model parameters often leads to performance leaps, but this trend seemed to fail in the traditional motion control field. The team compared the performance of models with different parameter scales (20M, 50M, 70M).
The chart data clearly shows that traditional MLP policies saturate quickly as parameters scale up, with only marginal gains in tracking accuracy. In stark contrast, the flow-matching-based generative policy follows a clear scaling trend: as parameters increased toward 70M, OmniXtreme's tracking accuracy and robustness improved significantly and steadily. This suggests that generative pretraining offers a viable scaling path for humanoid robot control systems.
Deep Ablation of Real-World Executability
What mechanisms endow the robot with such strong physical robustness? The team provided answers through ablation experiments.
For highly explosive tumbling motions (like backflips), simply introducing motor constraints is sufficient to ensure stable execution, as this avoids instantaneous collapse of the underlying hardware limits. However, for breakdancing motions involving high-frequency contact switching, the system must rely simultaneously on motor constraints and aggressive domain randomization to maintain timing-sensitive balance amid contact disturbances.
The most stringent challenge comes from acrobatic landing motions involving high-speed impact absorption. The team found that without the power-safety regularization mechanism, even when the model maintained postural balance, runs would fail due to overcurrent or battery undervoltage caused by transient motor braking. This demonstrates that extreme agility in the real world must be built on precise modeling of physical constraints across the mechanical, electrical, and thermal dimensions.