ShengShu Technology Unveils World Action Model “Motubrain”: One Brain, Infinite Possibilities for Robotic Intelligence
From understanding and generating the world to taking action, Motubrain tops two global benchmarks and redefines the embodied AI landscape
ShengShu Technology announces Motubrain, a World Action Model that replaces multiple task-specific systems with a single, unified model that functions as a robotic brain for the physical world. Ranking highly on both WorldArena and RoboTwin 2.0, two of the field’s most rigorous benchmarks for embodied world models, Motubrain marks a decisive shift in an industry where robotic systems are typically assembled from narrow, task-specific components.
Best known for its leading video model Vidu, ShengShu Technology has marked an industry first with its advancements in generative video for robotics. Generative video has laid the foundation for simulating robots in real-world environments at scale. Motubrain builds on this by turning those simulations into action, enabling robots to learn from diverse, large-scale pre-training data while reducing reliance on traditional physical data collection.
“A true world model must be able to build a unified representation of the real world and predict how it evolves,” said Jun Zhu, Founder of ShengShu Technology. “Video is a critical foundation of that intelligence because it naturally captures time, space, motion, causality, and physical dynamics at scale. We believe general world models should not be built as stitched-together modules, but as a unified architecture that brings together perception, reasoning, prediction, generation, and action in a single system. That is what can ultimately bridge the digital world and the physical world.”
Global Rankings: Among the Top Performers in Embodied AI
Motubrain has delivered top-tier performance on leading embodied AI benchmarks. Ranked among the industry’s best models for robotic perception, anticipation, and planning in the physical world, Motubrain achieved a 63.77 EWM Score on WorldArena. It has also been recognized as one of the strongest performers on RoboTwin 2.0, scoring an average of 96.0 across 50 predetermined tasks, and remains the only model to exceed 95.0 in randomized environments.
The Architecture Behind the Breakthrough
Motubrain’s core breakthrough is unifying the “seen world” and the “actions to take” within a single model, and it is built on four core principles that together redefine what an embodied AI model for training robots can be:
- One Brain, Many Skills: A unified model that can handle a wide range of tasks and gets smarter and stronger as task variety increases. Training each skill one by one is no longer required; unlike conventional models, Motubrain’s success rate and multi-tasking reliability increase as the range of complex tasks it handles at once widens.
- One Brain, Universal Across Robots: Motubrain isn’t built for a single robot model. It’s designed to be a universal brain that can power many kinds of robots. This breaks the old “one robot, one model” pattern. And as more robot types, real‑world scenarios, and data join the ecosystem, Motubrain keeps getting smarter, which in turn helps every robot in the network perform better.
- One Brain, End-to-End: Motubrain learns entire task sequences directly. It can handle complex, multi-step tasks involving up to 10 atomic actions, the smallest units of movement in robotics, far beyond the typical 2–3. So the robot no longer sees isolated actions; it sees a complete, meaningful task from start to finish.
- One Brain, Able to Anticipate: Predicts the world while driving action. Environmental change, task progression, and execution are processed together inside one model, not assembled from separate subsystems.
To deliver this, Motubrain is built on a Unified Multimodal Model that treats video and action as two continuous modalities to be learned together. A single training run gives it five capabilities at once: vision-language-action control (VLA), world modelling, video generation, inverse dynamics modelling (IDM), and joint video-action prediction. A three-stream Mixture-of-Transformers (MoT) then brings video, action, and language together by drawing on the strengths of existing pretrained models, enabling Motubrain to understand environments, follow language instructions, predict what happens next, and generate actions all at the same time. Unlike systems that chain together separate perception, planning, and control modules, Motubrain processes the full perception-to-action loop within one model.
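The announcement does not publish Motubrain’s internals, but the general idea behind a multi-stream Mixture-of-Transformers can be sketched: each modality keeps its own parameters (a “stream”), while attention runs jointly over the concatenated video, action, and language tokens so every modality can attend to every other. The class and parameter names below are illustrative, not ShengShu’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamExpert:
    """Modality-specific Q/K/V projections (one 'stream' of the MoT)."""
    def __init__(self, d):
        self.wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.wv = rng.standard_normal((d, d)) / np.sqrt(d)

class ThreeStreamMoT:
    """Separate parameters per modality, joint attention across all tokens."""
    def __init__(self, d=16):
        self.d = d
        self.streams = {m: StreamExpert(d) for m in ("video", "action", "language")}

    def forward(self, tokens):
        # tokens: dict mapping modality name -> (n_tokens, d) array
        qs, ks, vs = [], [], []
        for m, x in tokens.items():
            s = self.streams[m]          # each modality uses its own weights
            qs.append(x @ s.wq)
            ks.append(x @ s.wk)
            vs.append(x @ s.wv)
        q, k, v = np.vstack(qs), np.vstack(ks), np.vstack(vs)
        attn = softmax(q @ k.T / np.sqrt(self.d))  # joint attention across modalities
        out = attn @ v
        # split the joint output back into per-modality blocks
        result, i = {}, 0
        for m, x in tokens.items():
            result[m] = out[i:i + len(x)]
            i += len(x)
        return result
```

In this sketch the joint attention matrix is what lets, say, an action token condition on both the video context and the language instruction in a single pass, rather than routing through separate perception and planning modules.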
Motubrain learns from a far broader range of data than conventional AI models that train robots, including unlabelled video, task recordings without language annotations, and data from different robot embodiments. A proprietary latent action framework extracts physical motion directly from large-scale video, including human footage, simulation data, and multi-robot task trajectories, without requiring the data to be labelled or tagged to indicate specific actions. This broader learning paradigm translates into strong scaling behavior. In task-scaling evaluations, Motubrain’s average success rate continued to rise as the number of training tasks increased, reaching approximately 92% at 50 tasks, while Pi-0.5 declined to roughly 68% over the same range. In data-scaling evaluations, Motubrain also maintained a clear advantage as the number of training episodes increased, achieving about 92% average success at 27,500 episodes, compared with roughly 85% for Motus and 68% for Pi-0.5. A three-stage pipeline built on a six-layer data pyramid lets Motubrain generalise skills across environments and robot types while remaining precise enough for fine-grained deployment scenarios.
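The latent action framework itself is proprietary, but the underlying pattern is familiar: infer a discrete pseudo-action from consecutive video frames, so unlabelled footage can be turned into (observation, action) training pairs. A minimal VQ-style sketch of that idea follows; every name, dimension, and codebook size here is a hypothetical placeholder, not ShengShu’s design.

```python
import numpy as np

rng = np.random.default_rng(1)

class LatentActionExtractor:
    """Maps a (frame_t, frame_t+1) pair to a discrete latent action by
    projecting the frame difference and snapping to the nearest codebook entry."""
    def __init__(self, frame_dim, latent_dim=8, n_codes=16):
        self.proj = rng.standard_normal((frame_dim, latent_dim)) / np.sqrt(frame_dim)
        self.codebook = rng.standard_normal((n_codes, latent_dim))

    def extract(self, frame_t, frame_next):
        z = (frame_next - frame_t) @ self.proj            # motion embedding
        dists = np.linalg.norm(self.codebook - z, axis=1)
        idx = int(np.argmin(dists))                        # discrete latent action id
        return idx, self.codebook[idx]

# Label an unannotated clip as a sequence of latent actions.
extractor = LatentActionExtractor(frame_dim=64)
video = rng.standard_normal((10, 64))                      # 10 unlabelled "frames"
pseudo_actions = [extractor.extract(video[t], video[t + 1])[0] for t in range(9)]
```

With pseudo-labels like these, human footage and simulation video can feed the same action-prediction objective as real robot trajectories, which is what allows the data pyramid described above to scale without manual annotation.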
Motubrain understands what is happening around it, anticipates what may happen next, and responds in real time. In real-world tests, robots trained with Motubrain have carried out complete, multi-step tasks with a level of adaptability beyond most conventional robotic systems. For example, they can insert flowers into a vase under changing conditions and use both arms independently for different goals. Most notably, Motubrain-trained robots demonstrate a remarkable ability to understand and predict outcomes during execution: when a ladle comes up empty while scooping, they can recognise that nothing has been collected and automatically attempt the scooping action again, despite never being trained on retry data. This marks the shift from robots that merely execute tasks to robots that truly complete them.
Training the Next Generation of Robots
Motubrain is not a research model awaiting commercialisation; it is operational. Several leading robotics companies are already using Motubrain in active robot training programs, deploying its cross-embodiment, multi-skill capabilities on real hardware across industrial, commercial, and home environments.
To further enhance real-world performance, ShengShu has partnered with Astribot, SimpleAI, and Anyverse Dynamics to advance a general-purpose embodied AI brain, focusing on foundation model evolution, multimodal data integration, robust data infrastructure, and full-stack hardware–software optimisation.
Connecting the Dots: Alibaba’s Investment and Motubrain
Motubrain is ShengShu’s next strategic pillar, alongside Vidu, the company’s flagship generative video platform, whose recent Vidu Q3 release ranked No. 1 on the first global Reference-to-Video leaderboard released by SuperCLUE. The two products are distinct in application but continuous at the foundation: the same world model technology that makes Vidu one of the world’s leading video generation systems gives Motubrain its capacity to predict and act in the physical world. Where Vidu generates the world, Motubrain acts in it.
Backed by a $293 million Series B led by Alibaba Cloud, with investors including the China Internet Investment Fund, TAL Education Group, Baidu Ventures, and Luminous Ventures, ShengShu enters the Physical AI era as a leader, with successful live deployments and top benchmark results for its ability to both deeply understand the physical world and act effectively within it.