2025/11/17

When Unitree's latest H2 humanoid robot dances gracefully to a rhythm, and when the dexterous hand of Zhiyuan's G2 robot draws a bow and shoots an arrow, these seemingly effortless "human-like movements" rest on a core element that is often overlooked: motion capture technology.
At the exhibition of IROS 2025 (the IEEE/RSJ International Conference on Intelligent Robots and Systems), the optical-inertial hybrid motion capture device showcased by CHINGMU Qing Tong Vision became a highlight. It can precisely capture the subtle movements of finger joints with just 2 to 3 cameras, and it works stably in environments with strong reflections such as glass and metal, addressing long-standing pain points of traditional motion capture such as "data loss due to occlusion" and "inertial drift".
Zhang Haiwei, CEO of CHINGMU Qing Tong Vision, a leading motion capture company in China, said in a recent exclusive interview with the Robot Lecture Hall that the value of motion capture for humanoid robots goes far beyond merely recording movements. It is a coach that teaches robots to walk and work, a judge that tests robot performance, and key infrastructure for solving the industry's data scarcity and promoting real-world deployment.
From the motion optimization of Unitree's humanoid robots to the development of dexterous hands for Zhiyuan's robots, from performance evaluation in the laboratory to skill assessment in the factory, motion capture is becoming the invisible driving force that moves humanoid robots from "laboratory prototypes" to "industrial products".

# The Evolution Trilogy of Humanoid Robots
"Once humanoid robots are produced, they are just like newborn babies - they need to be taught and tested." Zhang Haiwei used a vivid analogy to highlight the core value of motion capture technology.
In the humanoid robot industry, motion capture plays a central role in the two key aspects of "training" and "evaluation", and these two aspects are further divided into three levels: motion intelligence, operational intelligence, and interactive intelligence. These levels progress step by step to enable robots to evolve from simply being "able to move" to "able to move well".
The "coach" role of motion capture shows first in helping robots develop motion intelligence, their most fundamental survival ability. This is the training side, where robots learn to move like humans. For example, the dance movements of Unitree's H2 and the jumping postures of the Expedition robots are essentially trained by using motion capture equipment to record human large-joint movement data, such as the angle changes of the hip and knee joints, and then "replicating" that data onto the robots.
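To make "angle changes of the hip and knee joints" concrete, here is a minimal Python sketch that derives a joint angle from three marker positions in a single frame. The function name, the one-marker-per-joint layout, and the coordinates are illustrative assumptions, not CHINGMU's pipeline; real systems typically fit a full skeleton to many markers before retargeting the motion to a robot.

```python
import math

def joint_angle_deg(proximal, joint, distal):
    """Angle at `joint` in degrees between the segments joint->proximal and joint->distal."""
    u = [p - j for p, j in zip(proximal, joint)]
    v = [d - j for d, j in zip(distal, joint)]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("markers must not coincide")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

# Hypothetical marker positions in meters (x, y, z) for one captured frame:
# thigh pointing up from the knee, shank pointing forward (a deeply bent knee).
hip, knee, ankle = (0.0, 0.0, 1.0), (0.0, 0.0, 0.5), (0.0, 0.5, 0.5)
knee_angle = joint_angle_deg(hip, knee, ankle)
```

Running the same function on every frame yields a joint-angle trajectory, which is the kind of signal that can then be mapped onto a robot joint with a similar range of motion.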
Zhang Haiwei explained that the core of motion intelligence is to enable robots to master balance and coordination, such as how the center of gravity shifts when walking and how the joints cooperate when turning. All of this data has to be collected through high-precision motion capture.
More important than being able to run and jump is the robot's operational intelligence, that is, the ability to "do work". This is also the core demand of enterprises like Zhiyuan and Youbi Select in developing dexterous hands.
Reportedly, CHINGMU Qing Tong Vision launched an optical finger motion capture device last year that captures subtle human finger movements such as flexion, extension, and grasping. Most dexterous hand enterprises in China, such as Lingzhuo Intelligent, are using this device to train robots.
"For instance, when grasping a cup, human fingers adjust their force to the shape of the cup. The bending-angle error of the fingertip joints cannot exceed 1 degree, and traditional motion capture simply cannot achieve such precision," Zhang Haiwei said. CHINGMU Qing Tong Vision's finger motion capture, by contrast, uses actively illuminated, encoded marker points to distinguish the joint positions of each finger, and can even capture the force variation data when "pinching a piece of paper".

In Zhang Haiwei's view, the long-term goal of humanoid robots is interactive intelligence, which means enabling the robots to freely interact with humans and the environment. This scope is broader than that of operational intelligence: not only should they interact with objects, such as tightening screws or opening doors, but also with humans, such as delivering items while avoiding someone's arm, and even collaborating with other robots.
"This is the true embodied intelligence," Zhang Haiwei emphasized. "For instance, in a factory, two robots work together to assemble parts. One hands over tools and the other screws them in. The coordination of their movements requires precise temporal synchronization. This necessitates motion capture equipment to record the human collaborative movement logic and then convert it into interaction data for the robots." This is the foundation for the practical application of humanoid robots.
# From "Coach" to "Examiner"
If training is about teaching skills, then evaluation is about "qualification assessment".
For humanoid robots, the technical roadmap is still in the early stage, and designs have not yet converged. Any motion that has not undergone rigorous evaluation is an unsafe motion — this is another core value of motion capture technology.
In the R&D phase of humanoid robots, motion capture undoubtedly serves as a performance tuning tool. For instance, CHINGMU’s optical motion capture systems can record the motion trajectories of a humanoid robot’s hip and ankle joints, compare them with data from normal human walking, and identify the root causes of abnormal movements, which may be knee joint angle errors or mistimed center-of-gravity shifts.
“It’s like performing a motion CT scan for the robot,” said Zhang Haiwei. “Traditional tuning relies on visual observation, which involves large errors and low efficiency. Motion capture, however, controls motion accuracy to the sub-millimeter level and improves tuning efficiency by more than 10 times.”
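As an illustration of how such a comparison can separate "angle errors" from "mistimed" motion, the sketch below aligns a robot joint trajectory against a human reference by searching for the frame shift with the lowest RMS error. The function name, the synthetic sine-shaped knee trajectory, and the 3-frame delay with a 2-degree offset are all assumptions made for the example, not measured data or CHINGMU's method.

```python
import math

def best_lag(reference, measured, max_lag):
    """Return (lag, rms_error) for the frame shift of `measured` that best matches
    `reference`. A positive lag means the measured motion happens later."""
    best_err, best_shift = float("inf"), 0
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(reference[i], measured[i + lag])
                 for i in range(len(reference))
                 if 0 <= i + lag < len(measured)]
        err = math.sqrt(sum((r - m) ** 2 for r, m in pairs) / len(pairs))
        if err < best_err:
            best_err, best_shift = err, lag
    return best_shift, best_err

def knee_angle(frame):
    """Synthetic knee angle in degrees with a 50-frame gait cycle (illustrative only)."""
    return 30.0 + 30.0 * math.sin(2 * math.pi * frame / 50)

human = [knee_angle(i) for i in range(100)]
# Hypothetical robot: same motion, 3 frames late, with a constant 2-degree angle error.
robot = [knee_angle(i - 3) + 2.0 for i in range(100)]

lag, residual = best_lag(human, robot, max_lag=10)
```

The recovered lag points to a timing problem, while the residual error that remains after alignment points to an angle problem, so the two causes can be tuned separately.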
At the production stage, motion capture acts as a "quality inspector". The evaluation logic of robotic vacuums offers a reference: just as robotic vacuums are tested for obstacle-avoidance accuracy and path-planning error, humanoid robots need to be evaluated on indicators such as walking stability and repetitive positioning accuracy.
For example, when a manufacturer claims that a robot "can walk continuously for 10 kilometers without falling", a motion capture system can record its gait data on different surfaces, including under simulated minor collisions, to test its resistance to disturbance.
“Many companies now claim their robots won’t fall over when kicked, but how much force they can withstand, and on which surfaces, requires quantitative evaluation via motion capture,” Zhang Haiwei added. Currently, several domestic institutions have partnered with CHINGMU to build humanoid robot evaluation lines that focus on testing "repetitive positioning accuracy". For example, when a robot repeatedly grasps a screw at the same position, the error must not exceed 0.5 mm, a core indicator for automated factory production.
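The article does not say how the evaluation lines compute this figure. One common convention, assumed here, is the ISO 9283 style of position repeatability: the mean distance of the attained positions from their centroid plus three sample standard deviations. The sketch below applies it to hypothetical grasp positions and checks the 0.5 mm limit mentioned above.

```python
import math
import statistics

def position_repeatability(points):
    """ISO 9283-style repeatability: mean distance to the centroid plus 3 sample std devs."""
    n = len(points)
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    dists = [math.dist(p, centroid) for p in points]
    return statistics.mean(dists) + 3 * statistics.stdev(dists)

# Hypothetical fingertip positions (mm) captured over five repeated grasps of one screw.
attained = [
    (100.02, 50.01, 20.00),
    (99.98, 49.99, 20.01),
    (100.01, 50.00, 19.99),
    (99.99, 50.02, 20.00),
    (100.00, 49.98, 20.00),
]

LIMIT_MM = 0.5  # threshold quoted in the article
rp = position_repeatability(attained)
passes = rp <= LIMIT_MM
```

A real test would use many more repetitions and would require the capture system's own accuracy to be well below the limit being verified.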
When robots enter application scenarios, motion capture can also conduct skill assessments.
Just as people must pass Subjects 2 and 3 of the Chinese driving test (the practical driving sections) to obtain a driver’s license, robots must pass "skill exams" to work in factories. For the "screw-driving skill", motion capture records the robot’s rotation speed, force, and angle to determine whether it meets factory production standards. For "home service skills", it checks whether the robot bumps into people when handing over objects or pinches hands when opening doors.
“Only robots that pass evaluation can be truly deployed,” Zhang Haiwei emphasized. “This is also a key focus of CHINGMU: cooperating with testing authorities to establish a skill standard system for humanoid robots through motion capture, further promoting the standardization of humanoid robots.”
---
# Why Optical-Inertial Hybrid Solutions Represent the Future of Motion Capture
Despite the significant value of motion capture, traditional technologies have long suffered from two bottlenecks: optical motion capture is vulnerable to **occlusion**, and inertial motion capture is prone to **drift**.
These issues are particularly prominent in humanoid robotics: robot fingers and joints easily block marker points, while cumulative errors in inertial systems cause robots to "drift off course" over time.
At the IROS conference, CHINGMU launched its **optical-inertial hybrid motion capture solution**, an innovative response to these two pain points. The solution integrates the high precision of optical motion capture with the continuity of inertial motion capture, and incorporates unique designs such as active light-emitting encoding and magnetometer elimination. This greatly improves the applicability of optical motion capture in robotic scenarios, delivering a truly practical optical-inertial fusion solution.
Zhang Haiwei explained that traditional optical-inertial hybrids mostly adopt "loose coupling", where optical and inertial systems each output complete motion data, which are then averaged and fused. The drawbacks are obvious: when optical data is lost, drifting inertial data degrades overall accuracy; when inertial drift occurs, occluded optical data cannot be corrected.

CHINGMU’s tightly coupled solution takes a fundamentally different approach. Instead of relying on finished data from either optical or inertial systems, it directly draws on raw data from both sources, namely pixel information from optical tracking and acceleration and angular velocity data from inertial sensors, and performs real-time interactive calibration through advanced algorithms. For example, when an optical marker is occluded by a finger and lost, raw inertial data temporarily fills the gap, while referencing previously recorded optical position data to prevent drift. When minor errors occur in the inertial system, they are instantly corrected by optical pixel data.
“A loosely coupled solution is like five experts scoring independently and then averaging the results: with no communication, all their flaws accumulate. A tightly coupled solution, by contrast, is like five experts discussing and scoring together, where strengths complement each other and weaknesses cancel out,” explained Zhang Haiwei.
This system delivers far stronger data continuity while maintaining high positioning accuracy, fully meeting the real-time training requirements of robots. Furthermore, the robust design featuring magnetometer-free operation and active light-emitting markers enables CHINGMU’s solution to adapt to complex environments, making it a standout highlight of the system.
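The behavior described above, where inertial data bridges an occlusion and optical data keeps drift in check, can be illustrated with a heavily simplified one-dimensional Kalman filter. This is a textbook technique substituted for illustration, not CHINGMU's algorithm: a real tightly coupled system works on camera pixel coordinates and full 3D orientation. The bias value, noise variances, frame rate, and occlusion window are all assumptions.

```python
def fuse(accels, optical, dt, accel_var=0.05, optical_var=1e-6):
    """1D Kalman filter over the state [position, velocity].
    accels:  raw accelerometer samples (m/s^2), one per frame.
    optical: optical position (m) per frame, or None while the marker is occluded."""
    x, v = 0.0, 0.0                    # state estimate
    p11, p12, p22 = 1.0, 0.0, 1.0      # symmetric 2x2 covariance
    out = []
    for a, z in zip(accels, optical):
        # Predict: integrate the raw inertial sample.
        x += v * dt + 0.5 * a * dt * dt
        v += a * dt
        n11 = p11 + 2 * dt * p12 + dt * dt * p22 + accel_var * dt ** 4 / 4
        n12 = p12 + dt * p22 + accel_var * dt ** 3 / 2
        n22 = p22 + accel_var * dt * dt
        p11, p12, p22 = n11, n12, n22
        # Update: correct with the optical measurement whenever the marker is visible.
        if z is not None:
            s = p11 + optical_var
            k1, k2 = p11 / s, p12 / s
            residual = z - x
            x += k1 * residual
            v += k2 * residual
            p11, p12, p22 = (1 - k1) * p11, (1 - k1) * p12, p22 - k2 * p12
        out.append(x)
    return out

# Hypothetical scenario: a stationary sensor sampled at 100 Hz for 2 s.
# The accelerometer has a 0.2 m/s^2 bias; the marker is occluded for frames 100 to 119.
dt, n = 0.01, 200
accels = [0.2] * n
optical = [None if 100 <= i < 120 else 0.0 for i in range(n)]

fused = fuse(accels, optical, dt)
dead_reckoned = fuse(accels, [None] * n, dt)   # inertial data only, no optical correction
```

With inertial data alone the estimate drifts to roughly 0.4 m after two seconds, while the fused estimate stays close to the true position and only degrades slightly during the short occlusion.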

Another pain point of inertial motion capture is **magnetic field interference**. Mobile phones, computers, and metal equipment can all disrupt the direction judgment of the magnetometer, causing the robot to “drift off course.”
CHINGMU’s solution is straightforward: **remove the magnetometer entirely**. It uses optical data to calibrate the inertial orientation in real time, combined with dedicated algorithms, to completely solve the drift problem.
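One simple way to picture heading correction without a magnetometer is a complementary filter: integrate the gyroscope for continuity and nudge the result toward an optically derived heading whenever one is available. The sketch below is that generic technique under assumed values (gyro bias, gain, frame rate), not the dedicated algorithm the article refers to.

```python
import math

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def track_yaw(gyro_rates, optical_yaws, dt, gain=0.1):
    """Integrate gyro yaw rate (rad/s) and pull the estimate toward the optical yaw
    (rad) on frames where it is available (not None)."""
    yaw = 0.0
    history = []
    for rate, ref in zip(gyro_rates, optical_yaws):
        yaw = wrap_angle(yaw + rate * dt)              # drifts if the gyro is biased
        if ref is not None:
            yaw = wrap_angle(yaw + gain * wrap_angle(ref - yaw))
        history.append(yaw)
    return history

# Hypothetical stationary sensor: true yaw is 0, gyro bias is 0.02 rad/s, 100 Hz for 60 s.
dt, n = 0.01, 6000
gyro = [0.02] * n
gyro_only = track_yaw(gyro, [None] * n, dt)
corrected = track_yaw(gyro, [0.0] * n, dt)
```

In this toy setup the uncorrected heading drifts by about 1.2 rad in a minute, while the optically corrected heading stays within a small fraction of a degree of the truth.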
To address the issues of traditional optical motion capture — sensitivity to reflections and noise — CHINGMU has innovated **active light-emitting encoded markers**.
Traditional markers rely on reflected light from cameras, making them vulnerable to glare from glass and metal, and even dust in the air can create “false markers.”
CHINGMU’s markers are self-illuminating, with each LED flashing at a uniquely coded frequency, much like assigning an “ID card” to each marker.
By recognizing the flashing code, the camera can distinguish real markers from interference, so no data is lost even when operating near glass surfaces or metal parts.
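The "ID card" idea can be sketched as matching each bright blob's on/off history against a codebook. The article does not describe the actual encoding, so the 8-frame binary patterns, the marker names, and the one-bit error tolerance below are assumptions; the sketch also assumes the camera frames are synchronized with the start of each code.

```python
# Hypothetical codebook: marker ID -> on/off pattern over 8 consecutive frames.
CODEBOOK = {
    "index_tip": (1, 0, 1, 1, 0, 0, 1, 0),
    "index_pip": (1, 1, 0, 1, 0, 1, 0, 0),
    "thumb_tip": (1, 0, 0, 1, 1, 0, 1, 1),
}

def hamming(a, b):
    """Number of frames in which two on/off patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def identify(observed, max_bit_errors=1):
    """Return the best-matching marker ID, or None if the blob looks like interference."""
    best_id, best_dist = min(
        ((marker_id, hamming(code, observed)) for marker_id, code in CODEBOOK.items()),
        key=lambda item: item[1],
    )
    return best_id if best_dist <= max_bit_errors else None

clean = identify((1, 0, 1, 1, 0, 0, 1, 0))        # exact code for index_tip
noisy = identify((1, 0, 1, 1, 0, 0, 1, 1))        # index_tip with one corrupted frame
reflection = identify((1, 1, 1, 1, 1, 1, 1, 1))   # steady glare from glass or metal
glint = identify((0, 0, 0, 1, 0, 0, 0, 0))        # brief sparkle from dust
```

A steady reflection or a one-frame glint matches no code and is rejected, while a real marker is still recognized when a single frame is corrupted, because the example codes differ from each other in at least three frames.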

"In the past, finger motion capture required more than 20 cameras to avoid occlusion, but now 2 to 3 cameras are sufficient," Zhang Haiwei explained. Cutting the camera count by more than 80% lowers equipment costs and simplifies deployment, so motion capture setups can be built quickly in factory workshops, home kitchens, and other environments. The system can even move along production lines with workers to collect data, laying the foundation for "concomitant data collection" (capturing data alongside people as they work in real settings), which supports the real-world deployment of humanoid robots.

# Solving the Data Thirst of Humanoid Robots
“The entire industry is currently stuck on data — without data, robots cannot perform tasks properly,” Zhang Haiwei stated frankly. The data demand of humanoid robots far exceeds that of ChatGPT and autonomous driving.
ChatGPT only requires textual data, and autonomous driving operates in a “two‑dimensional space with no interaction,” whereas humanoid robots involve “three‑dimensional space with strong interaction.” They require multi‑dimensional data including motion, tactile feedback, environment, and object properties, with a data volume potentially more than 1,000 times that of autonomous driving.
To address this pain point, CHINGMU is advancing the construction of a “high‑quality humanoid robot dataset” and proposing a “multi‑dimensional quality standard,” aiming to solve the industry’s problems of insufficient data volume, low quality, and lack of universality.
The first criterion for high‑quality data is **multimodality**. Humanoid robots cannot rely on motion data alone; they also need tactile, environmental, and object data. For a simple task such as “driving a screw,” in addition to the motion trajectory of finger joints, it is also necessary to collect pressure changes at the fingertips (tactile data), material hardness of the screw (object properties), and workbench height (environmental data).
“When a robot drives a screw, excessive force may damage the thread, while insufficient force results in a loose fit — all of which rely on tactile data,” Zhang Haiwei explained.
Furthermore, these data must achieve the more critical feature of **spatiotemporal alignment**, meaning motion, tactile, and environmental data must be fully synchronized in time and space. For instance, pressure data and screw position data must correspond to the exact moment the finger applies force; otherwise, the trained robot will suffer from “mismatched motion and force.”
CHINGMU’s motion capture equipment has achieved **microsecond‑level spatiotemporal synchronization**, ensuring consistent timestamps across all data and providing precise support for subsequent training.
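Once all streams carry timestamps from a shared clock, temporal alignment can be as simple as resampling one stream onto the other's timeline. The sketch below linearly interpolates a tactile pressure signal at the motion capture frame times; the sample rates, values, and function name are assumptions for illustration, and it covers only the time axis, not spatial calibration.

```python
from bisect import bisect_left

def resample(src_t, src_v, target_t):
    """Linearly interpolate the series (src_t, src_v) at each time in target_t.
    src_t must be sorted; target times outside the source range yield None."""
    out = []
    for t in target_t:
        if t < src_t[0] or t > src_t[-1]:
            out.append(None)
            continue
        i = bisect_left(src_t, t)          # first source index with src_t[i] >= t
        if src_t[i] == t:
            out.append(src_v[i])
            continue
        t0, t1 = src_t[i - 1], src_t[i]
        w = (t - t0) / (t1 - t0)
        out.append(src_v[i - 1] + w * (src_v[i] - src_v[i - 1]))
    return out

# Hypothetical streams, timestamps in microseconds on a shared clock.
mocap_t = [0, 10_000, 20_000]                          # 100 Hz motion capture frames
tactile_t = [0, 4_000, 8_000, 12_000, 16_000, 20_000]  # 250 Hz fingertip pressure
tactile_v = [0.0, 0.4, 0.8, 1.2, 1.6, 2.0]             # pressure in newtons

pressure_at_frames = resample(tactile_t, tactile_v, mocap_t)
```

Each motion capture frame then has a pressure value for the same instant, which is what prevents the "mismatched motion and force" problem described above.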
The second criterion for high‑quality data is the “Three Highs”: **high precision, high sensitivity, and high degrees of freedom**.
High precision means motion errors must be controlled within 0.1 mm.
High sensitivity applies to tactile data — for example, a force change of 0.1 gram upon finger contact with an object must be detectable.
High degrees of freedom serve the versatility of humanoid robots: data must cover full‑joint motion of the fingers, arms, and torso, enabling complete recording of complex actions such as “picking up a hair with tweezers” or “opening a door with a key.”
“If a dataset has low degrees of freedom, the trained robot can only perform simple actions, unable to use tools or adapt to different scenarios,” Zhang Haiwei emphasized. Only truly high‑quality data can support general‑purpose manipulation in robots, covering more scenarios and enabling easier transfer and reuse across different robot platforms.

The third criterion is **authenticity**. Traditional data collection relies on simulated environments, for example a mock production line built in a lab, which is costly and unrealistic.
CHINGMU’s innovation lies in **concomitant data capture**, that is, collecting data on site alongside workers in real production settings. Such real-scene data is far more valuable than simulated lab data.
“We can’t build 1,000 different production lines in a lab, but we can collect data from 1,000 factories,” Zhang Haiwei explained. Concomitant capture not only reduces costs but also captures implicit human skills and experience that cannot be replicated in a lab, yet are critical for robot deployment and optimization.
The fourth criterion is **simple post-processing**, so that data is usable and easy to apply.
Drawing on its deep experience in the film and animation industry, CHINGMU observed that in most productions one day of captured footage requires about 10 days of data cleaning. The robotics industry, however, lacks expertise in data cleaning and cannot afford such a process.
CHINGMU’s datasets feature minimal post-processing: data noise is below 1%, allowing raw data to be used directly for training without refinement.
Humanoid robots need real-time usable data, not polished or aesthetically optimized data.
This has become one of the core advantages making CHINGMU’s data capture trusted by industry leaders such as Unitree and Zhiyuan.
---
# Conclusion: Motion Capture Is More Than Just “Recording Movements”
Once robots accumulate sufficient expert motion data, they can also **teach humans in return**.
During the conversation, Zhang Haiwei outlined a future integrating humanoid robots and motion capture systems.
For instance, a badminton coach robot could use motion capture to compare a trainee’s posture with professional players and correct form in real time.
A skilled technician robot in a factory could demonstrate high-precision screw-driving motions to help new workers master skills quickly.
“Robots can aggregate the experience of 100 experts, achieving greater precision than human coaches,” Zhang Haiwei said.
This represents the extended value of embodied intelligence: with motion capture technology, robots evolve from tools into teachers.

In CHINGMU’s roadmap, the commercialization of humanoid robots will likely unfold in three stages. The first stage is **factory scenarios**, where motion capture is used to train robots to perform standardized tasks such as screw driving and parts assembly. The second stage is **elderly care scenarios**, focusing on training robots to assist seniors with dressing, medication delivery and other tasks, with an emphasis on safe and gentle physical interaction. The third stage is **home scenarios**, enabling robots to master cooking, cleaning, childcare and other daily chores. This will require extensive concomitant data collection in real home environments, where motion capture’s dual capabilities of “training + evaluation” will be indispensable.
Today, from the dance optimization of Unitree H2 and the precise grasping of Zhiyuan’s dexterous hands, to the optical-inertial hybrid system showcased at IROS and the testing line built with Zhejiang Institute of Quality Science, CHINGMU is demonstrating that motion capture is not merely a tool for the film and animation industry; it can also serve as critical infrastructure for the humanoid robot industry. It solves the challenge of robots “learning movements”, establishes standards for robots to “qualify for real-world tasks”, and addresses the industry’s pain point of data shortage.
As more robots master precise movements through motion capture, as more enterprises adopt high-quality datasets, and as motion capture becomes an integral sensory capability for robots, the day when humanoid robots move from laboratories into homes and factories may come sooner than expected.