Goldman Sachs: On-the-Ground Research of Humanoid Robots in China

Goldman Sachs visited 8 humanoid robot companies in Beijing/Shenzhen from May 19 to 20, and held a panel discussion with 3 robotics industry companies (founders/R&D heads) at the Goldman Sachs Technology Network Conference (GSTechNet) in Shanghai on May 21, covering 7 private start-ups and 6 executives in total.

Key Insights

Most industry participants believe that robots must combine general intelligence with practical applications to achieve scale, relying on four core technologies: algorithms, data, computing power, and hardware. While China maintains a strong lead in hardware supply chains, start-ups are now focusing on developing the “brain” of humanoid robots. Companies widely regard Vision-Language-Action (VLA) models as a viable solution, with high-quality real-world data becoming increasingly critical for ensuring physical consistency and task accuracy. Enterprises are investing in “data factories” to collect large-scale real interaction data: one visited company estimates that achieving general-purpose (L3) capabilities requires 10 million hours of data, implying $100-200 million in investment. In computing, NVIDIA’s Jetson Orin remains dominant, but Chinese firms have started exploring collaborations with Huawei.

Given the importance of fine motor control and camera-assisted physical data collection, hardware development is increasingly focused on dexterous hands equipped with tactile and force-feedback sensors. In practical applications, industrial scenarios like material handling and sorting are frequently cited as early commercial opportunities, while consumer-grade applications remain distant due to technical requirements and additional safety/regulatory hurdles. Pricing for humanoid robots varies significantly by functional specifications ($15,000 to $100,000), with the industry expecting scale production and component optimization to drive cost reductions.

Overall, Goldman Sachs holds a positive view on technological progress and long-term prospects for humanoid robots, maintaining its industry forecast of 20,000 global shipments in 2025 and 1.4 million in 2035.
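
For reference, the growth rate implied by the two forecast endpoints can be worked out directly. The snippet below is a minimal sketch that assumes smooth compound growth between 2025 and 2035; the function name is illustrative and the resulting rate is derived here, not quoted from the report.

```python
def implied_cagr(start_units: float, end_units: float, years: int) -> float:
    """Compound annual growth rate taking start_units to end_units over `years` years."""
    return (end_units / start_units) ** (1 / years) - 1


# Forecast endpoints from the text: 20,000 units (2025) and 1.4 million units (2035).
growth = implied_cagr(20_000, 1_400_000, 2035 - 2025)
print(f"Implied CAGR 2025-2035: {growth:.1%}")  # roughly 52.9% per year
```

In other words, the forecast corresponds to a 70-fold increase in annual shipments over ten years.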

Stock Implications

We continue to focus on component suppliers such as Sanhua Intelligent Control (Buy), Leadshine Intelligent (Neutral), Best Group (Neutral), and Mingzhi Electric (Neutral). Goldman Sachs believes companies in the data collection supply chain may emerge as new beneficiaries.

Key Observations

The World Humanoid Robot Conference to be held in Beijing from August 15 to 17 will feature:

1. 13 sports and performance events, including track and field, artistic gymnastics, football, and solo/group dance;

2. 6 application scenario competitions, covering material handling and sorting in factory environments, drug sorting and unpacking in medical settings, and interactive reception and cleaning in hotel scenarios.

Notably, compared to the Humanoid Robot Half-Marathon in Beijing on April 19, 2025 (which primarily showcased hardware reliability/durability and motor control), this conference will more comprehensively demonstrate humanoid robot technologies—particularly intelligence, versatility, and agility.

Detailed Analysis

Humanoid robot enterprises emphasize that commercialization requires robots to possess general intelligence and execute practical tasks, relying on four core technologies: (1) algorithms, (2) data, (3) computing power, and (4) hardware. Hardware and software are mutually reinforcing and evolve in a spiral: hardware advancements drive software improvements and vice versa. While China is widely recognized for its strong hardware supply chain advantage, our visits revealed that start-ups are increasing R&D investments to enhance the “brain” of humanoid robots.

Algorithms: VLA Models as a Viable Solution

In the generative AI domain, large language models (LLMs) are trained on large text corpora and generate relevant text in response to text inputs. The core architecture of Vision-Language-Action (VLA) models is not fundamentally different from that of standard LLMs, except that the inputs become a fusion of visual and language data and the outputs are the intended robot actions. Most companies accept the VLA architecture as their foundation, with some integrating tactile information as an additional input. One executive noted that China may lag the U.S. by 0.5-1 year in the research quality of algorithmic structures but is catching up rapidly.
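
The input/output contract of a VLA model can be sketched as a small interface. This is only an illustration of the data flow (camera image and language instruction in, optional tactile reading, robot action out); all class and field names are hypothetical, and the placeholder policy does no learning or inference.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass
class Observation:
    image: Sequence[Sequence[float]]            # camera frame (stand-in for a pixel array)
    instruction: str                            # natural-language command
    tactile: Optional[Sequence[float]] = None   # optional extra input that some companies add


@dataclass
class Action:
    joint_targets: Sequence[float]              # one target value per actuated joint


class VLAPolicy(Protocol):
    """Anything that maps an observation to an action fits this interface."""

    def act(self, obs: Observation) -> Action: ...


class HoldStillPolicy:
    """Placeholder policy: ignores its inputs and commands zero motion."""

    def __init__(self, num_joints: int) -> None:
        self.num_joints = num_joints

    def act(self, obs: Observation) -> Action:
        return Action(joint_targets=[0.0] * self.num_joints)


obs = Observation(image=[[0.0]], instruction="pick up the bottle")
action = HoldStillPolicy(num_joints=7).act(obs)
print(len(action.joint_targets))  # 7
```

A real model would replace `HoldStillPolicy` with a trained network behind the same `act` signature.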

While VLA is seen as an early successful direction, a start-up CEO cautioned that it may not be the ultimate solution, referencing early recurrent neural networks (RNNs) in deep learning—though important foundations, they were eventually replaced as architectures matured. Nevertheless, current work on VLA is critical for future model breakthroughs, with the field expected to undergo 3-4 major iterations.

Data: High-Quality Data Is Paramount

Training models with data typically involves three steps:

1. Pre-training on Human Work Videos: Teaching the model common-sense knowledge and task structures, e.g., that a water bottle is opened by turning the cap counterclockwise.

2. Supervised Fine-Tuning (SFT): Refining the algorithm using data collected via teleoperation or simulation to improve precision.

3. Reinforcement Learning from Human Feedback (RLHF): Executing intended actions in real, complex environments and optimizing behavior through iterative feedback.

Despite ongoing debate, the majority view holds that high-quality real-world data is most critical. Based on the above training mechanisms, three types of data are currently used: 2D videos, teleoperation data, and simulation data. Our interviews revealed disagreement about which type is most effective for training embodied AI systems. Some argue that simulation data offers the greatest advantages due to its scalability and low cost, while others emphasize real-world data for its ability to capture physical consistency (i.e., adherence to real-world physics). Diverse datasets that test actions across many scenarios are also crucial, and constructing diverse scenarios in simulation is not necessarily cheaper. Overall, most agree on the necessity of a “data recipe”, meaning an effective mix of the three data types, with high-quality real-world data being most critical for two reasons: (1) accuracy (how closely actions match real physics) and (2) diversity (variability in environments, object types, and actions).

Scale requirements of 10 million hours are spawning “data factories”: A startup CEO estimates that achieving L3-level general autonomy requires 10 million hours of real-world robot data. This equates to 10,000 robots or operators working continuously for 2 years, demanding $100-200 million in investment and giving rise to the “data factory” concept. Multiple enterprises have already positioned data collection infrastructure (“data factories”) as strategic investments.
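
The quoted figures can be broken down into per-unit and per-hour terms. The sketch below takes the numbers at face value; the derived ratios (hours per unit, cost per hour) are back-of-the-envelope results computed here, not figures given by the company.

```python
TARGET_HOURS = 10_000_000        # real-world data needed for L3, per the CEO's estimate
FLEET_SIZE = 10_000              # robots or operators
YEARS = 2
BUDGET_USD = (100e6, 200e6)      # low and high ends of the quoted investment range

hours_per_unit = TARGET_HOURS / FLEET_SIZE            # total hours each unit must contribute
hours_per_unit_per_year = hours_per_unit / YEARS
cost_per_hour = tuple(b / TARGET_HOURS for b in BUDGET_USD)

print(f"Hours per unit: {hours_per_unit:,.0f} ({hours_per_unit_per_year:,.0f} per year)")
print(f"Implied cost per data hour: ${cost_per_hour[0]:.0f}-${cost_per_hour[1]:.0f}")
```

On these numbers, each robot or operator contributes about 1,000 hours of usable data over the two years, at an implied all-in cost of roughly $10-20 per recorded hour.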

Debates on Hardware-Specific Limitations: Views diverge on how far data collected from one robot can be applied to another. Some argue that “action outputs” are generally easy to migrate across robots, while others note that although the ultimate goal is to build VLA models that scale across different robot morphologies, current data must still be tightly bound to hardware because the models are at an early stage.

Even within teleoperation data, multiple collection methods exist:

• Vision-based systems (cameras or VR): The lowest cost and most scalable, but with the lowest precision (centimeter to decimeter level).

• Inertial Measurement Unit (IMU) sensors: Installed at joints, achieving millimeter-level precision but suffering from drift/cumulative errors.

• Optical motion capture systems (e.g., solutions provided by FZMotion, a subsidiary of Chenshi Intelligence, or Lingyun Optics): Using multiple cameras and reflective markers, precision can be below 0.1 millimeters.
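
The trade-off among the three capture methods can be expressed as a simple lookup. This is an illustrative sketch: the precision values are representative numbers picked from the ranges quoted above (the upper end of each range), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureMethod:
    name: str
    precision_mm: float   # assumed representative error, in millimeters
    note: str


METHODS = [
    CaptureMethod("vision (camera/VR)", 100.0, "lowest cost, most scalable"),
    CaptureMethod("IMU", 1.0, "drift and cumulative error"),
    CaptureMethod("optical mocap", 0.1, "multiple cameras with reflective markers"),
]


def methods_meeting(required_mm: float) -> list[str]:
    """Return the names of methods whose assumed error is within the requirement."""
    return [m.name for m in METHODS if m.precision_mm <= required_mm]


print(methods_meeting(1.0))   # ['IMU', 'optical mocap']
print(methods_meeting(0.5))   # ['optical mocap']
```

Precision is only one axis; cost and scalability would have to be weighed alongside it when choosing a collection method.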

Limited Effectiveness of Government Support: One company mentioned that the government subsidizes foundation models and some data collection. As for government-supported data collection centers, even if they release open-source datasets, these may help with certain pre-training but are insufficient for vertical-domain fine-tuning or task-level mastery.

Computing Power: NVIDIA Dominates, While Some Enterprises Collaborate with Huawei

Due to edge computing limitations, enterprises are adopting “fast+slow” systems: high-end GPUs such as NVIDIA’s RTX 4090/3090 remain essential for training or handling complex tasks, while edge devices such as Jetson (Orin, Thor) run VLA models for onboard perception, planning, and lightweight inference. However, some local start-ups are partnering with Huawei, reflecting efforts to build domestic computing infrastructure amid geopolitical risks.

Hardware Improvement Focus: Dexterous Hands

All interviewed enterprises highlighted that dexterous hands are critical both for near-term data collection and for medium-to-long-term complex/agile tasks in factories. Most other components of humanoid robots (e.g., reducers, motors, lead screws) have become increasingly commoditized as the supplier base grows and the industry matures, though durability, reliability, and heat dissipation still need improvement. However, dexterous hands remain a major bottleneck for two reasons:

1. Mechanical designs struggle to balance load-bearing capacity, flexibility, and cost;

2. Tactile sensors lack performance and cost competitiveness—these sensors, which collect physical parameters like force, torque, temperature, texture, and friction, are vital for training and improving robot AI models.

A domestic tactile sensor company we interviewed is attempting to (1) establish clear data standards to simplify data collection, processing, and training via dexterous hands, and (2) reduce sensor costs through design, algorithm, and material innovations. Its sensors now cost on average 50% less than overseas counterparts.

Industrial Scenarios (Material Handling/Sorting) May Emerge as Early Applications, While Consumer-Grade Applications Remain Distant

Material handling is often regarded by humanoid robot enterprises as an early commercial opportunity in factories, due to its broad cross-industry demand and relatively high tolerance for task performance (especially accuracy and efficiency). One company claims that its humanoid robot has achieved 95% accuracy in material handling, and another company states that the task speed reaches 30% of human labor (60% if the robot operates two shifts daily), with small-scale applications expected to begin in 2025-2026.
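
The two-shift figure follows from simple scaling. A minimal sketch, assuming the comparison is against a human working a single shift per day and that robot speed is the same in both shifts (the function name is illustrative):

```python
def relative_daily_output(speed_ratio: float, robot_shifts: int, human_shifts: int = 1) -> float:
    """Robot daily output as a fraction of a human's, given relative speed and shift counts."""
    return speed_ratio * robot_shifts / human_shifts


one_shift = relative_daily_output(0.30, robot_shifts=1)
two_shifts = relative_daily_output(0.30, robot_shifts=2)
print(f"One shift: {one_shift:.0%}, two shifts: {two_shifts:.0%}")  # One shift: 30%, two shifts: 60%
```

By the same logic, a robot at 30% of human speed would need more than three shifts' worth of operating time per day to match one human shift.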

Sorting represents another scenario with greater demand (not only in factories but also in retail settings like pharmacies), although it imposes higher speed requirements than material handling. Enterprises also highlight significant demand for palletizing/depalletizing and loading/unloading tasks (e.g., a large EV company employs 20,000 workers for such operations), but many technical bottlenecks persist—such as identifying small objects or those with similar colors.

Additionally, while humanoid robot manufacturers are optimistic about long-term consumer-grade demand (e.g., household chores), they note that the highly diverse consumer environment imposes extreme technical requirements, not to mention regulatory and safety/privacy issues. Separately, commercial robots have already been deployed in guidance and interactive reception scenarios, which place lower demands on VLA performance. According to one start-up, shipments for these use cases have exceeded expectations this year.

Average Prices Vary Significantly Across Specifications and Applications, with Significant Cost Reduction Potential

The companies we visited offer humanoid robots at average prices ranging from $15,000 to $100,000, with the spread mainly reflecting specification differences. Industry consensus is that no universal design currently meets all application needs. Low-end products embed very limited intelligent/AI capabilities and only basic hardware specifications (limited degrees of freedom, no dexterous hands or visual sensors), but can quickly complete basic movements (such as walking and waving) and preset actions (such as dancing). They are mainly sold to universities and AI laboratories for research, or to enterprises/governments for reception or entertainment purposes.

High-end products, by contrast, typically use harmonic reducers (instead of planetary reducers) and ball screws (instead of connecting rods) for arms, and are equipped with visual cameras, force/torque sensors, and dexterous hands with tactile sensors. They aim to perform fine or heavy-duty tasks in factory environments through AI/autonomous functions. Most companies expect annual shipments of hundreds of robots in 2025. Looking ahead, all enterprises agree that humanoid robots have significant cost reduction potential, driven by production optimization (such as specialized equipment and production lines) and output growth, which will spread depreciation and upfront development costs (such as R&D and molds) over more units and improve return on investment (ROI) for manufacturing clients.