2018 – 2025: A Detailed Explanation of NVIDIA’s Key Layout in the Field of Embodied Intelligence Robotics
NVIDIA CEO Jensen Huang has repeatedly emphasized that "the next wave of AI will be embodied intelligence." Guided by this conviction, NVIDIA has been building out its position in embodied intelligence since 2018, working to establish a complete technical closed loop and the underlying development ecosystem.
Key Layout of NVIDIA in the Field of Embodied Intelligence (Compiled by Machine Awakening Era)
2018 Layout in Embodied Intelligence: Preliminary Construction of the Robot Development Platform
1. June 2018: NVIDIA Launches the Isaac Robot Development Platform
In June 2018, NVIDIA launched the NVIDIA Isaac robot development platform, comprising hardware (the Jetson Xavier computing platform) and a suite of software tools (Isaac SDK, Isaac IMX, and Isaac Sim), laying the initial infrastructure for robot development, training, and validation.
1) Hardware
Jetson Xavier: a computing platform designed specifically for robotics. The Xavier SoC integrates more than 9 billion transistors and delivers over 30 trillion operations per second. The chip is equipped with six kinds of high-performance processors: a Volta-architecture GPU with Tensor Cores, an octa-core ARM64 CPU, two deep learning accelerators (DLAs), an image processor, a vision accelerator, and a video encoder/decoder. With this hardware configuration, Jetson Xavier can run dozens of algorithms in real time simultaneously, covering tasks such as sensor processing, ranging, localization and mapping, visual perception, and path planning.
2) Software
NVIDIA provides a complete set of robot learning software tools for Jetson Xavier, covering the entire workflow of simulation, training, validation, and deployment:
Isaac SDK: A runtime framework containing application programming interfaces (APIs) and tools, equipped with fully accelerated libraries for developing robotic algorithm software.
Isaac IMX (Intelligent Machine Acceleration): a collection of robotics algorithm software developed by NVIDIA. It provides pre-built, optimized algorithms for robot development, covering application domains such as sensor processing, vision and perception, and localization and mapping.
Isaac Sim: It provides developers with a highly realistic virtual simulation environment for autonomous training and supports hardware-in-the-loop testing using Jetson Xavier.
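To make the sense-plan-act workflow that this stack accelerates more concrete, here is a minimal Python sketch. It is purely illustrative: the class and method names are hypothetical stand-ins, not Isaac SDK APIs.

```python
# Illustrative sense-plan-act loop; class and method names are hypothetical
# stand-ins, not Isaac SDK APIs.
import time

class ToyRobotPipeline:
    def sense(self):
        # Real stack: camera/LiDAR drivers plus sensor processing.
        return {"image": None, "ranges": []}

    def perceive(self, obs):
        # Real stack: GPU/DLA inference for detection, localization, mapping.
        return {"obstacles": [], "pose": (0.0, 0.0, 0.0)}

    def plan(self, state, goal):
        # Real stack: path planning against the mapped environment.
        return {"linear": 0.2, "angular": 0.0}

    def act(self, command):
        # Real stack: motor controllers consume the velocity command.
        pass

def run(pipeline, goal, hz=30, steps=300):
    for _ in range(steps):
        command = pipeline.plan(pipeline.perceive(pipeline.sense()), goal)
        pipeline.act(command)
        time.sleep(1.0 / hz)

run(ToyRobotPipeline(), goal=(1.0, 0.0))
```

On Jetson hardware, the point is that every stage of such a loop can run accelerated and in parallel, which is what allows dozens of algorithms to execute in real time.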
2022 Layout in Embodied Intelligence: Technical Iteration and Ecosystem Expansion
2. March 2022: NVIDIA Launches the Isaac Nova Orin Reference Platform
In March 2022, NVIDIA officially unveiled the Isaac Nova Orin reference platform at its GTC conference. Designed specifically for autonomous mobile robots (AMRs), this compute-and-sensor platform comprises up to two Jetson AGX Orin computers (up to 550 TOPS combined) and a sensor suite for next-generation AMRs, aiming to accelerate AMR development and deployment.
The Jetson AGX Orin is equipped with an NVIDIA Ampere architecture GPU, Arm Cortex-A78AE CPU, and new-generation deep learning & vision accelerators.
Basic Parameter Information of the Jetson AGX Orin Series (Compiled by Machine Awakening Era)
The sensor suite includes six cameras (2 depth-sensing cameras and 4 wide-angle cameras), three LiDARs (2 2D LiDARs for navigation and 1 optional 3D LiDAR for mapping), and eight ultrasonic sensors.
Note (follow-up expansion): In March 2024, NVIDIA partnered with Segway-Ninebot to release the Nova Orin developer kit, optimized for the Nova Carter AMR platform and pre-installed with the Isaac Perceptor stack to further streamline downstream development.
3. November 2022: NVIDIA Releases the Embodied Agent Framework MineDojo
On November 22, 2022, NVIDIA launched MineDojo, an open-source framework for building embodied agents with internet-scale knowledge. Built on the game Minecraft, it is designed for embodied intelligence agent research.
The framework provides the three elements most critical to embodied agents: an environment supporting a multitude of tasks and goals, a large-scale multimodal knowledge base, and a flexible, scalable agent architecture, laying an important foundation for the research and development of embodied intelligence agents.
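MineDojo is distributed as an open-source Python package, and a minimal interaction loop follows the familiar reset/step pattern. The sketch below is based on the project's published examples; the task ID and keyword arguments should be verified against the current MineDojo documentation.

```python
# Minimal MineDojo loop based on the project's published examples; verify
# the task ID and keyword arguments against the current documentation.
import minedojo

env = minedojo.make(
    task_id="harvest_wool_with_shears_and_sheep",  # a benchmark task
    image_size=(160, 256),                         # RGB observation size
)

obs = env.reset()
for _ in range(100):
    action = env.action_space.no_op()  # replace with an agent's policy
    obs, reward, done, info = env.step(action)
    if done:
        break
env.close()
```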
2023 Layout in Embodied Intelligence: Integration of Generative AI and Robotics
4. May 2023: NVIDIA Releases the Voyager Agent
In May 2023, NVIDIA, in collaboration with researchers from Caltech, the University of Texas at Austin, Stanford University, and Arizona State University, released the Voyager agent.
The Voyager agent consists of three core components: an automatic curriculum, an iterative prompting mechanism, and a skill library.
Voyager is an LLM-driven, lifelong-learning embodied agent that demonstrates how large language models can enable agents to learn and explore complex tasks, offering new perspectives and directions for AI development. Within the Minecraft virtual environment, it autonomously explores, generates tasks based on the environment and its own state, continuously learns new skills, and stores them in its skill library. This capacity to interact with the environment and evolve through learning is exactly what embodied intelligence agents require.
Note: Minecraft is a sandbox game developed by Mojang Studios, a subsidiary of Microsoft.
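The interplay of Voyager's three components can be summarized in pseudocode. The following is a paraphrase of the published description rather than Voyager's actual code; llm, minecraft, and their methods are hypothetical placeholders.

```python
# Paraphrased Voyager loop; llm, minecraft, and their methods are placeholders.
def voyager_loop(llm, minecraft, max_retries=4, episodes=100):
    skill_library = {}  # task -> verified, executable skill code
    for _ in range(episodes):
        # 1) Automatic curriculum: propose the next task from the agent's
        #    current state and the skills it already has.
        task = llm.propose_task(minecraft.agent_state(), list(skill_library))

        # 2) Iterative prompting: generate code, execute it in the game,
        #    and refine using execution errors and environment feedback.
        program, success, feedback = None, False, None
        for _ in range(max_retries):
            program = llm.write_code(task, skill_library, feedback)
            success, feedback = minecraft.execute(program)
            if success:
                break

        # 3) Skill library: store verified programs for later reuse.
        if success:
            skill_library[task] = program
    return skill_library
```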
5. October 2023: NVIDIA Launches Eureka
In October 2023, NVIDIA launched Eureka, an AI system dedicated to robot training. Its main function is to automatically generate and optimize reward functions by combining generative AI with reinforcement learning, improving the efficiency and performance of robot training.
Working Principle: Eureka is driven by the GPT-4 large language model and employs a hybrid-gradient architecture: the outer loop runs GPT-4 to refine reward functions, while the inner loop runs reinforcement learning to train robot controllers. Eureka takes unmodified environment source code and a natural-language task description as context and generates executable reward functions zero-shot from the coding LLM. The framework then iterates among reward-function sampling, GPU-accelerated reward evaluation, and reward reflection to progressively improve the quality of the generated reward functions.
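In pseudocode, this hybrid-gradient loop looks roughly as follows. The callables passed in are placeholders standing in for GPT-4 queries and GPU-parallel RL training, not Eureka's real interfaces.

```python
# Paraphrased Eureka loop; generate_reward, train_policy, and reflect are
# placeholders for GPT-4 queries and GPU-parallel RL training, not real APIs.
def eureka(env_source, task_text, generate_reward, train_policy, reflect,
           generations=5, samples=16):
    best_fn, best_score, reflection = None, float("-inf"), ""
    for _ in range(generations):
        # Outer loop (gradient-free): GPT-4 samples candidate reward
        # functions zero-shot from env code, task text, and prior reflection.
        candidates = [generate_reward(env_source, task_text, reflection)
                      for _ in range(samples)]

        # Inner loop (gradient-based): train an RL policy against each
        # candidate in GPU-accelerated simulation and score the result.
        scores = [train_policy(env_source, fn) for fn in candidates]

        top = max(range(samples), key=scores.__getitem__)
        if scores[top] > best_score:
            best_fn, best_score = candidates[top], scores[top]

        # Reward reflection: a textual summary of training statistics steers
        # the next generation of reward functions toward targeted refinements.
        reflection = reflect(candidates[top], scores[top])
    return best_fn
```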
Application Scenarios: Eureka is mainly applied to training robots on complex tasks, especially those requiring fine-grained control and advanced skills, such as dexterous manipulation and the learning of complex motions.
2024 Layout in Embodied Intelligence: NVIDIA Launches the General-Purpose Foundation Model GR00T
6. February 2024: NVIDIA Establishes the Generalist Embodied Agent Research (GEAR) Lab
In February 2024, NVIDIA established the Generalist Embodied Agent Research (GEAR) lab, led by Jim Fan and Yuke Zhu. It is dedicated to building foundation models for embodied agents in virtual and physical worlds, with a focus on four research areas: multimodal foundation models, general-purpose robot models, virtual-world agents, and simulation and synthetic data.
7. March 2024: NVIDIA Launches Project GR00T, a General-Purpose Foundation Model for Humanoid Robots
In March 2024, at the GTC developer conference, NVIDIA launched Project GR00T, a general-purpose foundation model for humanoid robots. Through understanding natural-language text and speech and imitation learning from human behavior videos and live demonstrations, it accelerates humanoid robots’ learning and coordination of diverse skills, enabling them to adapt to and interact with the real world.
NVIDIA Launches Project GR00T, a General-Purpose Foundation Model for Humanoid Robots (Image Source: NVIDIA)
In addition, NVIDIA announced partnerships with multiple humanoid robotics companies, including 1X Technologies, Agility Robotics, Apptronik, Boston Dynamics, Figure AI, Sanctuary AI, Unitree Robotics, Fourier Intelligence, and XPeng Robotics, to jointly advance the GR00T project.
Meanwhile, at the conference, NVIDIA also released Jetson Thor, a computing platform designed specifically for humanoid robots that supports parallel execution of multimodal AI models (such as vision, speech, and motion planning). Its Blackwell-architecture GPU includes a Transformer engine with native support for FP4 (4-bit floating-point) and FP8 (8-bit floating-point) operations, significantly reducing the inference power consumption and latency of large Transformer models (such as GPT and BERT). The GPU is also divided into three independent clusters and supports flexible partitioning of computing resources through MIG (Multi-Instance GPU) technology, enabling multi-task parallelism and resource isolation.
In terms of CPU and memory, the Thor SoC integrates a 14-core CPU (including automotive-enhanced AE cores) with 2.6 times the performance of the previous generation, strengthening real-time control tasks such as motor drive and sensor fusion. Memory capacity doubles to 128GB, with 273GB/s of bandwidth, supporting local loading of ultra-large models and high-speed data throughput.
Furthermore, Jetson Thor also integrates a functional safety island and various traditional accelerators, such as an ISP, video codecs, a programmable vision accelerator (PVA), and an optical flow accelerator (OFA), together with the Vision Programming Interface (VPI), a unified development framework that spans these accelerators and simplifies design and integration work.
NVIDIA Thor SoC Chip Architecture Block Diagram (Image Source: NVIDIA)
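A back-of-envelope calculation illustrates why the memory figures matter for running large models on-robot: decoding one token from a memory-bandwidth-bound model requires streaming its weights once, which caps throughput. The numbers below are illustrative arithmetic, not an NVIDIA benchmark.

```python
# Back-of-envelope, bandwidth-bound decoding estimate (illustrative only).
bandwidth_gb_s = 273        # Jetson Thor memory bandwidth
params = 8e9                # assume an 8-billion-parameter model
bytes_per_param_fp8 = 1     # FP8 stores one byte per weight

weights_gb = params * bytes_per_param_fp8 / 1e9        # 8 GB of weights
tokens_per_s = bandwidth_gb_s / weights_gb             # ~34 tokens/s ceiling
print(f"FP8 decode ceiling: ~{tokens_per_s:.0f} tokens/s")
# FP4 halves the bytes per weight, roughly doubling this ceiling, which is
# one reason native low-precision support matters for on-robot inference.
```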
In addition, NVIDIA also announced significant upgrades to the Isaac robot development platform:
1) Launch of new foundation models and related tools
Isaac Manipulator: Built on Isaac ROS, it consists of NVIDIA CUDA acceleration libraries, AI models, and reference workflows for robot developers. Aimed at helping robot software developers build AI robotic arms or manipulators that can perceive, understand, and interact with the environment, it supports functions such as motion planning, object detection, and pose estimation & tracking.
Isaac Perceptor: Built on Isaac ROS, it is a collection of NVIDIA CUDA acceleration libraries, AI models, and reference workflows for developing autonomous mobile robots (AMRs). It supports reliable visual odometry and 3D surround vision for obstacle detection and occupancy mapping, helping AMRs perceive, localize, and operate in unstructured environments such as warehouses, factories, and outdoor settings.
Working Principle of NVIDIA Isaac Perceptor (Image Source: NVIDIA)
2) Enhanced Simulation Capabilities
Isaac Lab: A lightweight open-source framework built on Isaac Sim, it leverages NVIDIA PhysX and physics-based NVIDIA RTX rendering to deliver high-fidelity physical simulation. It bridges the gap between high-fidelity simulation and perception-based robot training. Meanwhile, it is specifically optimized for robotic learning workflows, aiming to simplify common tasks in robot research such as reinforcement learning, imitation learning, and motion planning.
Architecture Block Diagram of NVIDIA Isaac Lab (Image Source: NVIDIA)
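Isaac Lab registers its tasks with the Gym-style gymnasium interface, so a rollout loop follows the standard reset/step pattern. The sketch below is schematic: the task ID is illustrative, and real Isaac Lab scripts also launch the Omniverse application before creating an environment, so consult the Isaac Lab documentation for the exact entry points.

```python
# Schematic Gym-style rollout; the task ID is illustrative, and real Isaac Lab
# scripts launch the Omniverse app (AppLauncher) before creating environments.
import gymnasium as gym

def collect_rollout(env_id="Isaac-Cartpole-v0", steps=1000):
    env = gym.make(env_id)                   # Isaac Lab tasks register with gymnasium
    obs, info = env.reset()
    for _ in range(steps):
        action = env.action_space.sample()   # replace with a learned policy
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, info = env.reset()
    env.close()
```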
OSMO: A cloud-native workflow orchestration platform designed to scale complex, multi-stage, and multi-container robotic workloads across on-premises, private, and public clouds. It helps users orchestrate, visualize, and manage a range of robot development tasks on the Isaac platform, including generating synthetic data, training models, conducting reinforcement learning, and implementing software-in-the-loop testing for humanoid robots, autonomous mobile robots (AMRs), and industrial manipulators.
2025 Layout in Embodied Intelligence: Open-Source Foundation Model for Humanoid Robots
8. January 2025: NVIDIA Launches Cosmos and the Isaac GR00T Blueprint
In January 2025, at CES, NVIDIA announced the launch of the world foundation model platform Cosmos and the synthetic motion generation tool Isaac GR00T Blueprint.
1) World Foundation Model Platform Cosmos
NVIDIA Cosmos is a generative world foundation model platform. It integrates generative world foundation models (WFMs), an advanced tokenizer (Cosmos Tokenizer), a safety guardrail system (Guardrails), and an accelerated video-processing pipeline (NeMo Curator) to help developers generate large amounts of physics-based synthetic data and reduce reliance on real-world data.
Cosmos can accept prompts from text, images, or videos to generate highly realistic virtual world states, accelerating physical AI development for autonomous driving and robotics. Cosmos also provides a safety protection mechanism to ensure data security and compliance, and developers can fine-tune Cosmos models to create customized AI models for specific application requirements.
The world foundation models are pre-trained on 20 million hours of robotics and autonomous-driving data to generate physics-based world states. The platform includes a series of pre-trained multimodal models that developers can use out of the box for world generation and reasoning, or post-train to build specialized physical AI models.
Cosmos Predict: a general-purpose model that generates virtual world states from multimodal inputs such as text, images, and videos. Built on a Transformer-based architecture, it supports multi-frame generation and can predict intermediate behaviors or motion trajectories given start and end input frames. The model is trained on 9,000 trillion tokens of robotics and autonomous-driving data and is designed specifically for post-training.
Cosmos Transfer: takes structured visual or geometric data as input, such as segmentation maps, depth maps, LiDAR scans, pose-estimation maps, and trajectory maps, and generates controllable, realistic video outputs. It adopts the ControlNet architecture and uses spatiotemporal control maps to dynamically align synthetic and real-world representations, ensuring precise spatial alignment and scene composition. It is a world-to-world transfer model designed to bridge the perceptual gap between simulated and real environments.
Cosmos Reason: a fully customizable multimodal reasoning model with spatiotemporal awareness, built on an understanding of space and time. It uses chain-of-thought reasoning to understand video data and can predict interaction outcomes for response planning. The model is trained in three stages: pre-training, general supervised fine-tuning (SFT), and reinforcement learning (RL), strengthening its abilities to reason, predict, and make response decisions in real-world scenarios.
Application cases of NVIDIA Cosmos include synthetic data generation (SDG), policy model initialization, policy model evaluation, and multi-view generation.
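To picture how these pieces combine in the synthetic data generation (SDG) case, the sketch below chains them into one pipeline. Every callable is a hypothetical placeholder for the corresponding Cosmos component, not the platform's real API.

```python
# Hypothetical SDG pipeline over Cosmos components; every callable passed in
# is a placeholder, not the platform's real API.
def generate_synthetic_clips(prompts, control_maps, *, curate, predict,
                             transfer, guardrails_ok, tokenize):
    clips = []
    for prompt, control in zip(curate(prompts), control_maps):
        rollout = predict(prompt)           # Cosmos Predict: prompt -> world states
        video = transfer(rollout, control)  # Cosmos Transfer: sim maps -> realistic video
        if guardrails_ok(video):            # Guardrails: safety/compliance check
            clips.append(tokenize(video))   # Cosmos Tokenizer: compact representation
    return clips
```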
2) Synthetic Motion Generation Tool: Isaac GR00T Blueprint
The Isaac GR00T Blueprint provides a complete set of solutions, including a robot foundation model, a data pipeline, and a simulation framework. It offers a digital-twin training ground for training general-purpose robots and helps developers generate massive amounts of synthetic motion data for training robots through imitation learning.
9. March 2025: NVIDIA Launches the Humanoid Robot Foundation Model GR00T N1
In March 2025, at the GTC developer conference, NVIDIA launched GR00T N1, the world’s first open-source, customizable general-purpose foundation model for humanoid robots.
GR00T N1 is a vision-language-action (VLA) model with a dual-system architecture. “System 1” is an action module based on a Diffusion Transformer (DiT). It attends to the output tokens of the vision-language model (VLM) through cross-attention and uses embodiment-specific encoders and decoders to handle variable-dimensional states and actions, generating closed-loop motor actions at a high frequency (120Hz).
“System 2” is a reasoning module based on a vision-language model (VLM). Running at 10Hz on an NVIDIA L40 GPU, it processes the robot’s visual perception and language instructions, interpreting the environment to understand task goals.
Both “System 1” and “System 2” are Transformer-based neural networks; they are tightly coupled and jointly optimized during training to achieve efficient collaboration between reasoning and execution.
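The division of labor between the two systems can be sketched in PyTorch-style pseudocode. This reflects the architecture as described above, a 10Hz VLM feeding a 120Hz DiT action head through cross-attention; the module and device objects are placeholders, not the released model's code.

```python
# Schematic dual-system control loop; all modules and devices are placeholders.
import torch

@torch.no_grad()
def control_loop(vlm, dit_action_head, camera, proprio, motors,
                 instruction, steps=1200):
    vlm_tokens = None
    for step in range(steps):
        # System 2 (~10 Hz): the VLM interprets images plus the language
        # instruction into tokens; refreshed every 12 ticks (120 Hz / 10 Hz).
        if step % 12 == 0:
            vlm_tokens = vlm(camera.read(), instruction)

        # System 1 (120 Hz): the DiT denoises an action chunk while
        # cross-attending to the latest VLM tokens and the robot state.
        state = proprio.read()
        actions = dit_action_head(state, vlm_tokens)
        motors.send(actions)
```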
The data used to pre-train the GR00T N1 model includes real-world robot trajectories, synthetic data, and human videos.
The “Data Pyramid” for Robot Foundation Model Training
Collaboration cases: robotics companies such as 1X Technologies, Agility Robotics, Boston Dynamics, and Fourier Intelligence have gained access to GR00T N1, leveraging the foundation model and its supporting toolchain to develop next-generation robot products and put them to work in diverse scenarios.
Conclusion
NVIDIA’s layout in embodied intelligence is reshaping the industrial landscape of embodied robots by positioning the company as both a “driver of underlying computing power” and a “builder of development ecosystems”. Through the high-performance edge computing of the Jetson chip family, combined with the Isaac and Omniverse development platforms and the GR00T general-purpose foundation model, it has constructed a full-stack technical closed loop from hardware to software.
Strategically, NVIDIA has moved early on the humanoid robot track by investing in Figure AI and collaborating with leading enterprises such as Boston Dynamics. Jensen Huang’s vision that “all moving machines will eventually become autonomous” is being steadily realized through deep technical integration. For example, NVIDIA has combined Omniverse digital-twin technology, Cosmos physical world models, and the Isaac Sim simulation framework into a complete physical AI system. This system lets robots complete behavioral validation and capability iteration in virtual environments and then migrate seamlessly from virtual to real-world scenarios, significantly improving the development efficiency and application adaptability of embodied robots.
Commercially, NVIDIA uses hardware standardization and an open software ecosystem to attract upstream and downstream industry collaboration and accelerate the formation of industry standards. From optimizing Amazon’s logistics scenarios and Toyota’s manufacturing scenarios to investing in and cooperating with partners such as OpenAI and Microsoft, NVIDIA is gradually establishing itself as a core infrastructure provider for the era of embodied intelligence through a trinity of computing power, models, and tools.