Market Research Report
Product Code: 1482384
End-to-end Autonomous Driving (E2E AD) Research Report, 2024: China's E2E AD Industry
This report investigates and analyzes China's end-to-end autonomous driving (E2E AD) industry, summarizing the status quo of autonomous driving, development trends, application examples, and related information.
End-to-end Autonomous Driving Research: Status Quo of End-to-end (E2E) Autonomous Driving
An end-to-end autonomous driving system maps sensor data inputs (camera images, LiDAR, etc.) directly to control command outputs (steering, acceleration/deceleration, etc.). The idea first appeared in the ALVINN project in 1988, which used a camera and a laser rangefinder as inputs to a simple neural network that produced steering commands as output.
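As a rough sketch of that direct sensor-to-control mapping (the module name, layer sizes and the use of PyTorch below are assumptions for illustration, not ALVINN's actual design):

    # Minimal sketch of the end-to-end idea: sensor frames in, control commands out.
    import torch
    import torch.nn as nn

    class EndToEndDriver(nn.Module):
        """Maps a camera frame directly to steering and acceleration commands."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(              # visual feature extractor
                nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Sequential(                 # control command regressor
                nn.Linear(32, 64), nn.ReLU(),
                nn.Linear(64, 2),                      # [steering, acceleration]
            )

        def forward(self, frame: torch.Tensor) -> torch.Tensor:
            return self.head(self.encoder(frame))

    # One camera frame (batch, channels, height, width) -> one control command.
    commands = EndToEndDriver()(torch.rand(1, 3, 224, 224))
    print(commands.shape)  # torch.Size([1, 2])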
In early 2024, Tesla rolled out FSD V12.3, demonstrating a markedly improved level of intelligent driving. This end-to-end autonomous driving solution has garnered widespread attention from OEMs and autonomous driving solution companies in China.
Compared with conventional multi-module solutions, an end-to-end autonomous driving solution integrates perception, prediction and planning into a single model, simplifying the solution structure. It can imitate a human driver making driving decisions directly from visual inputs, handle long-tail scenarios more effectively than modular solutions, and improve model training efficiency and performance.
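A schematic contrast between the two structures, using trivial stand-in functions; every name below is hypothetical and serves only to show where the hand-defined interfaces disappear:

    # Schematic contrast between the two solution structures described above.
    def perceive(sensor_data):  return {"objects": sensor_data["detections"]}
    def predict(scene):         return {"future_tracks": scene["objects"]}
    def plan(scene, futures):   return {"trajectory": futures["future_tracks"]}
    def control(trajectory):    return {"steering": 0.0, "acceleration": 0.0}

    # Conventional multi-module pipeline: hand-defined interfaces between stages,
    # with each stage developed and tuned separately.
    def modular_pipeline(sensor_data):
        scene = perceive(sensor_data)
        futures = predict(scene)
        trajectory = plan(scene, futures)
        return control(trajectory)

    # End-to-end pipeline: a single trained model maps sensor data directly to
    # the driving decision, so the intermediate interfaces are learned.
    def end_to_end_pipeline(sensor_data, model):
        return model(sensor_data)

    print(modular_pipeline({"detections": []}))
    print(end_to_end_pipeline({"detections": []}, lambda x: {"steering": 0.0, "acceleration": 0.0}))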
Li Auto's end-to-end solution
Li Auto believes that a complete end-to-end model should cover the whole process of perception, tracking, prediction, decision and planning, and that it is the optimal route to L3 autonomous driving. In 2023, Li Auto rolled out AD Max 3.0, whose overall framework reflects the end-to-end concept but still falls short of a complete end-to-end solution. In 2024, Li Auto is expected to upgrade the system into a complete end-to-end solution.
Li Auto's autonomous driving framework consists of the two systems outlined below (a brief sketch of how such a split might dispatch follows the list):
Fast system: System 1, Li Auto's existing end-to-end solution, which perceives the surroundings and then executes directly.
Slow system: System 2, a multimodal large language model that reasons logically and explores unknown environments to solve problems in unfamiliar L4 scenarios.
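A minimal sketch of how such a fast/slow split might dispatch between the two paths; the confidence-based routing rule, threshold and return values are assumptions for illustration, not Li Auto's implementation:

    # Hypothetical System 1 / System 2 dispatch, for illustration only.
    def system1_end_to_end(sensor_data):
        """Fast path: end-to-end model reacts directly to the perceived scene."""
        return {"action": "keep_lane", "confidence": 0.97}

    def system2_multimodal_llm(sensor_data, scene_description):
        """Slow path: multimodal LLM reasons about an unfamiliar scene."""
        return {"action": "slow_down_and_yield", "rationale": scene_description}

    def drive(sensor_data, scene_description, confidence_threshold=0.8):
        # System 1 handles familiar scenes; low confidence hands over to System 2.
        fast = system1_end_to_end(sensor_data)
        if fast["confidence"] >= confidence_threshold:
            return fast
        return system2_multimodal_llm(sensor_data, scene_description)

    print(drive(sensor_data={}, scene_description="unprotected left turn with occlusion"))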
While promoting the end-to-end solution, Li Auto plans to unify the planning/prediction model with the perception model and to build an end-to-end Temporal Planner on top of its existing stack, integrating parking with driving.
Implementing an end-to-end solution involves R&D team building, hardware facilities, data collection and processing, algorithm training and strategy customization, verification and evaluation, promotion, and mass production. Some of the pain points in these scenarios are shown in the table:
The integrated training of an end-to-end autonomous driving solution requires massive amounts of data, so one of its main difficulties lies in data collection and processing.
First of all, collecting the data takes a long time and many channels; it includes driving data and scenario data such as roads, weather and traffic conditions. In actual driving, data within the driver's forward view is relatively easy to collect, but information about the surrounding environment is much harder to obtain.
During data processing, it is necessary to design data extraction dimensions, extract effective features from massive volumes of video clips, and compile data distribution statistics to support large-scale training.
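A minimal sketch of that processing step, assuming a few hypothetical extraction dimensions (weather, road type, maneuver) and computing their distribution statistics:

    # Tag each clip along a few extraction dimensions and compile statistics.
    # The dimensions and field names are hypothetical examples.
    from collections import Counter

    clips = [
        {"weather": "rain",  "road": "urban",   "maneuver": "lane_change"},
        {"weather": "clear", "road": "highway", "maneuver": "cut_in"},
        {"weather": "clear", "road": "urban",   "maneuver": "lane_change"},
    ]

    # Distribution statistics per dimension, used to check whether the training
    # set covers long-tail scenarios adequately.
    for dimension in ("weather", "road", "maneuver"):
        counts = Counter(clip[dimension] for clip in clips)
        total = sum(counts.values())
        print(dimension, {k: round(v / total, 2) for k, v in counts.items()})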
DeepRoute
As of March 2024, DeepRoute.ai's end-to-end autonomous driving solution had been designated by Great Wall Motor and was part of a cooperation with NVIDIA; it is expected to be adapted to NVIDIA Thor in 2025. In DeepRoute.ai's roadmap, the transition from the conventional solution to an "end-to-end" autonomous driving solution goes through sensor pre-fusion, HD map removal, and the integration of perception, decision and control.
GigaStudio
DriveDreamer, GigaStudio's autonomous driving model, is capable of scenario generation, data generation, driving action prediction and so forth. Scenario/data generation proceeds in two steps (a conceptual sketch follows the list):
First, single-frame structural conditions guide DriveDreamer to generate driving scenario images, so that it can easily understand structured traffic constraints.
Second, this understanding is extended to video generation: conditioned on consecutive traffic structure frames, DriveDreamer outputs driving scene videos, further enhancing its understanding of motion transformation.
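A conceptual sketch of this two-step conditioning flow; the function names and return values are hypothetical placeholders rather than DriveDreamer's actual interface:

    # Step 1: one structured frame (e.g. map layout, boxes) -> one scene image.
    def generate_scenario_image(structural_condition):
        return {"image": f"rendered scene for {structural_condition}"}

    # Step 2: a sequence of structured frames -> a driving scene video.
    def generate_scenario_video(structural_conditions):
        frames = [generate_scenario_image(c)["image"] for c in structural_conditions]
        return {"video": frames}

    # Single-frame conditioning first, then temporal conditioning for video.
    print(generate_scenario_image("t0: two lanes, lead vehicle at 30 m"))
    print(generate_scenario_video(["t0: lead at 30 m", "t1: lead at 28 m", "t2: lead at 25 m"]))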
In addition to autonomous vehicles, embodied robots are another mainstream application scenario for end-to-end solutions. Moving from end-to-end autonomous driving to robots requires building a more universal world model that can adapt to more complex and diverse real-world application scenarios. The mainstream AGI (Artificial General Intelligence) development framework is divided into two stages:
Stage 1: the understanding and generation capabilities of foundation models are unified and further combined with embodied AI to form a unified world model;
Stage 2: the world model, together with capabilities for complex task planning and control and for abstract concept induction, gradually evolves into the era of interactive AGI 1.0.
In the implementation of the world model, building an end-to-end VLA (Vision-Language-Action) autonomous system has become a crucial link. As a foundation model for embodied AI, VLA seamlessly links 3D perception, reasoning and action to form a generative world model; it is built on a 3D-based large language model (LLM) and introduces a set of interaction tokens to interact with the environment.
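One heavily simplified way to picture such a sequence; the token names and layout below are assumptions for illustration, not any specific model's format:

    # Hypothetical VLA-style sequence interleaving 3D perception, language and
    # an interaction token that triggers action decoding instead of text.
    SCENE_TOKEN = "<scene_3d>"     # placeholder for encoded 3D perception features
    ACT_TOKEN = "<act>"            # interaction token handled by an action head

    def build_vla_sequence(instruction, scene_features):
        # The LLM consumes scene tokens plus the instruction; at <act> the model
        # decodes an action (e.g. a trajectory or gripper command) rather than text.
        return [SCENE_TOKEN, scene_features, instruction, ACT_TOKEN]

    print(build_vla_sequence("pick up the red cup on the table", scene_features="[3D patch embeddings]"))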
As of April 2024, some manufacturers of humanoid robots adopting end-to-end solutions are as follows:
For example, Udeer*AI's Large Physical Language Model (LPLM) is an end-to-end embodied AI solution. It uses a self-labeling mechanism to improve the efficiency and quality of the model's learning from unlabeled data, thereby deepening its understanding of the world and enhancing the robot's generalization and environmental adaptability across modalities, scenes and industries.
LPLM abstracts the physical world and aligns this information with the level of feature abstraction in the LLM. It explicitly models each entity in the physical world as a token that encodes geometric, semantic, kinematic and intent information.
In addition, LPLM adds 3D grounding to the encoding of natural language instructions, improving the accuracy of natural language understanding to some extent. Its decoder learns by continually predicting the future, strengthening the model's ability to learn from massive amounts of unlabeled data.
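A minimal sketch of the entity-as-token idea; the data structure and example values are hypothetical illustrations, not Udeer*AI's implementation:

    # Each physical entity carries geometric, semantic, kinematic and intent
    # information, here packed into one illustrative token record.
    from dataclasses import dataclass

    @dataclass
    class EntityToken:
        geometry: tuple      # position and size in 3D, e.g. (x, y, z, l, w, h)
        semantics: str       # object class or description
        kinematics: tuple    # velocity vector
        intent: str          # inferred short-term intention

    pedestrian = EntityToken(
        geometry=(12.0, -1.5, 0.0, 0.5, 0.5, 1.7),
        semantics="pedestrian",
        kinematics=(0.8, 0.3, 0.0),
        intent="crossing toward curb",
    )
    print(pedestrian)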