Market Research Report
Product Code: 1872684
China Autonomous Driving Data Closed Loop Research Report, 2025
This report surveys and analyzes China's automotive industry and summarizes trends related to the autonomous driving data closed loop.
Data Closed-Loop Research: Synthetic Data Accounts for Over 50%, Full-process Automated Toolchain Gradually Implemented
Key Points:
From 2023 to 2025, the share of synthetic data rose from 20%-30% to 50%-60%, making it a core resource for covering long-tail scenarios.
A full-process automated toolchain from collection to deployment is being rolled out step by step, helping to reduce costs and improve efficiency.
Efficient vehicle-cloud collaboration within the integrated data closed loop is a key factor in achieving faster iteration.
The essence of the autonomous driving data closed loop is a cyclic optimization system of "collection-transmission-processing-training-deployment". In 2025, the industry is accelerating from the "0 to 1" stage into a "high-quality, high-efficiency" era, with the main challenges centering on long-tail scenario coverage and cost control. OEMs and Tier 1 suppliers are actively building their own data closed-loop solutions. Efficient data collection, processing, and analysis pipelines let them continuously improve autonomous driving algorithms, significantly enhancing the accuracy and stability of intelligent driving systems.
The efficiency of acquiring high-quality data determines how fast intelligent driving evolves. Current data sources in the automotive field include triggered data uploads from mass-produced vehicles, high-value scenario-specific data gathered by dedicated collection vehicles, engineering practices that reconstruct the physical world from real roadside data, and data synthesis based on world models. The core path to large-scale application of autonomous driving technology is to anchor baseline capability with real data and then break through capability boundaries with synthetic data. From 2023 to 2025, the ratio of real to synthetic data in autonomous driving training sets has shifted markedly, moving from an early real-data-dominated model to a hybrid model with a steadily rising share of synthetic data.
2023: Real data dominates, synthetic data emerges (synthetic share of 20%-30%). Real data is still the mainstay, used mainly for basic scenario training, but coverage of long-tail scenarios is insufficient. For example, Tesla initially relied on real road-test data from more than one million vehicles, yet collecting extreme scenarios (such as pedestrians suddenly crossing in heavy rain) is inefficient. Synthetic data accounts for roughly 20%-30% and mainly supplements long-tail scenarios. Experiments by Applied Intuition show that adding 30% synthetic data featuring frequent cyclists to real data significantly improves the perception model's recognition accuracy (mAP score) for cyclists.
2024: Synthetic data penetrates faster (share rises to 40%-50%). Synthetic data has been upgraded from an "auxiliary tool" to a "core production input"; its penetration rising to 40%-50% marks intelligent driving's entry into a new data-driven paradigm. At the end of 2024, the Shanghai High-level Autonomous Driving Demonstration Zone launched a program of 100 data collection vehicles; under a hybrid model of "real data collection + world-model-generated virtual data", the share of synthetic data approaches 50%. Nvidia DRIVE Sim, for example, generates synthetic data for distant objects (100-350 meters) to address sparse real annotations: after 92,000 synthetic images were added, detection accuracy (F1 score) for vehicles 200 meters away improved by 33%.
2025: Synthetic data takes the lead (share exceeds 50%). The ratio of synthetic to real data moves toward 5:5 or even higher. Academician Wu Hequan has pointed out that 90% of L4/L5 training relies on simulation data, with only 10%-20% real data retained as a "gene pool" to avoid model deviation. As an example of innovative use of synthetic data, Li Auto uses world models to reconstruct historical scenarios and expand them into variants (for example, turning an ordinary intersection into rainy-night or foggy conditions), automatically generating extreme cases for cyclic training; the share of synthetic data at Li Auto exceeds 90%, replacing real-vehicle testing while still verifying reliability.
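As a back-of-the-envelope illustration of this shift, the sketch below (Python) turns the year-by-year shares quoted above into sample counts. The total dataset size and the exact midpoints are assumptions for illustration, not figures from the report.

```python
# Illustrative only: year-by-year training-data mix, using the shares cited above.
# The total sample count below is a made-up example, not a figure from the report.
DATA_MIX = {
    2023: {"real": 0.75, "synthetic": 0.25},  # synthetic roughly 20%-30%; midpoint assumed
    2024: {"real": 0.55, "synthetic": 0.45},  # synthetic roughly 40%-50%
    2025: {"real": 0.45, "synthetic": 0.55},  # synthetic above 50%
}

def split_samples(total_samples: int, year: int) -> dict:
    """Return the real vs. synthetic sample counts implied by the reported mix."""
    return {kind: round(total_samples * share) for kind, share in DATA_MIX[year].items()}

for year in sorted(DATA_MIX):
    print(year, split_samples(10_000_000, year))
```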
According to Lang Xianpeng of Li Auto, in 2023 Li Auto's effective real-vehicle test mileage was about 1.57 million kilometers, at a cost of 18 yuan per kilometer. By the first half of 2025, a cumulative 40 million kilometers had been tested, of which only 20,000 kilometers were real-vehicle testing and 38 million kilometers came from synthetic data, and the average test cost had dropped to 0.5 yuan per kilometer. Test quality is also higher: whole families of scenarios can be derived from a single instance, and complete regression retesting is possible.
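These figures can be cross-checked with simple arithmetic. The short sketch below multiplies the quoted mileage by the quoted per-kilometer costs; it uses only the numbers cited above and no additional assumptions.

```python
# Back-of-the-envelope check of the Li Auto testing-cost figures quoted above.
# 2023: ~1.57 million km of effective real-vehicle testing at ~18 yuan/km.
cost_2023 = 1_570_000 * 18          # ~28.3 million yuan

# H1 2025: ~40 million km total (real + synthetic) at an average of ~0.5 yuan/km.
cost_2025 = 40_000_000 * 0.5        # ~20 million yuan

print(f"2023 real-vehicle testing: ~{cost_2023 / 1e6:.1f} M yuan for 1.57 M km")
print(f"2025 mostly-synthetic mix: ~{cost_2025 / 1e6:.1f} M yuan for 40 M km")
print(f"Cost per km drops roughly {18 / 0.5:.0f}x while mileage grows ~{40_000_000 / 1_570_000:.0f}x")
```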
The advantages of synthetic data lie not only in cost and efficiency but also in a value density that goes beyond human experience. Synthetic data is generated in batches by technical means at extremely low cost, matching the high-frequency training needs of AI; it can also independently generate extreme corner-case scenarios that humans have never experienced but that still comply with physical laws.
The autonomous driving data closed loop has shifted from an early focus on a single link (such as improving annotation efficiency) to an end-to-end automated architecture covering "collection-annotation-training-simulation-deployment". The key breakthrough is using large AI models and cloud-edge collaboration to remove data-flow barriers, so that the closed loop can evolve on its own.
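To make the five-stage loop concrete, here is a minimal, hypothetical sketch of how such a pipeline could be wired together; the stage functions and their names are placeholders, not any vendor's actual toolchain API.

```python
# Minimal sketch of an end-to-end data closed-loop pipeline (hypothetical stage
# functions; not any specific vendor's toolchain).
from typing import Callable, List

Stage = Callable[[dict], dict]

def collect(batch: dict) -> dict:
    batch["clips"] = ["clip_001", "clip_002"]          # triggered uploads, drive logs, etc.
    return batch

def annotate(batch: dict) -> dict:
    batch["labels"] = {c: "auto-labeled" for c in batch["clips"]}  # AI pre-annotation + QA
    return batch

def train(batch: dict) -> dict:
    batch["model"] = "model_v2"                        # retrain / fine-tune on new data
    return batch

def simulate(batch: dict) -> dict:
    batch["sim_pass"] = True                           # regression on long-tail scenarios
    return batch

def deploy(batch: dict) -> dict:
    if batch.get("sim_pass"):
        batch["ota"] = batch["model"]                  # OTA release back to the fleet
    return batch

PIPELINE: List[Stage] = [collect, annotate, train, simulate, deploy]

def run_closed_loop(batch: dict) -> dict:
    for stage in PIPELINE:
        batch = stage(batch)
    return batch

print(run_closed_loop({}))
```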
LiangDao Intelligence's LD Data Factory is a full-link 4D ground-truth solution spanning collection to delivery. The LD Data Factory toolchain has been delivered to more than a dozen automotive OEMs and Tier 1 suppliers in China, Germany, and Japan. Its automated 4D annotation software has automatically annotated more than 3,300 hours of road-collected data for customers, producing high-quality 4D continuous-frame ground truth; by mid-2025, LiangDao Intelligence had delivered more than 55 million frames of data to a well-known German luxury car brand.
LD Data Factory integrates data collection, automated annotation, manual annotation, quality control, and performance evaluation. The toolchain includes AI preprocessing and VLM-assisted collection, an automated annotation module for target detection, a fully closed loop of automatic quality inspection, and both hybrid-cloud and private deployment. It covers several core modules and coordinates data management and task collaboration through a unified data management platform: time synchronization and spatial calibration, distributed storage and indexing services, the visual annotation platform LDEditor (full-stack annotation), the automated quality-control module LD Validator, and the perception performance evaluation module LD KPI.
MindFlow's main products currently include an integrated data annotation platform, a data management platform (including a vector database), and a model training platform, covering the entire value chain from raw data to model deployment. Users can complete the whole algorithm development process in one place without switching between tools or platforms, redefining the paradigm of AI data services. Technical highlights of the third-generation MindFlow SEED platform include support for 4D point-cloud annotation (lane lines, segmentation), RPA-automated workflows, and AI pre-annotation covering more than 4,000 functional modules.
MindFlow's customers currently include SAIC Group, Changan Automobile, Great Wall Motors, Geely Automobile, FAW Group, Li Auto, Huawei, Bosch, ECARX, MAXIEYE, NavInfo, and RoboSense.
The essence of the vehicle-cloud integrated data closed loop is to build a collaborative system of "lightweight vehicle side + intelligent cloud side", remove data-flow barriers, and enable the continuous evolution of intelligent vehicles. The vehicle side collects environmental perception data in real time (road conditions, vehicle operating data, and so on) and uploads it to the cloud after de-identification, encryption, and compression. The cloud processes massive volumes of data (PB/EB scale), performs annotation, model training, and algorithm optimization, generates new capabilities, and distributes them to the vehicle side via OTA upgrades.
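As a hedged illustration of the vehicle-side preprocessing described above (de-identification, compression, encryption before upload), the sketch below uses only the Python standard library; the field names are invented and the encryption step is a stub standing in for a real crypto stack.

```python
# Hypothetical sketch of vehicle-side preprocessing before upload to the cloud:
# de-identify, compress, then (in a real system) encrypt. Field names are
# illustrative; standard-library only, with encryption left as a stub.
import hashlib
import json
import zlib

def deidentify(record: dict) -> dict:
    """Replace direct identifiers (e.g. the VIN) with a salted hash."""
    out = dict(record)
    out["vin"] = hashlib.sha256(("salt:" + record["vin"]).encode()).hexdigest()[:16]
    return out

def compress(record: dict) -> bytes:
    """Serialize and compress the record to cut transmission cost."""
    return zlib.compress(json.dumps(record).encode())

def encrypt(payload: bytes) -> bytes:
    """Placeholder: a production system would use TLS and/or payload encryption."""
    return payload

raw = {"vin": "LSVXX1234567890", "speed_kph": 62.5, "event": "hard_brake"}
upload_blob = encrypt(compress(deidentify(raw)))
print(f"{len(json.dumps(raw))} bytes raw -> {len(upload_blob)} bytes after de-identification and compression")
```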
The ExceedData data closed-loop solution is a vehicle-cloud integrated solution that has won mass-production adoption by more than 15 automotive OEMs and is deployed in more than 30 mainstream models.
The ExceedData solution comprises the vehicle-side edge computing engine (vCompute), edge data engine (vADS), and edge database (vData), together with the cloud-side algorithm development tool (vStudio), cloud computing engine (vAnalyze), and cloud management platform (vCloud). The solution can cut data transmission costs by 75%, cloud storage costs by 90%, and cloud computing costs by 33%; according to the calculation for an OEM case working with ExceedData, total cost can be reduced by about 85%.
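The component-level savings quoted above can only be blended into a single total if a baseline cost split is assumed, which the report does not provide. The sketch below applies the stated reductions (75% transmission, 90% storage, 33% computing) to a purely hypothetical baseline to show how such a blended figure would be derived.

```python
# Hypothetical arithmetic: apply the reported component reductions to an assumed
# baseline cost split (the split itself is NOT from the report).
baseline = {"transmission": 40.0, "storage": 35.0, "computing": 25.0}   # assumed cost units
reduction = {"transmission": 0.75, "storage": 0.90, "computing": 0.33}  # from the report

after = {k: baseline[k] * (1 - reduction[k]) for k in baseline}
total_before = sum(baseline.values())
total_after = sum(after.values())
print(f"blended reduction: {1 - total_after / total_before:.0%}")
# The ~85% total quoted for the OEM case depends on that customer's actual cost
# structure; with the assumed split above the blended figure comes out lower.
```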
Among OEMs, take Xpeng Motors as an example: its self-built "cloud-side model factory" has a computing power reserve of 10 EFLOPS in 2025, and the end-to-end iteration cycle has been shortened to an average of five days, supporting a rapid closed loop from cloud-side pre-training to vehicle-side model deployment.
Xpeng launched China's first 72-billion-parameter multimodal world foundation model for L4 high-level autonomous driving, which has chain-of-thought (CoT) reasoning capabilities and can simulate human common-sense reasoning and generate control signals. Through model distillation, the base model's capabilities are transferred to a small vehicle-side model, enabling "small size, high intelligence" deployment.
High-value data (such as corner cases) is first screened by a vehicle-side rule engine. The cloud then applies synthetic data generation techniques (such as GANs and diffusion models) to fill data gaps and improve model generalization. Meanwhile, end-to-end (E2E) and VLA models fuse multimodal inputs to output control commands directly, relying on cloud-side large-model training (such as Xpeng's 72-billion-parameter base model) to achieve lightweight deployment on the vehicle side.
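A vehicle-side rule engine of the kind described above might look like the hypothetical sketch below; the trigger conditions (hard braking, driver takeover, low detection confidence) and their thresholds are illustrative examples, not any OEM's disclosed rule set.

```python
# Hypothetical vehicle-side trigger rules for flagging high-value (corner-case)
# clips for upload; signal names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class FrameSignals:
    decel_mps2: float          # longitudinal deceleration
    driver_takeover: bool      # human overrode the system
    min_det_confidence: float  # weakest perception detection in the frame

TRIGGERS = [
    ("hard_brake",     lambda s: s.decel_mps2 > 4.0),
    ("takeover",       lambda s: s.driver_takeover),
    ("low_confidence", lambda s: s.min_det_confidence < 0.3),
]

def matched_triggers(signals: FrameSignals) -> list:
    """Return the names of all trigger rules this frame satisfies."""
    return [name for name, rule in TRIGGERS if rule(signals)]

frame = FrameSignals(decel_mps2=5.2, driver_takeover=False, min_det_confidence=0.8)
hits = matched_triggers(frame)
if hits:
    print("flag clip for upload:", hits)   # a real system would enqueue the surrounding clip
```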
As the entire intelligent driving system becomes fully model-driven, automakers are pursuing "lower cost, higher efficiency, and more stable service" in the data closed loop. The delivery model for intelligent driving is accelerating from shipping code for single-vehicle deployment toward subscription-based cloud services. An efficiently collaborating vehicle-cloud integrated data closed loop is the key for intelligent vehicles to iterate faster under AI.
Glossary