# Summing up other stories surrounding DeepSeek.
I can't clear the backlog on my electronic newspaper column, and I don't see a way to organize this material and submit it there anyway, so I'll post my unpolished thoughts here.
I've lost count of how many times I've talked about DeepSeek since the start of the Lunar New Year holiday. After six interviews, two internal meetings, and two consultations over the holiday and the following week, I now want to run away at the mere mention of the word DeepSeek. Developers, reporters, policy managers, and colleagues all have questions about how the AI industry will change and what directions we should consider, and as I answered them, the questions kept overlapping and condensing into a few strands. Now that the frenzy has passed, I'd like to summarize those questions (and secure a link I can share instead of repeating the same content).
The four main questions are as follows.
1. So is it really cheap?
2. What is the connection with the U.S.-China conflict?
3. What should we do?
4. What happens to OpenAI and NVIDIA?
The answers below summarize conversations that ran anywhere from 15 minutes to two hours. Many others have already shared their knowledge and opinions, so I leave out everything that is widely known by now and try, where possible, to use language familiar to programmers and non-researchers.
## Is DeepSeek Really Cheap? Features of China's AI Market
Now for question 1: so is it really cheap? In short, it is cheap, yet expensive. This is the result of a combination of characteristics: China's culture of digging into AI at the C++ level, its NPU experiments, and its drive toward self-reliance. I vividly remember giving a presentation in Nanjing as a Google Developers Expert in 2017. A one-hour talk was followed by two and a half hours of Q&A, and there was not a single TensorFlow question; everything was about C++. It was also a time when the TensorFlow and PyTorch sites were simply inaccessible from within China. One Chinese characteristic is a certain distance from framework dependence. China has tried many things, such as porting C++-based deep learning code and models onto NPUs, or running models in JavaScript, as its IT market skipped the desktop era and jumped straight to mobile. Another characteristic is the attempt at self-reliance: in the course of moving away from TensorFlow and PyTorch ("de-TensorFlow," "de-PyTorch"), each company built its own framework, such as Alibaba's XDL, Tencent's TNN and Mariana, and Baidu's PaddlePaddle. As a result, deep learning curricula at Chinese universities often did not teach only PyTorch or TensorFlow but also covered these homegrown frameworks. Perhaps because of this, Chinese engineers have no qualms about building and using new things themselves.
In addition, DeepSeek was founded by a team with a high-frequency trading (HFT) background, which is how a cheap-but-expensive model like DeepSeek-V3 came about. In HFT it is common to rewrite the network stack outright, because shaving latency is a matter of life and death. Building on that background, the DeepSeek team has, since its HFT days, developed techniques such as driving devices without NCCL or NVLink, reserving some of a GPU's SMs to accelerate networking, and cutting GPU costs by bypassing error-correction routines and general-purpose communication standards at the packet level. (In HFT, stripping out every removable layer, bypassing the Windows or Linux network stack entirely, is a standard optimization.) Since their AI was built by people who had been doing exactly this, they repurposed about 20 of the H800 GPU's 132 SMs for the compression/decompression work that accelerates server-to-server communication between GPUs, and developed DualPipe, which sends and receives data over InfiniBand in the background while the GPUs compute. This hardware-level tuning reduces all-to-all communication overhead to nearly zero and is said to have pushed GPU utilization to nearly 100%, more than quadrupling effective throughput with fewer GPUs, unlike the usual situation where something like 75% of GPU time is wasted waiting on communication.
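To make the overlap idea concrete, here is a minimal PyTorch sketch (my own toy illustration, not DeepSeek's implementation, which works at the SM and InfiniBand level) of the core trick behind DualPipe-style scheduling: launch transfers on a separate CUDA stream so the default stream can keep computing. It assumes a single CUDA GPU and uses a device-to-host copy as a stand-in for the real all-to-all traffic.

```python
import torch

comm_stream = torch.cuda.Stream()  # dedicated stream for "communication"

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
send_buf = torch.randn(4096, 4096, device="cuda")
# Pinned host memory is required for the copy to run truly asynchronously.
host_buf = torch.empty(4096, 4096, pin_memory=True)

for step in range(8):
    # Ensure the comm stream sees the latest contents of send_buf.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        # Stand-in for a background all-to-all / InfiniBand transfer.
        host_buf.copy_(send_buf, non_blocking=True)
    # Meanwhile, compute proceeds concurrently on the default stream.
    x = torch.relu(x @ w)
    # Block the default stream only when the transfer result is needed.
    torch.cuda.current_stream().wait_stream(comm_stream)
```

The design point is that synchronization happens as late as possible: as long as the compute stream has enough independent work, the transfer cost disappears behind it, which is exactly the effect the DualPipe description above is after.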
Beyond the network, there are many computationally interesting parts: the forward and backward passes are unified into NVIDIA's FP8 format (E4M3), and the accumulated error from the low precision is corrected by periodically promoting to higher precision (TF32), roughly every four accumulations; forward and backward computation are overlapped to minimize GPU idle time; and memory usage is reduced by recomputing intermediate results on demand instead of storing them. As a result, DeepSeek was able to cut the GPU training cost to around KRW 8 billion without needing GPU interconnect infrastructure. Of course, behind the scenes (if the rumors are true) there are already fixed costs of around KRW 2 trillion.
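The periodic-promotion trick is easy to demonstrate on its own. Below is a minimal sketch using bfloat16 as a stand-in for FP8 E4M3 (native FP8 needs Hopper-class hardware): partial sums accumulate in low precision and are flushed into an FP32 total every few additions. The dtypes, interval, and function name here are illustrative assumptions, not DeepSeek's actual kernel.

```python
import torch

def chunked_accumulate(values: torch.Tensor, promote_every: int = 4) -> torch.Tensor:
    """Sum in low precision, flushing into an FP32 total every few additions."""
    total = torch.zeros((), dtype=torch.float32)     # high-precision running total
    partial = torch.zeros((), dtype=torch.bfloat16)  # low-precision partial sum
    for i, v in enumerate(values.to(torch.bfloat16), start=1):
        partial = partial + v
        if i % promote_every == 0:                   # periodic promotion step
            total = total + partial.to(torch.float32)
            partial = torch.zeros((), dtype=torch.bfloat16)
    return total + partial.to(torch.float32)

vals = torch.rand(4096)  # thousands of small addends: worst case for low precision

# Pure low-precision accumulation: once the total grows, small addends vanish.
naive = torch.zeros((), dtype=torch.bfloat16)
for v in vals.to(torch.bfloat16):
    naive = naive + v

print(f"naive bf16 sum:    {float(naive):8.1f}")  # stalls far below the true sum
print(f"chunked promotion: {float(chunked_accumulate(vals)):8.1f}")
print(f"fp32 reference:    {float(vals.sum()):8.1f}")
```

The recomputation trick mentioned in the same paragraph is likewise available off the shelf: in PyTorch, torch.utils.checkpoint drops intermediate activations during the forward pass and recomputes them in the backward pass, trading compute for memory in exactly the way described.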
## The Meaning of DeepSeek in the U.S.-China Conflict
The second question is about the U.S.-China conflict. DeepSeek is often treated as if it dropped out of the sky, but in fact it comes from a market that grew up under heavy regulation. At one point, more than 200 foundation models had poured out of over 300 LLM companies. Most of them claimed to have built foundation models when they had really just fine-tuned and rebranded Llama, so the Chinese government introduced an AI licensing system, much like its game licensing system, approved only about ten companies, and cleared out the rest. This mirrors the development of China's electric vehicle and battery markets: the companies that survive the brutal domestic competition come out strong and expand abroad.
The AI LLM companies that survived this process went through a level of survival competition unlike anything faced by AI startups in the United States or Europe. The way they fire back immediately whenever a rival announces a model resembles the competition between OpenAI and Google. In the case of ByteDance, more than 100,000 GPUs are viewed