OpenAI blames one of the longest outages in its history on a "new telemetry service" gone awry.
On Wednesday, OpenAI's chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced major disruptions starting around 3 p.m. Pacific. OpenAI acknowledged the problem shortly after and began working on a fix, but it would take the company around three hours to restore all of its services.
In a post-mortem published late Thursday, OpenAI wrote that the outage wasn't caused by a security incident or a recent product launch, but by a telemetry service it deployed on Wednesday to collect Kubernetes metrics. Kubernetes is an open source program that helps manage containers, the packages of applications and related files used to run software in isolated environments.
"Telemetry services have a very wide footprint, so this new service's configuration unintentionally caused … resource-intensive Kubernetes API operations," OpenAI wrote in the post-mortem. "(Our) Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large (Kubernetes) clusters."
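To make that concrete, here is a rough, hypothetical sketch, in Python with the official Kubernetes client, of what a wide-footprint collector can look like: an agent running on every node that repeatedly asks the API servers for cluster-wide state. This is not OpenAI's actual service; it only illustrates why that access pattern becomes expensive for the API servers at scale.

```python
# Hypothetical telemetry agent: if a copy runs on every node and each copy
# makes cluster-wide Kubernetes API calls on a short interval, the API
# servers absorb all of that load. Illustrative only, not OpenAI's service.
import time

from kubernetes import client, config


def collect_cluster_metrics(api: client.CoreV1Api) -> dict:
    # Listing every pod in every namespace is an expensive, cluster-wide query.
    pods = api.list_pod_for_all_namespaces(watch=False)
    nodes = api.list_node()
    return {
        "pod_count": len(pods.items),
        "node_count": len(nodes.items),
    }


if __name__ == "__main__":
    config.load_incluster_config()  # assumes the agent runs inside the cluster
    v1 = client.CoreV1Api()
    while True:
        metrics = collect_cluster_metrics(v1)
        print(metrics)  # a real agent would export these to a metrics backend
        # Multiply this loop by thousands of nodes and the control plane
        # takes the full brunt of the traffic.
        time.sleep(15)
```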
That's a lot of jargon, but basically, the new telemetry service affected OpenAI's Kubernetes operations, including a component that many of the company's services rely on for DNS resolution. DNS resolution converts domain names into IP addresses; it's the reason you can type "Google.com" instead of "142.250.191.78."
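For a simplified picture of what resolution does, the snippet below uses Python's standard library to look up an address; the domain and the printed result are just illustrative.

```python
# Minimal illustration of DNS resolution: turning a human-readable domain
# name into the IP address that machines actually connect to.
import socket

ip = socket.gethostbyname("google.com")
print(ip)  # e.g. 142.250.191.78, depending on where you resolve from
```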
OpenAI's use of DNS caching, which holds information about previously looked-up domain names (like website addresses) and their corresponding IP addresses, complicated matters by "delay(ing) visibility," OpenAI wrote, and "allow(ing) the rollout (of the telemetry service) to continue before the full scope of the problem was understood."
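The snippet below is a loose sketch of why caching can mask a problem like this: as long as a cached record hasn't expired, lookups keep succeeding even if the underlying resolver is already broken, so failures surface only once the cached entries age out. The cache structure and TTL here are assumptions for illustration, not OpenAI's setup.

```python
# Rough sketch of how DNS caching delays visibility: cached answers keep
# resolving after the upstream resolver breaks, until the records expire.
import socket
import time

_cache: dict[str, tuple[str, float]] = {}  # name -> (ip, expiry timestamp)
TTL_SECONDS = 300  # cached records stay valid for five minutes


def resolve(name: str) -> str:
    entry = _cache.get(name)
    if entry and entry[1] > time.time():
        return entry[0]  # serve the cached answer, even if upstream DNS is down
    ip = socket.gethostbyname(name)  # only hits real DNS on a cache miss
    _cache[name] = (ip, time.time() + TTL_SECONDS)
    return ip
```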
OpenAI says it was able to detect the problem "a few minutes" before customers ultimately started seeing an impact, but that it wasn't able to quickly implement a fix because it had to work around the overwhelmed Kubernetes servers.
"This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways," the company wrote. "Our tests didn't catch the impact the change was having on the Kubernetes control plane (and) remediation was very slow because of the locked out effect."
OpenAI says it will adopt several measures to prevent similar incidents from happening in the future, including improvements to phased rollouts with better monitoring of infrastructure changes, and new mechanisms to ensure OpenAI engineers can access the company's Kubernetes API servers in any circumstances.
"We apologize for the impact that this incident caused to all of our customers – from ChatGPT users to developers to businesses who rely on OpenAI products," OpenAI wrote. "We fell short of our own expectations."