
2023-11-29 - From Cost Reduction and Increased Laughter to Real Cost Reduction and Efficiency Improvement - Huxiu Network

From Cost Reduction to Real Efficiency Improvement - Huxiu#

#Omnivore

Highlights#

Complexity is a cost, and therefore simplicity should be a key goal when building systems. ⤴️ ^571189de

"Intellectual power" is another important issue. Intellectual power is difficult to accumulate spatially — the intellectual power of a team often depends on the level of a few senior key figures and their communication costs. For example, when there is a problem with the database, a database expert is needed to solve it; when there is a problem with Kubernetes, a K8S expert is needed to look into it;

However, when you put the database into Kubernetes, it is difficult to combine the intellectual bandwidth of a separate database expert and a K8S expert — you need a dual-skilled expert to solve the problem. The same applies to the circular dependency of authentication services and object storage — you need engineers who are familiar with both. Using two independent experts is not impossible, but the synergy gain between them can easily be diminished by the squared growth of communication costs, leading to negative returns, which is why more people can become less effective during failures. ⤴️ ^8a2f1e42

However, the tacit knowledge of the organization dissipates as experienced personnel leave, and after a certain degree of loss, the system is already on the verge of collapse — it just takes a certain opportunity to trigger an explosion. ⤴️ ^1ec94f76

The third point is management philosophy and insight. For example, building stability requires an investment of ten million, and there will always be opportunistic teams claiming: we only need five million or less — and then they might do nothing, betting that no problems will arise; if they win the bet, they profit for free, and if they lose, they walk away. But it is also possible that this team has the real ability to reduce costs with technology, but how many people in leadership positions have enough insight to truly discern this? ⤴️ ^d92fc421

Isn't this just like the merchant's tip in Spice and Wolf about what will sell well: essentially nonsense, but if it happens to hit, you share in the profits, and if it misses, it costs you nothing?

The issue of lack of security ⤴️ ^b97b5eb8

Mencius said: "If a ruler treats his ministers as his hands and feet, the ministers will regard the ruler as their heart and belly; if a ruler treats his ministers as dogs and horses, the ministers will regard the ruler as any other man; if a ruler treats his ministers as dirt and weeds, the ministers will regard the ruler as a robber and an enemy." ⤴️ ^9aad170b

From Cost Reduction to Real Efficiency Improvement#

This article discusses the cost-cutting and efficiency drives at major internet companies and points out how "cost reduction and efficiency improvement" has turned into "cost reduction and increased laughter." The author argues that genuine efficiency improvement requires reducing the cost of system complexity and increasing management effectiveness, yet many companies do poorly on both fronts, leading to frequent service failures. The article calls for effective regulation of internet platforms and stresses the importance of genuine cost reduction and efficiency improvement.

• 💡 Cost reduction and efficiency improvement require reducing the cost of system complexity and increasing management effectiveness.

• 💡 Complexity is a cost, and keeping systems simple should be a key goal when building them.

• 💡 Poor management level and philosophy are one of the reasons for the failure of cost reduction and efficiency improvement.

The end of the year is when everyone sprints to hit performance targets, yet major incidents at internet companies keep coming one after another. They have turned "cost reduction and efficiency improvement" into "cost reduction and increased laughter"; this is no longer just a meme, but the companies' own official self-mockery.

image

Just after Double Eleven, Alibaba Cloud suffered an epic, globally record-setting outage, followed by a string of incidents through November: after several minor failures came a two-hour major outage of the cloud database management service across multiple countries. The cadence went from monthly blowups to weekly blowups to daily blowups.

No sooner had that happened than Didi suffered a major outage lasting over 12 hours, with losses running into the hundreds of millions. Its alternative, Alibaba's ride-hailing service, was flooded with orders and profited handsomely; for Alibaba, what was lost on one front was recouped on another.

image

I have already reviewed Alibaba Cloud's epic failure in my article "What Can We Learn from Alibaba Cloud's Epic Failure": Auth went down due to a misconfiguration, with the root cause suspected to be the OSS/Auth circular dependency, which deadlocked when the blacklist/whitelist configuration was changed.

Didi's issue was reportedly a botched Kubernetes upgrade. Such an astonishingly long recovery time usually points to storage or databases, and it is reasonable to speculate that the root cause was accidentally downgrading the K8S masters while jumping multiple versions at once, which corrupted the metadata in etcd, ultimately bringing down all the nodes and making a quick rollback impossible.

Failures are unavoidable, whether due to hardware defects, software bugs, or human errors; the probability of occurrence cannot be reduced to zero. However, a reliable system should have fault-tolerant resilience — it should be able to anticipate and respond to these failures and minimize the impact, shortening the overall failure time as much as possible.

Unfortunately, in this regard, the performance of these internet giants is far below standard — at least their actual performance is far from their claimed "1 minute to detect, 5 minutes to handle, 10 minutes to recover."

Cost Reduction and Increased Laughter

According to Heinrich's law, behind every major accident there are 29 minor accidents, 300 near misses, and thousands of hidden hazards. In the civil aviation industry, two consecutive precursor events, even with no real consequences and even if neither qualifies as an accident, would immediately trigger a terrifyingly strict industry-wide safety overhaul.

image

Reliability matters not only for critical services like air traffic control and flight control systems; we also expect ordinary services and applications to run reliably. A global outage at a cloud vendor is almost the equivalent of a power or water outage; a failure of a ride-hailing platform partially paralyzes the transportation network; and the unavailability of e-commerce platforms and payment tools means significant losses in revenue and reputation.

The internet has penetrated every corner of our lives, yet effective regulation of internet platforms has yet to be established. Industry leaders choose to lie low and play dead in the face of crises; no one even steps forward to do honest crisis PR or a candid failure review. No one is there to answer: why did these failures happen? Will they keep happening? Have other internet platforms examined themselves in light of this? Have they confirmed that their contingency plans still work?

We do not know the answers to these questions. But it is certain that the unrestrained accumulation of complexity and the consequences of large-scale layoffs are beginning to show, and service failures will become increasingly frequent, to the point of becoming a new normal — anyone could become the next "unlucky one" that draws laughter. To escape this bleak fate, what we need is genuine "cost reduction and efficiency improvement."

Cost Reduction and Efficiency Improvement

When a failure occurs, handling it goes through a process of detecting the problem, analyzing and locating it, and then resolving it. All of this work requires the system's R&D and operations staff to invest mental effort, and there is a basic rule of thumb for the process:

The time spent on handling failures t = the complexity W of the system and the problem / the available intellectual power P.
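Written out in symbols (the notation below simply restates the sentence above; it is not taken from the original article):

```latex
% t : time spent handling / recovering from a failure
% W : complexity of the system and of the problem at hand
% P : intellectual power (expert bandwidth) available during the incident
t \approx \frac{W}{P}
```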

The goal of optimizing failure handling is to shorten the recovery time t as much as possible. For example, Alibaba likes to talk about the "1-5-10" stability metrics: 1 minute to detect, 5 minutes to handle, 10 minutes to recover, which sets a hard time target.

image

Under a hard time constraint, you either reduce cost or increase efficiency. But the cost to cut should not be headcount; it should be the cost of system complexity. And the efficiency to raise should not be the volume of incident notices and jokes; it should be the available intellectual power and management effectiveness. Unfortunately, many companies do poorly on both counts, turning cost reduction and efficiency improvement into cost reduction and increased laughter.

Reducing Complexity Costs

Complexity has various aliases — technical debt, spaghetti code, quagmire, architectural juggling. Symptoms may manifest as: explosive state space, tight coupling between modules, tangled dependencies, inconsistent naming and terminology, hacks to solve performance issues, special cases that need to be circumvented, etc.

==Complexity is a cost, and therefore simplicity should be a key goal when building systems.== However, many technical teams do not consider this when formulating plans, but rather make things as complex as possible: tasks that can be solved with a few services are unnecessarily split into dozens of services using microservices concepts; with few machines, they insist on deploying a Kubernetes setup for elastic juggling; tasks that can be solved with a single relational database are unnecessarily divided among several different components or turned into a distributed database.

These behaviors introduce a large amount of additional complexity, complexity that emerges from the particular implementation rather than from the problem itself. A typical example: many companies, regardless of actual need, like to shove everything into K8S (etcd, Prometheus, the CMDB, even the databases), and once a problem arises they face a circular-dependency disaster in which one major failure can incapacitate the whole system.
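To make the circular-dependency point concrete, here is a small sketch; the component names and edges are illustrative, not an inventory of any real system. It walks an infrastructure dependency graph and reports a cycle, the situation where everything runs on K8S while K8S itself depends on some of those same things:

```python
# Illustrative only: components and edges are made up for this example.
from collections import defaultdict

def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of components, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = defaultdict(int)
    path: list[str] = []

    def dfs(node: str) -> list[str] | None:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            if color[dep] == GRAY:        # back edge: cycle found
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for component in list(deps):
        if color[component] == WHITE:
            cycle = dfs(component)
            if cycle:
                return cycle
    return None

# "x depends on y" edges: the database and monitoring run on K8S,
# while K8S itself needs the database, closing the loop.
deps = {
    "kubernetes": ["etcd", "database"],
    "database":   ["kubernetes"],
    "prometheus": ["kubernetes"],
    "cmdb":       ["database"],
}
print(find_cycle(deps))   # ['kubernetes', 'database', 'kubernetes']
```

A dependency graph that stays acyclic is exactly what lets you bootstrap and recover components one at a time during a major failure.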

Furthermore, where complexity costs should be paid, many companies are unwilling to pay them: a data center runs one single oversized K8S cluster rather than several smaller clusters that would permit gray-release validation, blue-green deployment, and rolling upgrades. They find compatible, one-version-at-a-time upgrades cumbersome and insist on jumping several versions at once.
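As a toy illustration of that one-version-at-a-time discipline (my own sketch, not anything from the article or from Kubernetes tooling), here is the difference between planning an explicit stepwise upgrade path and jumping straight to the target:

```python
# Toy sketch: compute a stepwise minor-version upgrade path instead of
# jumping several versions at once. Version strings are simplified.

def upgrade_path(current: str, target: str) -> list[str]:
    """Return the minor versions to pass through, one step at a time."""
    cur_major, cur_minor = (int(x) for x in current.split(".")[:2])
    tgt_major, tgt_minor = (int(x) for x in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("cross-major upgrades need a separate plan")
    if tgt_minor <= cur_minor:
        raise ValueError("target must be a newer minor version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]

# Going from 1.21 to 1.25 means passing through every intermediate minor
# version, validating each step (ideally on a small canary cluster first),
# rather than jumping straight to 1.25.
print(upgrade_path("1.21", "1.25"))   # ['1.22', '1.23', '1.24', '1.25']
```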

In a distorted engineering culture, many engineers take pride in brute, oversized scale for its own sake and in the tightrope act of architectural juggling; the technical debt piled up along the way comes back to bite them during failures.

==“Intellectual power” is another important issue. Intellectual power is difficult to accumulate spatially — the intellectual power of a team often depends on the level of a few senior key figures and their communication costs. For example, when there is a problem with the database, a database expert is needed to solve it; when there is a problem with Kubernetes, a K8S expert is needed to look into it;==

==However, when you put the database into Kubernetes, it is difficult to combine the intellectual bandwidth of a separate database expert and a K8S expert — you need a dual-skilled expert to solve the problem. The same applies to the circular dependency of authentication services and object storage — you need engineers who are familiar with both. Using two independent experts is not impossible, but the synergy gain between them can easily be diminished by the squared growth of communication costs, leading to negative returns, which is why more people can become less effective during failures.==

When the cost of system complexity exceeds the team's intellectual power, catastrophic failures become easy to trigger. Yet this is often hard to see in normal times, because debugging, analyzing, and fixing a misbehaving service is far more complex than merely keeping it running. It can look as though cutting two people here and three people there still leaves the system running normally.

==But the tacit knowledge of the organization dissipates as experienced personnel leave, and after a certain degree of loss, the system is already on the verge of collapse — it just takes a certain opportunity to trigger an explosion.== In the ruins, a new generation of young inexperienced workers gradually becomes experienced, only to be dismissed for losing cost-effectiveness, and the cycle continues.

Increasing Management Effectiveness

Are Alibaba Cloud and Didi unable to recruit enough excellent engineers? No. Their management level and philosophy are poor, and they fail to make good use of those engineers. I have worked at Alibaba, at Nordic-style startups like Tantan, and at foreign companies like Apple, so I have a deep sense of the gap in management levels. A few simple examples:

The first point is OnCall duty. At Apple, our team had over a dozen people distributed across three time zones: Berlin in Europe, Shanghai in China, and California in the USA, with working hours seamlessly connected. Engineers in each location had the intellectual power to handle various issues, ensuring that at any given moment, someone was available for OnCall duty during working hours, without affecting their quality of life.
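For the curious, here is a quick back-of-the-envelope check (my own illustration, not the author's scheduling tool) that ordinary 09:00-18:00 working hours in those three locations really do cover the whole clock:

```python
# Toy sketch: do the three offices' normal working hours cover all 24 UTC hours?
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

OFFICES = {                                   # local working hours: 09:00-18:00
    "Berlin":     ZoneInfo("Europe/Berlin"),
    "Shanghai":   ZoneInfo("Asia/Shanghai"),
    "California": ZoneInfo("America/Los_Angeles"),
}

def covered_utc_hours(day: datetime) -> set[int]:
    """UTC hours of the given day that fall inside some office's working hours."""
    covered = set()
    for hour in range(24):
        moment = day.replace(hour=hour, tzinfo=timezone.utc)
        for tz in OFFICES.values():
            if 9 <= moment.astimezone(tz).hour < 18:
                covered.add(hour)
                break
    return covered

gaps = set(range(24)) - covered_utc_hours(datetime(2023, 11, 29))
print("uncovered UTC hours:", sorted(gaps))   # [] : every hour is someone's daytime
```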

In contrast, at Alibaba, OnCall often becomes yet another burden on R&D, with surprises possible 24 hours a day; being bombarded with alerts in the middle of the night is not uncommon. And where simply spending resources would solve the problem, they are stingy: by the time an R&D engineer groggily wakes up, boots the computer, and connects to the VPN, several minutes have already passed.

The second point is system construction. For example, looking at reports on failure handling, if core infrastructure service changes are made without testing, monitoring, alerts, verification, gray releases, or rollbacks, and if architectural circular dependencies are not considered, then it truly deserves to be called a makeshift team. To give a specific example: the monitoring system. A well-designed monitoring system can greatly shorten the time to determine failures — this essentially involves conducting data analysis on server metrics/logs in advance, which is often the step that requires the most intuition, inspiration, and insight, and is the most time-consuming.

The inability to locate the root cause reflects a lack of observability and inadequate failure response plans. For example, when I was a PostgreSQL DBA, I created this monitoring system, as shown on the left, with dozens of dashboards closely organized; any PG failure could be pinpointed within a minute by drilling down a few layers with a mouse click, allowing for rapid response according to the plan.

image

image

Now look at the monitoring system offered for Alibaba Cloud RDS for PostgreSQL and the PolarDB cloud database: everything amounts to this one pitiful page. If this is what they use to analyze and locate failures, it is no wonder it takes them dozens of minutes.
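For flavor, here is a minimal sketch of what a pre-built drill-down step can look like. It assumes a Prometheus-compatible metrics API at a placeholder address; the article does not name the author's actual monitoring stack, and the metric name used below is hypothetical:

```python
# Illustrative only: the endpoint is a placeholder and "pg_instance_up" is a
# hypothetical metric. The point is that the first layer of failure triage can
# be a canned query rather than ad-hoc log spelunking under pressure.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"      # assumed Prometheus-compatible API

def instant_query(expr: str) -> list[dict]:
    """Run a PromQL instant query and return the result vector."""
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# First drill-down layer: which PostgreSQL instances look down right now?
for sample in instant_query("pg_instance_up == 0"):
    print("instance down:", sample["metric"].get("instance", "?"))
```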

==The third point is management philosophy and insight. For example, building stability requires an investment of ten million, and there will always be opportunistic teams claiming: we only need five million or less — and then they might do nothing, betting that no problems will arise; if they win the bet, they profit for free, and if they lose, they walk away. But it is also possible that this team has the real ability to reduce costs with technology, but how many people in leadership positions have enough insight to truly discern this?==

Moreover, hard-won failure experience is actually a very valuable asset for engineers and companies alike; it is a lesson paid for with real money. Yet many managers, the moment a problem arises, think first of "sacrificing a programmer or an ops engineer," handing that asset to the next employer for free. Such an environment naturally breeds a culture of blame-shifting, doing nothing, and playing it safe.

The fourth point is a people-oriented approach. For example, as a DBA at Tantan I automated almost all of my database management work. I could do this because I was allowed to enjoy the benefits of that technical progress: by automating my own job, I had plenty of time to drink tea and read the news, and the company would not fire me just because my work was automated and I spent my days drinking tea and reading the news. Spared ==the issue of lack of security==, I could explore at will, and that is how I was able to build a complete open-source RDS by myself.

But could such a thing happen in an environment like Alibaba's? "Today's best performance is tomorrow's baseline requirement." So you automated your work? Then if your hours are not fully utilized, managers will find trivial chores or pointless meetings to fill them; worse, once you have painstakingly built up the system and made yourself less irreplaceable, you promptly get the "once the cunning hare is caught, the hound is cooked" treatment, and in the end the person who talks the loudest in meetings reaps the rewards. So the dominant strategy in this game becomes: slack off whenever you can, go through the motions, and coast along until a major failure blows up.

Most frightening of all, domestic companies emphasize that "everyone is a replaceable cog" and that people are used up by the age of 35, with mass layoffs a routine occurrence. If job security has become a pressing concern, who can still focus on doing the job well?

==Mencius said: "If a ruler treats his ministers as his hands and feet, the ministers will regard the ruler as their heart and belly; if a ruler treats his ministers as dogs and horses, the ministers will regard the ruler as any other man; if a ruler treats his ministers as dirt and weeds, the ministers will regard the ruler as a robber and an enemy."==

This backward management level is where many companies truly need to improve efficiency.
