Operations

Where can things go wrong in a system, and how can we mitigate them? Increasing resiliency is the entire purpose of operations. Facebook's infrastructure might have worked as a monolith on a single server in the early days, when its users consisted of a single college population. However, it needed to scale up to hundreds of thousands of servers to serve a worldwide user base. It also needed to detect when to scale and when things went wrong in a system rapidly growing in complexity.

The only constant in life is change, so engineers must constantly observe and maintain the health of their systems through operations. Operational infrastructure refers to all the processes and tools necessary to keep a system functional over time. A system that works today is not guaranteed to keep working as features are added or changed and as scale increases with user demand.

To put this material into perspective, the topics covered in this sub-challenge account for at most the last 5 minutes of an interview.

https://www.loom.com/share/38480aaf6bbc4b34bbead5f163cf5ae9?sid=112ccf50-2e59-4ba8-aa8c-5797bd25080d

Image source (Correction: there should be a worker between the message queue and the database.)

🔊 Communication tips

Condensing years of industry experience in maintaining systems and handling operations into a brief wrap-up at the end of your design interview means you need to be concise while still making it clear that you know what you're talking about. This section covers the background and context you should know to competently describe the criteria you would measure and the tools you would use to keep a system running.

📚 Knowledge

The first step of resiliency analysis is identifying the concerns. General concerns apply to any distributed system, while specific concerns are problems directly relevant to the system at hand. General concerns are worth touching on briefly, but discussing them shows less critical analysis than discussing specific concerns. Prevention is the next step, which takes the form of monitoring and finding solutions to mitigate the impact of those concerns.

⚠️ Resiliency assessment

Engineers must address aspects of a system that cause failures or prevent it from operating as expected. Examples of generic concerns are too many requests/second and the resulting high latency. For a video-sharing system, specific concerns include copyright infringement and content moderation of user-submitted videos.

🧠 Wisdom

Service outages and database crashes can happen anywhere. Therefore, identifying the single points of failure (SPOF) is important for anticipating where to add resiliency and determining the consequences of potential issues.
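One way to make SPOF identification concrete is to model the request path as a directed graph and ask which components, if removed, cut the client off from the data store. The following is a minimal sketch under that assumption; the topology in `ARCHITECTURE` and the function names are hypothetical and are not taken from the diagram above.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search: can `goal` be reached from `start` if `removed` is down?"""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return every intermediate node whose removal disconnects start from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=n)
    )


# Hypothetical topology: two API servers, but only one of everything else.
ARCHITECTURE = {
    "client": ["load_balancer"],
    "load_balancer": ["api_1", "api_2"],
    "api_1": ["message_queue"],
    "api_2": ["message_queue"],
    "message_queue": ["worker"],
    "worker": ["database"],
}

print(single_points_of_failure(ARCHITECTURE, "client", "database"))
# ['load_balancer', 'message_queue', 'worker']
```

The replicated API servers do not appear in the result because either one can carry the traffic. The sketch excludes the start and goal nodes, so an unreplicated database still has to be considered separately.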

🚦 Mitigations

Mitigation strategies follow the same pattern as resiliency concerns; there are both generic and system-specific approaches to consider.

📚 Knowledge

Most systems employ common mitigation strategies, such as rate limiting and retries, to limit the impact when a system fails to scale or operate correctly. Engineers must also monitor metrics by collecting and aggregating data points that can be analyzed along various dimensions. Examples of generic metrics are requests/second, bytes of data created per resource, and the percentage of calls that threw exceptions.
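The two generic mitigations named above can be sketched in a few lines. This is an illustrative sketch, not a production implementation: it assumes a token-bucket algorithm for rate limiting and exponential backoff with jitter for retries, and the names `TokenBucket` and `retry_with_backoff` are hypothetical. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
import random
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilling `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject or queue the request


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry `call`, waiting a random delay of up to base_delay * 2**attempt between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Assumption: every exception is transient. Real code should
            # retry only on errors known to be safe to retry.
            if attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Simulated timestamps: three requests at t=0, one more at t=1.
bucket = TokenBucket(rate=1.0, capacity=2)
print([bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)])
# [True, True, False, True]
```

The jitter spreads retries out so that many clients failing at the same moment do not all retry at the same moment, which would otherwise add load to a system that is already struggling.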

🧠 Wisdom

Meanwhile, using image processing to block explicit videos from being uploaded and using automated content moderation to ban users who violate community guidelines are examples of specific mitigation strategies directly relevant to YouTube or a similar system. Similarly, specific metrics are business and operational metrics, such as videos watched/uploaded per hour, orders delivered/hour, songs played/minute, and the number of Likes waiting to be counted.
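To illustrate how such metrics might be derived from raw data points, the sketch below aggregates event timestamps into per-hour counts, computes an error percentage, and flags a backlog that has grown past a limit. The function names, the sample timestamps, and the threshold value are all assumptions made for the example.

```python
from collections import Counter


def per_hour_counts(event_timestamps):
    """Bucket Unix timestamps (in seconds) into counts keyed by the start of each hour."""
    counts = Counter()
    for ts in event_timestamps:
        counts[int(ts) // 3600 * 3600] += 1
    return dict(counts)


def error_percentage(total_calls, failed_calls):
    """Percentage of calls that threw exceptions; 0.0 when there were no calls."""
    if total_calls == 0:
        return 0.0
    return 100.0 * failed_calls / total_calls


def backlog_alert(queue_depth, threshold):
    """True when more items are waiting (e.g. uncounted Likes) than the threshold allows."""
    return queue_depth > threshold


# Hypothetical upload events: three in the first hour, one in each of the next two.
uploads = [0, 10, 3599, 3600, 7300]
print(per_hour_counts(uploads))      # {0: 3, 3600: 1, 7200: 1}
print(error_percentage(200, 5))      # 2.5
print(backlog_alert(12_000, 10_000)) # True
```

In practice these values would be emitted to a monitoring system and alerted on there; the point of the sketch is only that a business metric such as videos uploaded per hour is an aggregation over the same kind of data points as a generic metric.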

Ankit Kashyap
