software engineering

Software Architects

Somebody mentioned Gregor Hohpe’s book The Software Architect Elevator. I’m going to have to read it to be able to reconcile it against this feedback that emphatically states there’s no room for architects in software.

Meanwhile, elevator simulators are a classic software engineering/scheduling exercise.

On building the plane while flying it

“Building a plane while flying it” or some variation has been used to describe innumerable situations, but is it as ridiculous and unprecedented as those who invoke it would have you believe? » about 5000 words

Bare metal clouds are hard

The problem, explains Eclypsium, is that a miscreant could rent a bare-metal server instance from a provider, then exploit a firmware-level vulnerability, such as one in UEFI or BMC code, to gain persistence on the machine, and the ability to covertly monitor every subsequent use of that server. In other words, injecting spyware into the server’s motherboard software, which runs below and out of sight of the host operating system and antivirus, so that future renters of the box will be secretly snooped on.

Indeed, the researchers found they could acquire, in the Softlayer cloud, a bare-metal server, modify the underlying BMC firmware, release the box for someone else to use, and then, by tracking the hardware serial number, wait to re-provision server to see if their firmware change was still intact. And it was. BMC is the Baseband Management Controller, the remote-controllable janitor of a server that has full access to the system.

» about 500 words

There are no architects at Facebook

We get there through iteration. We don’t try to build an architecture that is failproof. Building an architecture and worrying about it for months and months at a time before you actually go deploy it tends to not get us the result we want because by the time we’ve actually deployed something the problem has moved or there are more technologies available to solve different problems.

We take it seriously enough to say “there are no architects on the team.”

We do a very “you build it you own it” process, where any team or any individual or any engineer that builds or designs something, they own it, and they do the on-call for it.

On call is where we learn, and that’s how we improve over time.

You build a system…you don’t have to be perfect. Deploy it, and as long as you have enough detection and mitigation capabilities, you will do OK. And you will learn, and you will iterate over it, and you will get better over time.

From the NANOG73 keynote: “Operations first, feature second” by Facebook VP of Network Engineering Najam Ahmad. It’s at about the 10:20 mark in the video:

Market risks and opportunities for Linux distro vendors

IBM’s acquisition of Red Hat got me thinking about how the market for commercially supported Linux distros is changing. IBM is trying to find a foothold in a maturing market dominated by AWS while the market for enterprise data centers is shrinking. So, where is Linux being used (or will be used), and what’s changing in those spaces? » about 1200 words

Kubesprawl

This leads to the emerging pattern of “many clusters” rather than “one big shared” cluster. Its not uncommon to see customers of Google’s GKE Service have dozens of Kubernetes clusters deployed for multiple teams. Often each developer gets their own cluster. This kind of behavior leads to a shocking amount of Kubesprawl.

From Paul Czarkowski discussing the reasons and potential solutions for the growing number of Kubernetes clusters.

Hard solutions to container security

The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host.

From Aleksa Sarai explaining the latest Linux container vulnerability.

To me, the underlying message here is: Containers are Linux.

From Scott McCarty washing his hands of it.

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs.

From the Kata Containers website. The project is intended to be “compatible with the OCI specification for Docker containers and CRI for Kubernetes” while running those containers in a VM instead of a namespace.

The future of Kubernetes is Virtual Machines, not Containers.

From Paul Czarkowski, discussing multitennancy problems and solutions for Kubernetes.

Shuffle sharding in Dropbox's storage infrastructure

Volumes are spread somewhat-randomly throughout a cell, and each OSD holds several thousand volumes. This means that if we lose a single OSD we can reconstruct the full set of volumes from hundreds of other OSDs simultaneously. This allows us to amortize the reconstruction traffic across hundreds of network cards and thousands of disk spindles to minimize recovery time. » about 300 words

Parts of a network you should know about

If you’re running infrastructure and applications on AWS then you will encounter all of these things. They’re not the only parts of a network setup but they are, in my experience, the most important ones.

The start of Graham Lyons’ introduction to networking on AWS, which (though the terms may change) is a pretty good primer for networking in any cloud environment. Though cloud infrastructure providers have to deal with things at a different later, Graham’s post covers the basics—VPCs, subnets, availability zones, routing tables, gateways, and security groups—that customers need to manage when assembling their applications.

We're gonna need a bigger PRNG cycle length...

The general lesson here is that, even for a high quality PRNG, you can’t assume a random distribution unless the generator’s cycle length is much larger than the number of random values you’re generating. A good general heuristic is —

If you need to use n random values you need a PRNG with a cycle length of at least n².

From a 2015 post by Mike Malone on PRNGs vs. random key collisions. The Chrome/V8 bug that caused Mike to write nearly 5000 words to explain has since been fixed, but you can check your browser’s PRNG here.

Interfaces, surface area, durability

A DOS program can be made to run unmodified on pretty much any computer made since the 80s. » about 100 words

In praise of refactoring

Under the right conditions refactoring provides a sort of express lane to becoming a master developer. […] Through refactoring, a developer can develop insights, skills, and techniques more quickly by addressing a well understood problem from a more experienced perspective. Practice make perfect. If not the code, maybe the coder.

From Patrick Goddi, who argues refactoring is about more than code quality.

Apple CloudKit uses FoundationDB Record Layer

Together, the Record Layer and FoundationDB form the backbone of Apple’s CloudKit. We wrote a paper describing how we built the Record Layer to run at massive scale and how CloudKit uses it. Today, you can read the preprint to learn more.

From an anonymous FoundationDB blog post introducing relational database capabilities built atop FoundationDB’s key-value store. The paper about CloudKit (PDF) is also worth a read. CloudKit is Apple’s free at any legitimate scale back-end as a service for all iOS and MacOS apps.

Technology choices, belonging, and contempt

I was taught to be contemptuous of the non-blessed narratives, and I was taught to pay for my continued access to the technical communities through perpetuating that contempt. I was taught to have an elevated sense of self-worth, driven by the elitism baked into the hacker ethos as I learned to program. By adopting the same patterns that other, more knowledgable people expressed I could feel more credible, more like a real part of the community, more like I belonged.

I bought my sense of belonging, with contempt, and paid for it with contempt and exclusionary behaviour.

And now, I realise how much of it is an anxiety response. What if I chose the wrong thing? What if other people judge me for my choices and assert that my hard-earned skills actually aren’t worth anything?

From Aurynn Shaw on cultures of exclusion and contempt in technology professions.

Rollback buttons and time machines

Adding a rollback button is not a neutral design choice. It affects the code that gets pushed. If developers incorrectly believe that their mistakes can be quickly reversed, they will tend to take more foolish risks. […]

Mounting a rollback button within easy reach […] means that it’s more likely to be pressed carelessly in an emergency. Panic buttons are for when you’re panicking.

From Dan McKinley, speaking about the complications and near impossibility of rolling back a deployment.

Don't let requests linger

In practice, we have fixed whole classes of reliability problems by forcing engineers to define deadlines in their service definitions.

From Ruslan Nigmatullin and Alexey Ivanov on Dropbox’s migration to gRPC. Also consider request replication.

Incident postmortems: customer communication

How do we learn from incidents, and how do we rebuild customer trust after an incident? Customer-facing postmortems are critical to this, but they have to answer the right questions. » about 700 words

PID controllers are way cooler than the Wikipedia article lets on

PID controllers are all around us. Their straightforward, closed loop cycle of set-point, observation, and response are the basis for almost every control system in our everyday world, but the the Wikipedia article obscures their simple beauty and ubiquity. » about 700 words

Common root causes of intra data center network incidents at Facebook from 2011 to 2018

From A Large Scale Study of Data Center Network Reliability by Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, and Onur Mutlu, the categorized root causes of intra data center incidents at Fabook from 2011 to 2018:

Category	Fraction	Description
Maintenance	17%	Routine maintenance (for example, upgrading the software and firmware of network devices).
Hardware	13%	Failing devices (for example, faulty memory modules, processors, and ports).
Misconfiguration	13%	Incorrect or unintended configurations (for example, routing rules blocking production traffic).
Bug	12%	Logical errors in network device software or firmware.
Accidents	11%	Unintended actions (for example, disconnecting or power cycling the wrong network device).
Capacity planning	5%	High load due to insufficient capacity planning.
Undetermined	29%	Inconclusive root cause.

Two notes worth considering:

We use “failures” to refer to any network device misbehavior. The root cause of a failure includes not only hardware faults, but also misconfigurations, maintenance mistakes, firmware bugs, and other issues.

And:

We use Govindan et al.’s definition of root cause: “A failure event’s root-cause is one that, if it had not occurred, the failure event would not have manifested.”

Time synchronization is rough

CloudFlare on the frustrations of clock skew:

It may surprise you to learn that, in practice, clients’ clocks are heavily skewed. A recent study of Chrome users showed that a significant fraction of reported TLS-certificate errors are caused by client-clock skew. During the period in which error reports were collected, 6.7% of client-reported times were behind by more than 24 hours. (0.05% were ahead by more than 24 hours.) This skew was a causal factor for at least 33.5% of the sampled reports from Windows users, 8.71% from Mac OS, 8.46% from Android, and 1.72% from Chrome OS.

They’re proposing Roughtime as a solution.

Windows 95 was 30MB. Today we have web pages heavier than that!

The title is a quote from Nikita Prokopov, who is wallowing in disenchantment.

git foo

A few git commands I find myself having to look up:

Resolve Git merge conflicts in favor of their changes during a pull:

git pull -Xtheirs
git checkout --theirs the/conflicted.file

Source

Viewing Unpushed Git Commits

git log origin/master..HEAD

You can also view the diff using the same syntax:

git diff origin/master..HEAD

Or, “for a little extra awesomeness”

git log --stat origin/master..HEAD

Updated since it was first posted:

Starting with Git 2.5+ (Q2 2015), the actual answer would be git log @{push}… See that new shortcut @{push}