Raw notes.

Recounting the story of a Joyent outage

Here is The Register piece on the outage.

How did we get here?

He points out the flip side of such highly sophisticated automation: the stress on the humans in the loop is amplified. Human error in a semi-automated system is worse than human error in a non-automated system.

Human fallibility in semi-automated systems

Recounted the story of the Air Canada flight that ran out of fuel mid-flight: a 767-200 in 1983, the one that came to be known as the Gimli Glider. The fuel mishap traced back to a botched conversion between imperial and metric units somewhere in the fueling process.
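
The arithmetic behind that mishap is worth spelling out. A rough sketch, using figures from the commonly published accounts of the incident rather than from the talk: the crew multiplied a drip-stick reading in liters by the fuel density in pounds per liter, then read the result as kilograms.

    # Rough reconstruction of the Air Canada 143 fuel arithmetic.
    # Figures follow the commonly published account; illustrative only.

    LITERS_IN_TANKS = 7_682       # drip-stick measurement of fuel on board
    FUEL_REQUIRED_KG = 22_300     # fuel needed for the flight, in kilograms
    DENSITY_LB_PER_L = 1.77       # pounds per liter: the constant the crew used
    DENSITY_KG_PER_L = 0.803      # kilograms per liter: the constant they needed

    # Wrong: liters * (lb/L) yields POUNDS, but the result was read as kilograms.
    believed_kg = LITERS_IN_TANKS * DENSITY_LB_PER_L          # ~13,597 (pounds)

    # Fuel then loaded to "top up" to the required amount, same wrong constant.
    liters_loaded = (FUEL_REQUIRED_KG - believed_kg) / DENSITY_LB_PER_L

    # What was actually on board at departure, converted correctly.
    actual_kg = (LITERS_IN_TANKS + liters_loaded) * DENSITY_KG_PER_L

    print(f"crew believed: {FUEL_REQUIRED_KG:,} kg")
    print(f"actually had:  {actual_kg:,.0f} kg")              # roughly 10,100 kg

The same liters flow through both paths; only the unit of the density constant differs. That is exactly the kind of mismatch that units-aware types, or a cross-check against the fuel gauges (inoperative that day), would have caught.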

Amazon S3 outage (presumably the February 2017 us-east-1 incident, in which a mistyped argument to a routine command removed far more server capacity than intended).

Whither microservices?

Microservices suffer from the amplification problem mentioned above.

Some non-IT illustrations

1965 power outage in the Northeast

This illustrates the notion that the load has to go somewhere: when one part of the grid trips, its load shifts onto the rest, and the failure can cascade.

Used the example of Three Mile Island. When you have auxiliary systems, those systems go unexercised, so you don't know whether they work until the moment you need them. And the more alarms and alerts you have, the more likely they are to overload the operators.

We are gleefully deploying these distributed systems and telling ourselves they will not fail.

Debugging in the abstract

Debugging is the process by which we understand pathological behavior in the system.

I like how he acknowledges that we have it easy in the software world, compared to the real world. He is a very entertaining speaker, but I don't like how he is yelling at us.

Debugging is the ability to ask the right questions. He described it as a continually narrowing set of constraints: each answer eliminates hypotheses until the pathology is cornered.
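
One way to picture that narrowing, as a sketch of my own and not from the talk: a bisection over a commit history, where each question ("does this build exhibit the bug?") halves the remaining space of hypotheses, much as git bisect does.

    # My illustration, not from the talk: debugging as constraint narrowing.
    # Each probe halves the space of commits that could have introduced the bug.

    def bisect_first_bad(commits, is_bad):
        """Return the first commit for which is_bad(commit) is True.

        Assumes commits[0] is good, commits[-1] is bad, and badness is
        monotone: once introduced, it persists in later commits.
        """
        lo, hi = 0, len(commits) - 1   # invariant: commits[lo] good, commits[hi] bad
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if is_bad(commits[mid]):
                hi = mid               # bug present at mid: look earlier
            else:
                lo = mid               # bug absent at mid: look later
        return commits[hi]

    # Example: the bug first appears at commit 13 of 100.
    print(bisect_first_bad(list(range(100)), lambda c: c >= 13))   # -> 13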

The craft of debuggable software

One slide as a nod to what you need to do to make things debuggable.
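
He didn't show code for it, but here is a minimal sketch of the kind of practice that slide gestures at (the example and its names are mine, not from the talk): log structured context as you go, and fail fast with the relevant state attached, so a failure explains itself instead of surfacing somewhere unrelated.

    # A minimal sketch of building for debuggability; my illustration.
    import json
    import logging
    import sys

    logging.basicConfig(stream=sys.stderr, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("orders")

    def log_event(event: str, **context) -> None:
        """One machine-parseable JSON object per event, context included."""
        log.info(json.dumps({"event": event, **context}))

    def apply_discount(order_id: str, total_cents: int, discount_cents: int) -> int:
        log_event("apply_discount", order_id=order_id,
                  total_cents=total_cents, discount_cents=discount_cents)
        new_total = total_cents - discount_cents
        # Fail fast, carrying the state that matters, rather than letting a
        # nonsensical value propagate and blow up far from the real cause.
        if new_total < 0:
            raise ValueError(f"negative total for order {order_id}: "
                             f"total={total_cents} discount={discount_cents}")
        return new_total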

A culture of debugging

We must have an organizational culture that supports taking the extra time to build for debuggability.

When you have an outage, you need to harvest all the useful information and learn from it. Every outage presents an opportunity to advance understanding.