Michael Wolf
Projects and thoughts from the best view in Cincinnati an okay view in San Jose a chill view of the interstates of Oakland

Observability For All: Notes From O11ycon 2018
Last edited - 18/08/20

Date: 2018/08/02

Location: The Pearl, San Francisco

O11ycon 2018 was a conference organized by the team at Honeycomb.io, focused on exploring and developing the concept of observability in the context of distributed systems engineering. Called a "co-creation event" and, "y'know, hippieshit" by organizer Rachel Chalmers, the conference was open-ended, and I had the pleasure of participating and even presenting on stage during the "call for failures." Overall, the experience was wonderful, and with buckets of coffee I was able to meet people from a variety of industries and learn how we have all dealt with the complexity of modern software deployment and site reliability.

Setting

The Pearl is a gorgeous event space in the Dogpatch area of San Francisco. It features a large auditorium area, a mezzanine, and a rooftop space with peaceful views of the neighborhood. The organizers of the conference had really decked the place out well, especially for their first event. That being said, it was admittedly San Francisco to the point of stereotype: gender-identifying stickers in the bathroom, chia-seed pudding for breakfast, compostable everything, and a space reserved for relaxing with a cup of observabili-tea. But it was great! It was a welcoming and unpretentious environment. The focal point of the conference, small open-space discussions clustered around attendee-submitted topics, was spread across the various corners of the venue. These discussions helped guide the larger conversation about how to define observability and share best practices for implementing and striving toward it.

I went into the conference skeptical. My team lead had recommended I attend because he has great respect for Charity Majors and the work she has done in the system monitoring and SRE space. When the emails and Slack messages flooded in, full of buzzwords and talk of open-space discussions, I was a little panicked. I wasn't trying to go to a meeting of On-Call-Engineers-Anonymous. Where would the sponsor swag be? Furthermore, I was going to this conference without knowing anyone. Regardless, I submitted my talk about taking down our live environment, and got ready to take a Thursday off to go to the City.

Questions Going In

Observability

On-Call Culture

My caffeinated requests from the app designers of the world

Onboarding and Documentation Culture

Quotes and Takeaways

"Nines don't matter if the user isn't happy" -Charity Majors
"We're more likely to be paged by a server issue impacting nobody than a client issue impacting everybody"

Action Items

My Slides

PDF

A couple of months before the conference, a Call for Failures (CFF) was put out, asking anyone to share a story of a time they royally screwed up or troubleshot a problem. The presentations were to be short overviews, five minutes or so, of the incident and its resolution. This segment was coordinated by pie (Rachel Perkins), and it was great fun to get up, listen to, and share stories of operations gone wrong. It was affirming and participatory in the best way.

I got an email about the last day of the call for failures about an hour after I had accidentally taken down my company's entire playerbase via a bad entry in AWS. Obviously this was a sign to share my experience. Since I work in the video game industry, my story presented a unique use case for observable, human-resilient systems on a tight budget. I was able to share the stage with a former Israeli intelligence agent and a mechanical engineer, who each had different interpretations of and responses to failure. It was cool, and I wasn't really that afraid to go on stage!

Links add napkin, o11ycon logo, my slides, photos