
The Accidental Site Reliability Engineer

Throughout my IT career I have always wanted to see the big picture: how the systems connect, and how to control and maintain them. Of course there are those who are experts at one thing and others who are astute at multiple aspects of IT. I wanted to be an expert in network engineering and infrastructure. I had the background from my days in the United States Air Force, so the transition into a similar civilian career field made sense. I learned, however, that the civilian world, mainly at start-ups, requires a taste for disorganization and the ability to enjoy wearing multiple hats, whether you like it or not.
One of the first start-ups I worked at had great clients and was on the rise. The company was a “Software as a Service” company, so of course there was the usual band of coders, ops and configuration teams. There were constant challenges with the reliability of the service: the usual 503, 504 and 404 errors and more. Most times, rebooting the servers brought the site back up to normal, but that didn’t stop the client panic, and clients essentially wanted a long-term solution. This seems like a fairly typical task, but when dealing with different skill sets and personalities it is not as clear-cut as one might think.
Anybody in the IT field knows that there is a bit of a rivalry at times between Dev and Ops; when something goes wrong, both point fingers at each other. What I learned at this small start-up was that the blame lies with both sides, because neither has a solid understanding of how one affects the other. Ops teams understand infrastructure and hosting the platform, but they have no understanding of how the platform is written and created. Meanwhile, Dev teams understand how to write code, but don’t understand that just because it works on one server doesn’t mean it will work on another, and at times the full deployment process itself is a mystery to them.
I chose to go a different route by deciding to learn the basics of coding, so that I could see from both angles when troubleshooting and problem solving. At the time, this wasn’t normal. Sys admins and infrastructure people didn’t get into Dev, but that is changing, and Google is one of the leading companies when it comes to changing it. They’ve essentially pioneered what many are calling Site Reliability Engineers. These men and women are part coder and part sys admin, so they can see from both sides. This makes troubleshooting more efficient and helps teams work in a more integrated way and understand each other. Google actually wrote a book about this and how their production systems work, called “Site Reliability Engineering: How Google Runs Production Systems”. I believe that if you are in the field, it is a must-read. Check out the explanation video below from Google Students:
