Todd Palino Fuses Dev and IT Ops Skills as SRE for LinkedIn
Most businesses might not have a Site Reliability Engineer now, but Palino insists many companies will want one in the near future.
Todd Palino knows a thing or two about bringing seemingly opposing worlds together.
To start, Palino worked as a computer salesperson on the East Coast while in high school and, after studying computer science, navigated the high-tech world of Silicon Valley with ease. In his current role at LinkedIn, he combines his developer skills with his IT Ops experience to help create what is known on the West Coast as a Site Reliability Engineer, or SRE.
While SRE may not be a household (or rather, datacenter) name just yet, Palino is confident IT organizations around the world will soon be clamoring to find the person for their team who can bring together skills that run the gamut of operations, development and leading-edge technology.
In this Q&A, Palino shares his take on the must-have new technology role, why businesses need an SRE and how he makes use of his varied skills to better serve his team at LinkedIn.
Modern Software Factory Hub: What is a Site Reliability Engineer? Explain this role.
Todd Palino: It’s a new title to a lot of people. Site Reliability Engineer is very much a West Coast invention, and if you’re not in the Bay Area, you tend not to know what an SRE is. Or you may have heard of it, but you’re not really sure what an SRE does and how it’s different from a traditional operations role.
There are a lot of different ways to describe it, but I consider it to be a particular discipline of DevOps. Essentially, it combines roles that many of us in the operations field were already filling (architect, tools developer, and operations) into one. An SRE is responsible for all of these things, with a focus on automation and developing the tooling so that you stop doing the reactive work. You’re constantly focusing on the proactive work instead.
MSF Hub: How does your role as SRE help LinkedIn overall?
TP: SRE is the glue that binds the entire organization at LinkedIn together. You have product teams and you have developers, but SREs are the people who know how the pieces all fit together. They create the pipelines that the developers use. You can have developers who want to use DevOps practices, but they can’t do so without a proper toolset. That’s one of the things that SRE brings to the table: developing and maintaining that toolset.
MSF Hub: Does the SRE role lessen the amount of IT firefighting that needs to be done?
TP: Yes, and we need it to, because the systems that we’re working on get larger and larger. I run in excess of 2,000 servers with a team of only three or four engineers in the U.S. That’s my SRE team, and we are running petabytes of data per day through Apache Kafka. We have numerous services that we run, but we’re doing it with only a few people because we have automation to support us.
MSF Hub: What would you say to those who fear automation technology will replace humans and lead to fewer jobs?
TP: Honestly, this is a transition that’s been going on for years now: manual jobs being eliminated by technology and automation. Technology is increasingly taking over roles that were traditionally filled by people, and we’ve seen it in nearly every industry. Manufacturing is a big example, but now we’re seeing it in fast food. Fast food workers are disappearing, and computers are taking over the job. You’re going to see the rideshare business get decimated by self-driving cars. The fact of the matter is, this is all being driven by improvements in technology. For me, this is exciting to see from a personal career standpoint, but it’s driving change in numerous industries.
The way to weather this change is to continually challenge ourselves, and there’s no difference whether we are talking about a manufacturing job or automating a systems administration process. We should not be trying to halt this progress, but rather we should be the ones creating and embracing it. Then we make our work about maintaining the automation, and finding the next challenge.
MSF Hub: How does an SRE align with the technology change happening across industries?
TP: When working in SRE, you must develop the mindset to let things go. A lot of IT ops people with a fixed mindset, often people in long-standing systems administration roles, want to hold on to their processes too tightly. They believe that the manual work they do is critical to the existence of their job, so they can’t let a developer take it over. SREs are not only developers; we also have the operational expertise to develop with a sense of what’s going to work in production. Especially in an embedded SRE role, you are working with developers who don’t necessarily have that experience, and you’re helping to inform their development process by telling them how their application is going to work in production. This comes from the experience we’ve had previously in operations. We also help them to get their applications deployed so they can focus on developing the code and not on figuring out how to use the DevOps toolchain: how to get hardware, how to get resources assigned, how to get firewall rules, and everything else that you need to make the software work.
Video: Todd Palino on How the New Role of Site Reliability Engineer is Redefining Operations in a DevOps World
MSF Hub: Does it take a certain work culture to bring an SRE on board?
TP: Yes, and it was difficult for me at first because I was an East Coaster for a very long time before I joined LinkedIn, and the culture is completely different. The open and honest culture of LinkedIn drives a blameless culture. It drives an environment where you know that when something gets messed up, when you make a mistake, you’re not going to be personally blamed for it. We’re very careful when we’re creating incident tickets: we don’t put names in the comments for those tickets. We know who caused the problem, but we don’t care that they caused the problem. We care about why that problem happened. Were they following a process that wasn’t good? Did they not follow the process? Why didn’t they follow the process? How can we improve that the next time? Can we improve the situation with technology? It really is all about not blaming people and keeping people involved in the discussion.
MSF Hub: How is the success of an SRE or a team of SREs measured?
TP: Everything’s about the data. If you have fewer site issues, then you’re being successful. If you are getting your releases out faster, you’re being successful. These are things that you can measure and they’re things that we do measure. LinkedIn tends to be a very data-driven organization, to the point that we have hourly reports that go out to executives with all kinds of metrics on site growth, stability and many other aspects of the platform. When it comes to measuring whether or not SRE is successful, it’s data as well. Most of it is quantifiable, while some parts, like culture and morale, aren’t as easy to measure.
MSF Hub: Are team members happier or more fulfilled working in this type of culture?
TP: Speaking from my current position with LinkedIn’s SRE team, I would say that on the whole, the company’s SRE organization is quite happy. We have a healthy culture, and we enjoy the work we are doing and the colleagues we are doing it with. Of course, all organizations have turnover, but we choose to celebrate the “next play,” fully supporting teammates who move on to their next challenge, whether it is within LinkedIn or at another company.
It’s very difficult to bring a culture like SRE into entrenched organizations that don’t have some of the building blocks. It really is a matter of having top-down support in the organization for that type of culture. Developer happiness is hard to measure, but you can still source feedback from teams.