Outsmarting Outages: Bloomberg Banks on SRE for Reliability
To boost software and systems reliability, the financial data and media giant is deploying teams of site reliability engineers across its workforce.
Stig Sorensen is a man on a mission. As Bloomberg's executive sponsor of site reliability engineering (SRE), he wants to change the way that the financial and media giant builds its software and systems to keep them reliable, scalable and secure.
It's a complicated task. By necessity, Bloomberg's systems are becoming ever more complex in order to handle its massive data center's needs. Each day, the company processes 100 billion market data messages about a host of complex financial instruments, such as stocks, bonds, commodities and foreign exchanges. And it ingests more than two million news stories from 125,000 curated news sources. By comparison, Twitter handles about 500 million tweets per day, says Sorensen.
As the data deluge grows, it's Sorensen's responsibility to ensure that the complicated software that manages it (as well as its underlying infrastructure) stays reliable while keeping up with demand.
Bloomberg tapped Sorensen to guide the company's SRE push in 2016. Before that, he logged 12 years as an engineer focused on creating reliable applications software. His team, the product visibility group, is part of Bloomberg's internal infrastructure team, which provides platforms and services to developers throughout the company.
Operations as an Engineering Problem
As part of his mission, Sorensen must help Bloomberg's technology teams adopt an SRE approach to software and systems development. That means automating platforms and management systems to ensure reliability.
The problem for Bloomberg is that the company's infrastructure staff has traditionally managed those platforms manually. It's a challenge shared by enterprises with high-octane operations like Google that have adopted SRE.
"Sysadmins historically went machine to machine and configured them by hand," Sorensen recalls. It was a constant effort just to keep the engines running, without any pause in the daily grind to consider how to make the platforms run better.
That approach worked 10 years ago, when Bloomberg operators controlled tens of thousands of physical servers. It even worked when the company added an equal number of virtual servers in the last few years—but no longer.
Today, Bloomberg is replacing virtual servers with containers. While virtual servers are a digital copy of physical servers, mirroring all of their hardware and software, a container is a digital copy of a single application, taking up far fewer resources and supporting functions at a highly granular level. Sorensen expects the company's platforms to add 10 to 100 times as many containers as virtual servers during the coming years.
But during the transition to containers, Bloomberg will use all three kinds of servers in parallel, and tech workers must automate and standardize their management to cope with the demands of all three. The SRE initiative enables this adjustment by encouraging operations staff to step back and treat infrastructure like a long-term engineering problem.
"You now have tools to automate these things in practice. You can scale yourself far better," says Sorensen. "If you have a good script, it won't fail. If you ask a human to do the same thing a thousand times, they will definitely fail."
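To make the "good script" point concrete, here is a minimal sketch of an idempotent configuration step applied across a simulated fleet of a thousand machines. It is an illustration only, not Bloomberg's tooling: the `Host` class, the `ensure_settings` function, the setting names and the `ntp.example.internal` address are all hypothetical, and the hosts live in memory rather than on a network.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a server; its settings are kept in memory."""
    name: str
    settings: dict = field(default_factory=dict)


def ensure_settings(host: Host, desired: dict) -> bool:
    """Bring host.settings in line with desired; return True if anything changed."""
    changed = False
    for key, value in desired.items():
        if host.settings.get(key) != value:
            host.settings[key] = value
            changed = True
    return changed


# Hypothetical desired state, shared by every machine in the fleet.
DESIRED = {"ntp_server": "ntp.example.internal", "log_level": "info"}

fleet = [Host(f"host-{i:04d}") for i in range(1000)]
fleet[7].settings["log_level"] = "debug"  # simulate one hand-edited machine

# The first pass fixes every host; the second pass finds nothing left to do.
first_pass = sum(ensure_settings(h, DESIRED) for h in fleet)
second_pass = sum(ensure_settings(h, DESIRED) for h in fleet)
print(first_pass, second_pass)
```

Because the function compares the current state with the desired state before changing anything, running it a second time is harmless. That property is what lets the same step be repeated across thousands of machines without the slips a person would make.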
The SRE approach is team based, with four to five people per team. Roughly 40 SRE teams operate across the company today, providing the cross-platform service used in many Bloomberg applications. Six to nine teams handle the market data pipeline (currently the main focus of Sorensen's group), as well as administrative services, software development, virtual infrastructure creation and management, and enterprise-wide digital search services.
Good Pessimists Are Hard to Find
Spreading SRE culture is an uphill struggle, and it's still a patchy, ad hoc, and ultimately decentralized effort, Sorensen admits.
There is no official SRE training or certification program, and team members must learn the principles of site reliability engineering on the job. Sorensen says that he hopes to introduce an SRE learning protocol in the future, but he adds that, in practice, an SRE approach is mostly a matter of attitude. "We have basically taken engineers with varied skillsets and backgrounds (application developers, network engineers and hardware engineers) and trained them on the job in the best practices of SRE," he says.
When teams recruit new members, either from within Bloomberg's ranks or externally, Sorensen advises them to look beyond CVs and seek out candidates with an aptitude for problem solving. And while most managers look for positive thinkers, he jokes that he keeps an eye out for just the opposite.
"An SRE should be inherently pessimistic and assume that everything will break all the time," he says. "If you come in with that mindset and that's how you try to build systems, then you're doing it the right way."
But finding those creative and strategic pessimists is no easy task today.
"We have 5,500 engineers and we're trying to hire SREs as fast as we can," Sorensen says. Of his 40 SRE teams working throughout the company, roughly 20 are hoping to bring on more engineers as soon as possible.
This growing demand is an early indicator of his mission's success: As SRE-focused automation takes root, Bloomberg's initiative is flowering into a company-wide cultural transformation.