Senior Site Reliability Engineer
Cloud and Platform Engineering
Netflix is the world's leading streaming video service, and our
growth is accelerating. At Netflix, we are building new cloud management
tools, pushing the limits of cloud-based technologies, and powering our
explosive growth while at the same time improving the availability and
reliability of our services.
In this role, your mission is to improve the availability of our
distributed, cloud-based services. You can accomplish this by:
- Building automated alerting and visibility tools for Netflix Engineering teams
- Being the call leader for a service with millions of customers
- Working with individual service teams to adopt best practices for improving availability
- Extending the Simian Army (http://techblog.netflix.com/2011/07/netflix-simian-army.html)
- Inventing new best practices within our environment
About you:
You have been part of an operations or software engineering group
that cared about getting that extra 9 of availability. You are able to
jump on top of an outage, see it through to resolution, then ask the
right questions to prevent the problem from recurring. You believe that
automation is the only way to scale out a service and that any manual
effort needs to be scrutinized, even if it is a 'one-off'.
While we proactively seek out candidates who are familiar with our
current stack, we care more about hiring people who can learn new
technologies and adapt quickly.
Technologies we use:
- Linux on Amazon Web Services
- Git and Jenkins
- Python and Groovy for tool building
- Cassandra for scalable persistence
- More metrics than you can shake a stick at!