Site Reliability Engineer
Join us in helping the world’s most admired brands innovate and deliver great customer experiences.
WHY WORK HERE?
With solid leadership, a clear growth path and a wealth of expertise, we foster a collaborative environment and welcome those who want to work with like-minded talent with a modern technology stack. We embrace positive change and open communication. The long-term tenure of our team is a testament to our commitment to the growth of our employees, the success of Message Broadcast and our valued clients.
We are looking for a Site Reliability Engineer. The successful candidate should have a strong aptitude for learning new technologies and the ability to drive complex, meaningful projects to conclusion. Close collaboration with the software and network engineering teams and the ability to thrive under pressure are key to succeeding in this role. This individual should be self-motivated and have a passion for quality.
Operational Performance and Stability
Works with other team members to monitor applications/platforms and ensure they meet performance and stability requirements.
Monitor traffic to multiple data centers to ensure proper balancing, availability, and efficient resource utilization
Monitor real-time system performance against availability and SLA requirements
Analyze trends in hardware resource consumption, network latency, software errors and application logging
Monitors and Metrics
Works with Application Development and Network Engineering to ensure that applications/platforms have the monitoring and metrics in place to measure performance and stability accurately.
Identify key areas where software needs to expose elements for performance measurement and debugging
Identify key areas in network and hardware systems where proper monitoring will quickly identify performance problems
Ensure monitoring provides a holistic view of system health and availability
Configure dashboards to provide quick visibility of problems or trends that might indicate pending performance issues
Ensures that applications/platforms are operationally ready for production.
Ensures that new applications/platforms have adequate monitoring and that monitoring has been tested to expose areas that may impact reliable operations
Review any new feature launch or other significant change that may impact monitoring
Write SOPs/Knowledge Articles for new features and update any affected support documentation
Train Network Operations Center (NOC) and first-level application support staff on new SOPs
Performs post-incident reviews of all major incidents and determines the action items required to avoid similar issues and minimize downtime in future incidents.
Bachelor’s Degree in Computer Science or equivalent and 4 years of relevant work experience
2+ years of SRE/DevOps/infrastructure experience
Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Datadog, Elasticsearch, Graylog, Kibana, Logstash, Nagios, etc.)
Experience with tracing tools (Open Trace, Wireshark)
Experience configuring and monitoring containerized deployments (Docker, Swarm)
Full-stack troubleshooting experience, including networking, operating system (CentOS), Nginx, DNS, and load balancing
Knowledge of Node.js, Redis, MongoDB, RabbitMQ
On-site, full-time position with a reasonably flexible work schedule, once established and approved by manager
Fully covered Medical, Dental, and Vision for employee
14 days PTO
Well-stocked kitchen with energy drinks and other snacks
Onsite gym with showers
Message Broadcast is an Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, or any other characteristic protected by law.
***Message Broadcast does not provide visa sponsorship, transfer or assistance***