SR. SYSTEMS ENGINEER - RELIABILITY ENGINEERING

AutoZone is seeking a seasoned Site Reliability Engineer for our highly critical Reliability Engineering team. As a member of this team, you will work in a world-class group respected for its innovation, execution, and collaboration in support of AutoZone’s global operations. The team ensures continuity of all critical enterprise service offerings and operates 24/7/365.

The Reliability Engineering team is expected to provide a reliable service with an enterprise-level SLA and must strive for 100% customer support satisfaction. The primary objective of this team is to ensure that critical applications and services are available and working as expected for customers and subscribers. The secondary objective is to develop and improve existing service monitoring tools through additional integrations, automation, and collaboration.

Role Responsibility

The ideal candidate serves as the focal point for the success of our enterprise-class offerings. In addition to working on complex issues that require in-depth analysis of situations or data, this role is expected to help level up the technical skills of the team, develop tools and automation, and assist services with automated problem resolution. The ideal candidate will have a strong technical background in Linux, Azure, GCP, Windows, and vSphere, as well as development skills in a modern scripting language such as Python, Ruby, or Go.

Required Experience

• Strong communication skills with an ability to relay incident details expeditiously, concisely, and accurately
• Highly motivated, quality-conscious self-starter who requires little to no supervision
• Investigates and resolves customer complaints and inquiries (both emergency and non-emergency)
• Identify, receive, triage and act upon events and incidents coming from various SaaS services
• Consistently meets or exceeds established Command Center key performance indicators (KPIs)
• Work according to established escalation, notification, and incident practices
• Monitor the availability of the CI/CD environments
• Working under pressure in production environments running production customer workloads and services
• Previous knowledge of, or a strong desire to learn about, crisis management issues
• Ability to work with geographically dispersed teams as part of a worldwide operations team
• Success in this role requires very strong technical and communication skills, a broad background in and understanding of every layer of software development and SaaS ecosystems, and the ability to identify an issue and escalate it to the correct DevOps team. The ability to work independently and as part of a specialized team in a diverse environment is a requirement.


Qualifications

• At least 3 years of experience working with production SaaS/Cloud based systems.
• Minimum 2 years of experience with Unix/Linux operating systems and Docker
• Strong organizational and multitasking skills.
• Highly responsive and proactive, able to own tasks from start to finish.
• Maintain high level of attention to detail.
• Experience in one or more of the following languages: Python, Java, Go, and/or Node.js
• Experience working with escalation applications
• Experience working with communication tools
• Excellent written and verbal communication skills
• Must be highly self-motivated and customer-centric
• Experience working with internal or external notification tools: Statuspage.io, status.io
• Experience building and improving integrations across tools and components to create a world-class monitoring platform
• Experience administering Linux systems in a production environment
• Domain knowledge of systems management and ITIL is strongly desired.
• BS degree in Computer Science or a related field






Sr. Systems Engineer - Reliability Engineering Located At TN, Memphis - Job Post By: AutoZone, Inc.





