Your Role:
Identify, troubleshoot, resolve, and escalate incidents quickly and effectively.
Monitor platform health operationally.
Address platform end-user problems.
Develop tools, operational enhancements, and automated solutions.
Perform root cause analysis and resolve problem patterns, developing automated and self-healing solutions.
Participate in outage conference calls.
Write clear documentation of the environment and operational procedures.
Be part of a 24×7 shift rotation.
Your Qualifications:
Strong sense of ownership, customer service, and integrity with excellent written/verbal communication skills.
Ability to work through complex engineering obstacles using debugging and problem-solving skills.
Deep working knowledge of the Linux operating system, including disk/file system, process, resource, user, package, and service management.
Experience troubleshooting server hardware using OS native tools.
Operational experience with containerized systems and cluster management technologies like Docker and Kubernetes.
Passion for automating repetitive manual processes.
Fluency in scripting languages such as Bash or Python.
Understanding of standard network protocols and components (TLS/SSL, DNS, ECMP, TCP/IP, ICMP, OSI model, subnetting, load balancing).
Experience with logging, telemetry, and monitoring tools like Splunk or Prometheus.
Bachelor's degree or higher.
Plus Points:
Red Hat Certified System Administrator (RHCSA).
Understanding of Linux kernel space, memory, processes, threads, static and shared libraries, interprocess communication, and signals.
Experience managing large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible, or Salt.
Experience with APIs and serialization formats like JSON and YAML.