This position is no longer open
Responsibilities
- Design, implement, and maintain high-load, highly available systems with a focus on reliability, scalability, and performance (finding and eliminating bottlenecks)
- Collaborate with teams to develop and optimize processes, tools, and automation for efficient incident response, monitoring, and continuous improvement
- Integrate reliability best practices into key product components and infrastructure
- Engage in the company’s developer and infrastructure community, promoting the adoption of best practices in observability, performance tuning, and automation
- Actively participate in incident resolution, root cause analysis, and post-mortem processes, driving improvements to system reliability, fault tolerance, and operational efficiency across the company
- On-call duty
Qualifications
- Deep knowledge and understanding of Linux OS (we use Ubuntu, Amazon Linux)
- Knowledge of and experience with Kubernetes clusters
- Experience with public and private clouds (AWS, GCP)
- Deep knowledge of Docker / containerd containerization technologies
- Experience in building CI/CD pipelines
- Experience in using Ansible / SaltStack and writing roles / playbooks / states / pillars
- Ability to independently break tasks down and see them through to completion
- Ability to communicate directly with developers
Conditions & Benefits
- Stable salary, official employment
- Health insurance
- Hybrid work mode and flexible schedule
- Relocation package offered for candidates from other regions
- Access to professional counseling services including psychological, financial, and legal support
- Discount club membership
- Diverse internal training programs
- Partially or fully paid additional training courses
- All necessary work equipment