Collabera
Site Reliability Engineer - Grafana/Prometheus
Job Location
Pune, India
Job Description
Job Title: Site Reliability Engineer
Location: Pune, India (Hybrid, 2-3 days a week onsite)

We are looking for a Senior Site Reliability Engineer (SRE) to join our growing team. In this role, you will focus on building and maintaining highly reliable, scalable, and efficient systems. You will implement best practices around monitoring, observability, automation, and performance testing, while working with various teams to ensure our services run smoothly and efficiently.

Key Skills Required
Primary Skills: Prometheus, Grafana, Splunk, Datadog, Observability tools, SRE, CI/CD, Jenkins, Data pipelines, JMeter/K6/Gatling/LoadRunner.
Secondary Skills: Cloud platforms (AWS, GCP), Kubernetes, Monitoring, Alerting, Infrastructure as Code (IaC).

Key Responsibilities
- Site Reliability Engineering: Implement SRE best practices, focusing on ensuring system reliability, scalability, and performance across production environments.
- Monitoring & Observability: Lead efforts around setting up and optimizing observability tools (Prometheus, Grafana, Splunk, Datadog) to proactively monitor system health and ensure issues are identified and addressed promptly.
- CI/CD Pipelines: Manage and enhance CI/CD workflows using Jenkins to improve efficiency, reliability, and scalability of the development-to-production pipeline.
- Performance Testing: Lead the use of performance testing tools like JMeter, K6, Gatling, or LoadRunner to simulate real-world traffic and identify performance bottlenecks.
- Data Pipelines: Work with data engineering teams to design, build, and maintain reliable and scalable data pipelines for efficient processing and analysis.
- Cloud Infrastructure: Manage cloud infrastructure (AWS, GCP) and ensure high availability, reliability, and cost-efficiency.
- Kubernetes & Containerization: Work with Kubernetes to deploy, manage, and optimize containerized applications at scale, ensuring system stability and scalability.
- Alerting & Incident Management: Design and configure effective monitoring, alerting, and incident response processes to quickly identify and address system issues.
- Automation & Infrastructure as Code (IaC): Automate infrastructure provisioning and management using IaC tools like Terraform, CloudFormation, or Ansible.
- Collaboration & Leadership: Work closely with cross-functional teams to define and implement system reliability goals, guide junior engineers, and share best practices.

Primary Skills
- Prometheus & Grafana: Extensive experience with Prometheus for monitoring and Grafana for creating insightful dashboards to visualize key metrics and system health.
- Splunk: Proficient in using Splunk for log aggregation, troubleshooting, and identifying anomalies across systems.
- Datadog: Experience with Datadog to provide full-stack monitoring, log management, and application performance monitoring.
- Observability Tools: Strong background in utilizing various observability tools and setting up end-to-end monitoring for distributed systems.
- SRE Best Practices: In-depth understanding of Site Reliability Engineering principles, including SLIs, SLOs, and error budgets, to maintain system reliability.
- CI/CD with Jenkins: Hands-on experience with Jenkins for automating build, test, and deployment processes.
- Data Pipelines: Experience in designing and managing robust data pipelines for efficient processing and real-time data flow.
- Performance Testing Tools: Expertise with tools like JMeter, K6, Gatling, or LoadRunner for performance testing and identifying areas for system optimization.

(ref:hirist.tech)
Location: Pune, IN
Posted Date: 1/22/2025
Contact Information
Contact: Human Resources, Collabera