Site Reliability Engineer (SRE)
Company: Merck Gruppe - MSD Sharp & Dohme
Location: Rahway
Posted on: February 18, 2025
Job Description:
Our Company is where we transform vision into
reality. It's where ideas become technologies, and cutting-edge
technologies become solutions for animal care and management.

We support farmers by providing real-time, actionable information to
help them manage their herds. We provide pet owners with smart
devices and data that give them a better understanding of their
pets' activity and health needs, enriching relationships. We help
conservationists safeguard natural environments and wildlife.

Leveraging decades of Technological Research & Development
experience across many markets, technologies and species, along
with development environments and Quality Assurance procedures,
we're always inventing new ways to look after the health and
well-being of animals. Our decades of experience keep us ahead of
the curve, with advanced technological solutions that range from
enhancing the precious bond between people and their pets to
advancing animal healthcare and wildlife preservation.

Job Description:
We are looking for a Site Reliability Engineer (SRE) to
lead and establish the SRE domain within the organization. You will
be responsible for ensuring the reliability, availability, and
performance of our systems and applications. You will collaborate closely
with Software Engineers, DevOps teams, Security teams, and Program
Managers to build and maintain scalable infrastructure, monitor
critical systems, and automate repetitive tasks to improve
efficiency and uptime. Your primary goal is to maintain an optimal
balance between system stability, feature development, and fast
delivery cycles.

Key Responsibilities:
- Monitoring: Monitor AWS infrastructure (Azure experience is an
advantage) against defined KPIs using Datadog or an equivalent tool.
Proven experience defining efficient alerts and synthetic tests,
analyzing logs for error detection, detecting issues with Datadog,
managing SLIs and SLOs, leveraging NOC activity, and defining
flows.
- Architecture Understanding: In-depth understanding of designing
distributed systems and microservices in cloud-based environments.
Understand complex cloud product architectures, including
event-driven architecture, with a focus on how data and messages
flow between services.
- Continuous Improvement & Documentation: Develop and maintain
technical documentation for processes, procedures, and systems;
conduct post-incident reviews and implement preventative measures;
and lead Root Cause Analysis (RCA) and incident management when
issues arise.
- Infrastructure & Cloud: Proven experience with AWS services
such as API Gateway, Lambda Functions, SQS, SNS, S3 Bucket, RDS,
Redis Cache, Kinesis, Global Accelerator, CloudFront, and Route 53,
along with an understanding of the most common cloud services in
production environments and of Infrastructure as Code (IaC) using
Terraform.
- Automation and CI/CD: Experience with Azure DevOps, GitHub
Actions, Argo, GitOps, and artifact management using Artifactory.
Ability to review pipelines and Helm charts (or equivalent) and to
understand automation processes. Familiarity with Crossplane.
- Security (Preferred): Experience reviewing Web Application Firewall
(WAF) rules and applying rate limiting to services and infrastructure,
based on data analysis and in collaboration with DevSecOps.

Personal Requirements:
- Bachelor's degree in computer science or equivalent proven
experience.
- 5+ years in a hands-on DevOps or SRE position.
- Strong communication skills for aligning, documenting, and sharing
knowledge across cross-functional teams are a must.
- Ability to work under high load and to lead sensitive situations and
investigations, especially when customer-facing services are
impacted.
- Strong motivation for continuous learning and adoption of new
technologies, and excellent problem-solving skills with a proactive
approach.

Employee Status: Regular
Flexible Work Arrangements: Hybrid
Preferred Skills: Capacity Management, Change Controls, Configuration
Management (CM), Network Design, Release Management, Software
Development, Software Development Life Cycle (SDLC), Solution
Architecture, System Administration, Systems Integration
Job Posting End Date: 06/1/2025
*A job posting is effective until 11:59:59PM on the day
BEFORE the listed job posting end date. Please ensure you apply to
a job posting no later than the day BEFORE the job posting end
date.
Requisition ID: R317892