Site Reliability Engineer - SRE Job in Dallas


Company Size: <50
Company Type: Services
Exp Level: Junior
Job Type: Full-Time
Language: English

Visa sponsorship

Requirements

Must:

- 3-5 years of relevant experience in site reliability, infrastructure, or DevOps engineering
- Strong proficiency in monitoring and observability tools such as Dynatrace, Grafana, Prometheus, Splunk, or equivalent
- Experience with incident management systems and event correlation platforms like BigPanda, ServiceNow, or Moogsoft
- Solid knowledge of Linux/Unix systems (RHEL) and Windows Server environments
- Practical experience with cloud platforms including AWS, Azure, or OpenShift
- Expertise in containerization and orchestration technologies: Kubernetes, Docker, OpenShift
- Familiarity with chaos engineering and fault injection frameworks (e.g., Litmus, Gremlin, AWS FIS, Azure Chaos Studio)
- Comprehensive understanding of networking concepts, database systems (Oracle, SQL), and distributed architectures
- Experience with event streaming platforms (Kafka) and service mesh technologies (Istio)
- Acquainted with mainframe systems and legacy infrastructure
- Familiar with infrastructure as code practices and automation tools
- Knowledge of job scheduling systems (CA7 or similar) and middleware tech
- Proficient in tools like Jira, Confluence, and ITSM platforms
- Preferably experienced in highly regulated sectors, such as financial services
- Relevant certifications are valued, including AWS/Azure architecture, RHCE, VCP, and Kubernetes (CKA/CKAD)
- Strong analytical skills with effective problem-solving and troubleshooting abilities
- Excellent verbal and written communication skills for collaboration across teams

Technologies

AWS

Lambda

Azure

Cloud

Confluence

Datadog

Dynatrace

Istio

ITSM

Responsibilities

- Coordinate responses to critical events alongside application support teams and the Site Reliability Center
- Triage and address alerts generated through the BigPanda event correlation platform
- Evaluate cross-domain impacts and collaborate with or escalate to appropriate support teams as necessary
- Participate in on-call rotations for 24/7 coverage of vital systems
- Conduct blameless post-mortems and root cause analyses to promote continuous improvement
- Design and implement automated monitoring and alerting solutions using Dynatrace, Grafana, Logscale, CrowdStrike, Prometheus, Splunk, Moogsoft, and Datadog
- Develop comprehensive dashboards and enforce SLAs/SLOs through robust monitoring practices
- Analyze operational metrics from systems and applications for performance tuning and fault identification
- Implement chaos engineering techniques utilizing Litmus, Gremlin, Azure Chaos Studio, and Chaos Mesh
- Design fault injection experiments to validate system resilience via AWS Resilience Hub
- Establish self-healing capabilities and automated remediation workflows
- Conduct health checks and deploy autoscaling solutions using AWS Lambda, Kubernetes, OpenShift, and Istio service mesh
- Manage infrastructure across mainframe, Windows, RHEL, and cloud platforms (AWS, Azure, OpenShift)
- Work with containerized environments, event streaming platforms (Kafka), and database systems (Oracle, SQL)
- Oversee virtualization infrastructure (VMware) and storage solutions (NAS)
- Use ServiceNow for incident management, Jira for issue tracking, and CA7 for job scheduling
- Identify opportunities for enhancing application stability and promote SRE best practices
- Maintain thorough knowledge bases and runbooks in Confluence
- Mentor junior team members on resiliency strategies and operational excellence

Description

At Ellofant, we are a progressive consulting firm dedicated to empowering organizations to navigate transformation and growth. We pride ourselves on our straightforward approach, emphasizing effective outcomes over unnecessary complexity. Our clients depend on us for more than just strategic insights; we aid in constructing systems, launching innovative products, and achieving meaningful results. We are searching for inquisitive and ambitious individuals eager to tackle real challenges with tangible impacts. Join us, and you may find your ideal next step in a vibrant environment enriched by the dynamic tech and financial services landscape of Dallas. We offer competitive compensation along with attractive benefits, including comprehensive health coverage, retirement savings options, and paid time off, ensuring our team members thrive both professionally and personally.




How many DevOps jobs are in the United States?

Currently, there are 774 DevOps openings.

Is the US a good place for DevOps?

The US is one of the best countries to work in as a DevOps engineer. It has a vibrant startup community, growing tech hubs and, most importantly, plenty of interesting jobs for people who work in tech.

Which companies are hiring for DevOps jobs in the United States?

D3 Security Management Systems, Nurse Next Door, Snaplii, LYNKED Inc., Clarence Farm Services Ltd., DataAnnotation, and Studio 3 Marketing, among others, are currently hiring for DevOps roles in the United States.

The company with the most openings is Peraton, which is hiring for 43 different DevOps jobs in the United States. They are probably quite committed to finding good DevOps engineers.