Principal Site Reliability Engineer Job in Dallas

Company Size: <50
Company Type: Services
Exp Level: Lead
Job Type: Full-Time
Language: English

Visa sponsorship

Requirements

Must:

- Over 12 years of experience in Site Reliability Engineering or Infrastructure Engineering
- At least 5 years in leadership SRE roles, building and scaling SRE teams and processes
- Proven experience in designing and implementing large-scale monitoring and observability solutions
- In-depth knowledge of distributed systems, microservices architectures, and cloud-native patterns
- Proficiency in infrastructure as code, configuration management, and deployment automation
- Hands-on experience with Google Cloud Platform is mandatory
- Expertise in the GCP monitoring and observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace)
- Familiarity with GKE, Compute Engine, Cloud Functions, and other essential GCP services
- Understanding of GCP networking, security, and compliance features
- Knowledge of cost optimization and resource management within GCP
- Strong programming skills in Python, Go, Java, or similar languages
- Experience with monitoring tools (Prometheus, Grafana, Datadog, New Relic, etc.)
- Proficiency in containerization (Docker, Kubernetes) and orchestration tools
- Knowledge of CI/CD pipelines, automated testing, and deployment methodologies
- Understanding of database performance tuning and optimization for both SQL and NoSQL
- Familiarity with AI-driven development tools and methodologies is a significant advantage
- Experience in machine learning applications for AIOps, anomaly detection, or predictive analytics
- Experience with automated incident response and self-healing systems
- Strong analytical and troubleshooting skills for complex distributed systems
- Experience in high-pressure incident response and crisis management
- Detail-oriented with a commitment to operational excellence and continuous improvement
- Comfortable with ambiguity and developing processes in a fast-paced environment
- Passionate about reliability, automation, and engineering best practices
- Proven ability in establishing SRE programs and processes from the outset is a major advantage
- Bachelor's degree in Computer Science, Engineering, or equivalent professional experience
- Industry certifications (e.g., Google Cloud Professional, SRE, or related certifications) are preferred

Technologies

- CI/CD
- Datadog
- Machine Learning
- Marketing
- NoSQL
- Prometheus

Responsibilities

- Establish SRE practices from the ground up, including defining SLIs, SLOs, error budgets, and reliability metrics
- Develop incident response protocols, on-call schedules, and post-mortem procedures
- Create standards and best practices for reliability engineering across engineering teams
- Formulate disaster recovery and business continuity plans
- Design and implement frameworks for capacity planning and performance optimization
- Lead architectural decisions for comprehensive application and infrastructure monitoring solutions
- Develop custom SRE tools for automated monitoring, alerting, and remediation
- Construct observability platforms that provide detailed insights into system performance and user experience
- Build automation frameworks for deployment, scaling, and incident response
- Architect logging, metrics, and tracing systems for distributed microservices ecosystems
- Utilize Google Cloud Platform services to create resilient, scalable infrastructure
- Implement cloud-native monitoring systems using Stackdriver, Cloud Monitoring, and Cloud Logging
- Design systems that auto-scale and self-heal using GKE, Cloud Functions, and managed services
- Optimize cloud costs while ensuring high availability and performance levels
- Establish security and compliance frameworks in GCP environments
- Research and incorporate innovative SRE tools and methodologies
- Utilize AI and machine learning for predictive analytics, anomaly detection, and automated repairs
- Create dashboards and reporting systems that offer actionable insights to engineering and business teams
- Develop feedback loops for ongoing enhancements of reliability and performance
- Remain updated on industry best practices and emerging technologies in the SRE domain

Description

At InfiniteChoice, we are dedicated to transforming the way people discover experiences. We are looking for a Principal Site Reliability Engineer to establish and drive the foundation of our Site Reliability Engineering from the ground up. This is an exciting opportunity to shape our reliability culture and develop custom tools to support a platform that serves millions of users. We offer a fully remote position for US-based candidates, allowing for flexibility and autonomy in defining processes and selecting technologies. Our collaborative environment is filled with bright, passionate engineers committed to building operational excellence. We provide competitive compensation, equity participation, and comprehensive benefits, and we invite you to be part of our journey to disrupt the experience discovery space.
