Bandwidth Site Reliability Engineer in Raleigh, North Carolina, United States

Site Reliability Engineer (Raleigh, NC) Duties: Work closely with leadership and internal partners to ensure that software meets security, SLA, performance, and capacity requirements. Set up and maintain monitoring tools and systems to detect issues using Datadog monitors, with alerting through Opsgenie. Configure Datadog and Grafana alerts and application health monitors to notify the team when anomalies or problems occur. Work closely with other Site Reliability Engineers, DevOps Engineers, and System Administrators to achieve common goals. Analyze system performance data using Snowflake to plan for capacity upgrades or optimizations. Ensure the system can handle expected growth in traffic and data by using these tools to track application lag and behavior. Manage Kubernetes clusters and OpenShift environments for deploying and scaling containerized applications. Implement and manage infrastructure using Ansible, and maintain version-controlled infrastructure code in GitLab for consistency and repeatability. Use Terraform and Ansible scripts to define and provision infrastructure resources in a repeatable and automated manner. Create and maintain Ansible playbooks to automate routine tasks, configurations, and deployments. Use GitHub Actions for CI/CD to continuously build and deploy code, and implement CI/CD pipelines to streamline application updates. Build and maintain deployment pipelines using Ansible playbooks, ensure smooth and reliable deployments and rollback procedures, and create production releases using ServiceNow for record tracking. Maintain detailed documentation on system architecture, configurations, and processes using Confluence, and share knowledge and best practices with team members. Plan resource allocation using Red Hat OpenShift, including servers, storage, and network capacity, following the Kubernetes architecture to ensure the system is equipped to handle traffic spikes and growth. Develop and test disaster recovery plans to ensure data and service availability in case of major failures or disasters, building the supporting tools in Go. Work closely with development teams to promote a DevOps culture and ensure reliability is built into software from the start by following best practices. Collaborate with other Site Reliability Engineers on a weekly basis to share knowledge, solve complex problems, and touch base on all open points. Monitor and manage cloud resource costs in AWS to optimize spending while maintaining performance.

Required: Master’s degree or foreign equivalent in Computer Science, Electrical Engineering, or related field of study plus 2 years of experience in the job offered or related position. Must have 2 years of experience with: Infrastructure and networking concepts including virtualization, load balancing, and DNS. At least one of the following cloud infrastructure technologies: AWS, Google Cloud, Azure. REST APIs using at least one of the following: JSON, XML, YAML. Designing, building, and operating large-scale production systems. Continuous Integration and Continuous Deployment (CI/CD) concepts and technologies using at least one of the following: Jenkins, GHA, Circle. Containerization technologies (Docker, Docker Compose, Docker Swarm, Kubernetes). Configuration and management techniques in large distributed environments. Monitoring and observability techniques with at least one of the following tools: Datadog, Sensu, New Relic, Nagios. General use of open-source databases: MySQL, Postgres, Redis, Cassandra. Unix/Linux administration, troubleshooting, and shell scripting. At least one of the following programming languages: Python, Java, Go, Rust, or similar. Source control (Git, GitHub) and feature branching strategies. Automating infrastructure, testing, and deployment using tools such as Ansible, Chef, or Terraform. Infrastructure as Code paradigm.

In the alternative, will accept a Bachelor’s degree or foreign equivalent in Computer Science, Electrical Engineering, or related field of study plus 5 years of experience in the job offered or related position. Must have 2 years of experience with: Infrastructure and networking concepts including virtualization, load balancing, and DNS. At least one of the following cloud infrastructure technologies: AWS, Google Cloud, Azure. REST APIs using at least one of the following: JSON, XML, YAML. Designing, building, and operating large-scale production systems. Continuous Integration and Continuous Deployment (CI/CD) concepts and technologies using at least one of the following: Jenkins, GHA, Circle. Containerization technologies (Docker, Docker Compose, Docker Swarm, Kubernetes). Configuration and management techniques in large distributed environments. Monitoring and observability techniques with at least one of the following tools: Datadog, Sensu, New Relic, Nagios. General use of open-source databases: MySQL, Postgres, Redis, Cassandra. Unix/Linux administration, troubleshooting, and shell scripting. At least one of the following programming languages: Python, Java, Go, Rust, or similar. Source control (Git, GitHub) and feature branching strategies. Automating infrastructure, testing, and deployment using tools such as Ansible, Chef, or Terraform. Infrastructure as Code paradigm.

Submit resumes to: Bandwidth, Inc., 2230 Bandmate Way, Raleigh, NC 27607, Attn: Kellie Sigmon, Sr. Manager People Services, or apply at www.bandwidth.com/careers/openings/. Must reference “Site Reliability Engineer” when applying.
