This Position is Closed
This job is no longer accepting applications.
Principal Database Reliability Engineer
Posted 1 month ago
Employment Type: Full-Time
Work Location: Remote
About This Role
Join Udemy. Help define the future of learning.
Udemy is an AI-powered skills acceleration platform built to help people and teams grow. It’s personalized, practical, and focused on real-world impact.
Our mission is simple: to transform lives through learning. Your work helps people around the world build skills they can use, whether they’re picking up something new or leveling up to stay ahead.
Over 80 million learners and 17,000 businesses already learn with Udemy. If you’re excited by change, energized by learning, and ready to have a real impact, you’ll feel right at home.
Learn more about us on our company page.
About This Role
As part of Udemy's Platform team, the Datastore Infrastructure (DSI) team is responsible for overseeing all aspects of Databases (MySQL, Aurora, DynamoDB), Message Queues (RabbitMQ), Streaming (Kafka), and Caching (Redis, Memcache) in our infrastructure. This includes ensuring uptime, security and compliance, observability, performance, improving developers' productivity and developing future growth strategies. The team is split between EU and US regions. You will play a vital role in overseeing day-to-day activities and engineering strategies of DSI, ensuring that millions of students worldwide achieve greater learning and career outcomes on Udemy. We value teamwork, a good sense of humor, strong ownership, technological curiosity, and a desire to learn.
To be successful in this role, you will collaborate closely with engineering, product, and a diverse set of stakeholders around the world. You are not just interested in maintaining systems but also writing the software that maintains them. You strongly believe in a no-blame culture and advocate for humane on-call practices. You constantly seek opportunities for improvement and thrive in an environment where you can drive positive change.
What You'll Be Doing
Lead improvement projects for our datastores and platform teams to align with the company’s long-term objectives.
Maintain infrastructure uptime, monitor performance, and ensure infrastructure continues scaling as we grow.
Develop immutable infrastructure patterns and automate infrastructure provisioning via code (Terraform, Python, Ansible, etc.).
Ensure adherence to PCI and ISO27001 compliance as well as SOC 2 security requirements, modifying CI/CD processes when necessary, and upholding policies and standards.
Advocate for and implement positive changes in tools and processes through healthy discussions.
Participate in the on-call rotation, demonstrating a systematic approach to incident management.
Participate in day-to-day activities, support requests, and project-related tasks for the team.
Contribute to documentation, maintain ticketing queues, provide project support, troubleshoot, and offer after-hours assistance as required.
Provide coaching and mentorship to new hires, fostering their technical growth and integration into the team. Maintain close communication with team members throughout their tenure.
What You’ll Have
We do not expect you to have all of the below, but the more of these skills you have, the easier your onboarding will be.
8-10 years of professional experience working in a Cloud Engineering team (or an SRE/DBRE team) with infrastructure responsibilities in managing large production workloads.
Proficiency in managing MySQL at scale (horizontal scaling, sharding, InnoDB optimizations, query optimization, HA/DR, monitoring, backup strategy, security, automation).
Strong understanding of running production workloads in Kubernetes.
Proficiency with tools like Terraform, Ansible, and Git, and with Infrastructure as Code and automated provisioning.
Strong experience in Kafka cluster management, topic configuration, performance tuning, and ensuring high availability and fault tolerance. Experience with MSK is a plus.
Experience with message queues (MQ/SQS) and caching (Redis, Memcache) or similar products.
Experience in Python.
Knowledge of configuration management tools, monitoring systems (Datadog or similar) for database infrastructure, and scaling strategies for handling increased data volumes.
Strong troubleshooting skills to diagnose complex database issues.
Hands-on experience with AWS cloud infrastructure and a grasp of security best practices.
Adaptability and comfort working in a fast-paced, hands-on environment.
Nice To Have
Experience with any additional programming languages (Golang, Kotlin, Java).
Experience implementing CDC pipelines for reliable data replication and synchronization.
Experience with Vitess Operator running MySQL on Kubernetes.
Experience writing Kubernetes Helm charts.
Experience with tools like Argo CD/Argo Workflows, or similar alternatives, in various combinations.
Knowledge of security standards, vulnerability patching, TLS/SSL, and related topics.
Any additional experience or familiarity with related technologies would be advantageous.
We understand that not everyone will match each of the above qualifications. However, we also realize that everyone has unique experiences that can add value to our company. Even if you think your background might not perfectly align, we'd love to hear from you!
Posting Date: November 05, 2025
Application Window: November 05, 2025 - December 05, 2025
At Udemy, we strive to be transparent around compensation. Actual compensation for this role is based on several factors, including but not limited to job-related skills, qualifications, experience, and specific work location due to differences in the cost of labor. In addition to a base salary, this role is also eligible for equity.
Hiring Compensation Range: $184,000 to $230,000 USD
Why Work Here?
You’ll grow here.
Learning is part of the job. You’ll get full access to Udemy courses, a monthly UDay to invest in yourself, and a budget to spend on whatever helps you improve. Many people are diving into AI lately, but what you focus on is
