In our cloud environment, we provision a large number of instances (a.k.a. clusters) of our core product, Conductor, for our users. We do this across multiple cloud providers and on customer premises (both cloud-based and data center-based). This is incredibly challenging DevOps work, and some people might call it a DevOps nightmare. Most vendors of cloud-based solutions choose a multi-tenant, fully managed offering, which is much easier to operate. We chose our deployment models mainly because that’s what our customers wanted, and we wanted to be flexible enough to meet different needs.
One of the things we provide is DNS-based access to each provisioned cluster. Sometimes this uses the customer’s own DNS name, and other times we offer one of the many domains we own. Access to every cluster is protected by TLS, as is the norm these days, and for this we use wildcard certificates. That is the topic of this article.
Creating a wildcard certificate is straightforward, especially with providers such as Let's Encrypt and the tooling around them, such as cert-manager from Jetstack. Huge kudos and thanks to the teams behind these tools; without them, life would be a lot more difficult and expensive.
Given our distributed architecture and countless k8s clusters across regions, cloud providers, and premises, we ran into a question. When deploying wildcard certificates, what's the best way to ensure it's replicated across all the instances? There were two simple choices:
Option 2 sounded a lot more challenging than option 1, so we went with option 1: have a single certificate managed in a single k8s cluster, then copy it to wherever it is needed. Because Let's Encrypt certificates are valid for 90 days, we had to do this at least once every three months.
First, before we even schedule these copy processes, how can we detect that a certificate is about to expire? And how do we do that across the thousands of URLs we have provisioned? We decided to use the scheduler feature of Orkes Conductor, which is the primary tooling for much of our work. We scan each of our URLs once per day using our scheduling function and report to Slack if a certificate expires within 15 days. Ideally, we renew before then and no alerts fire, but this is our safety net.
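A minimal sketch of what one such daily check could look like, using only the Python standard library. This is an illustration, not the actual Conductor task: the function names and the structure are assumptions, and the Slack notification is left out. Only the 15-day window comes from the process described above.

```python
import socket
import ssl
from datetime import datetime, timezone

ALERT_WINDOW_DAYS = 15  # alert threshold described in the article


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the served certificate's notAfter field, e.g. 'Jun 11 00:00:00 2025 GMT'.

    Hypothetical helper. With the default context, an already-expired or
    otherwise invalid certificate raises ssl.SSLCertVerificationError,
    which a caller should treat as an alert too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's expiry."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_alert(not_after: str, now: datetime,
                window_days: int = ALERT_WINDOW_DAYS) -> bool:
    """True when the certificate expires within the alert window."""
    return days_until_expiry(not_after, now) <= window_days
```

A scheduled job would call `fetch_not_after` for each URL and post a message whenever `needs_alert` returns true. Keeping the date arithmetic separate from the network call makes the threshold logic easy to test offline.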
Here is a preview of our schedules:
As you can see, we have a bunch of checks scheduled to run at 5:35 p.m. on the day that I am writing this article. If you want a scalable scheduling solution, please check out our product at https://orkes.io/platform or reach out to me, and I can do a walkthrough. The scheduler is capable of running millions of schedules, so scale is not a challenge.
Now that we have a way to scan for certificates about to fail, the next challenge is to copy over the certificate every time it renews. Here is how we imagined it should work:
This is exactly how a DevOps engineer might do it manually, verifying at the end that the copied certificate has an updated expiry.
We obviously didn’t want to do this manually every three months! Manual work is error-prone and tedious. We asked Orkes Conductor to help orchestrate this!
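One copy-and-verify pass could be sketched as follows. This is an assumption-heavy illustration rather than the process we actually run: it assumes the certificate is stored as a Kubernetes TLS Secret, that `kubectl` contexts exist for both the source and target clusters, and all function names are hypothetical.

```python
import copy
import json
import ssl
import subprocess

# Server-populated metadata that should not be carried into another cluster.
CLUSTER_SPECIFIC_METADATA = (
    "resourceVersion", "uid", "creationTimestamp",
    "ownerReferences", "managedFields", "selfLink",
)


def portable_secret(secret: dict) -> dict:
    """Return a copy of a Secret manifest without cluster-specific metadata."""
    cleaned = copy.deepcopy(secret)
    metadata = cleaned.get("metadata", {})
    for key in CLUSTER_SPECIFIC_METADATA:
        metadata.pop(key, None)
    return cleaned


def copy_secret(name: str, namespace: str,
                source_context: str, target_context: str) -> None:
    """Read the Secret from the source cluster and apply it to the target."""
    raw = subprocess.run(
        ["kubectl", "--context", source_context, "-n", namespace,
         "get", "secret", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    manifest = json.dumps(portable_secret(json.loads(raw)))
    subprocess.run(
        ["kubectl", "--context", target_context, "-n", namespace,
         "apply", "-f", "-"],
        input=manifest, check=True, text=True,
    )


def expiry_advanced(old_not_after: str, new_not_after: str) -> bool:
    """True when the certificate served after the copy expires later than before."""
    return (ssl.cert_time_to_seconds(new_not_after)
            > ssl.cert_time_to_seconds(old_not_after))
```

The verification step compares the `notAfter` value served by the target before and after the copy, so a copy that silently failed to take effect shows up as an unchanged expiry.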
Here is what the workflow looked like:
It is pretty much exactly the manual copy and validation process, but fully automated. We can now use the scheduler feature to run this as frequently as required and for any number of clusters. We set this up close to a year ago, and for the last 4-5 renewals, things have been running smoothly.
This article was intended to demonstrate how an orchestration platform like Orkes Conductor can simplify tasks like these and save engineering time at scale. Orkes Conductor, built on the battle-tested Netflix Conductor, is an orchestration platform that simplifies workflows, microservices & events. Orkes Conductor is now available on major cloud platforms such as Azure, AWS & GCP.
If you are curious to learn more, please reach out to me, and I am happy to jump on a call to walk through it in detail. Or join our Community Slack.