The Embarrassing State of Enterprise ACME Support
TL;DR ACME is more than just the protocol used by Let's Encrypt for public web TLS certificates. It can be perfect for internal TLS endpoints in the enterprise. Unfortunately, a lot of enterprise software doesn't support ACME natively, or it only supports Let's Encrypt. And, we encountered some challenges getting open source ACME clients to work with internal CAs.
We want your take on the status of internal certificates. Take our survey.
Let's go back to the 1990s IT scene. Picture the server room, with racks of gear and a CRT monitor on a little cart. One of the tasks of the DevOps person ("system administrator") was to renew the TLS certificates ("SSL certificates") on their load balancers and web servers every year or two.
Because it was such a rare occasion, no one bothered to automate the process. Instead, the system administrator would configure their $5,000 rack-mounted, SSL-terminating Enterprise Load Balancing Appliance (with Integrated Space Heater) to send out notification emails at 90, 45, 30, 7, 2, and 1 days before its SSL certificate was to expire.
One day they would open up Eudora and see the 90 day notification email. And a feeling of dread would arise because they knew they were in for the maddening task of renewing the stupid SSL certificate. The single key that secures almost all external traffic with everyone, everywhere.
Knowing how unsavory this chore was, they would punt. And somewhere around 7 days before expiry, with the pressure mounting, they'd finally get serious: They would search Yahoo! and find a Slashdot thread explaining how to create a Certificate Signing Request (CSR) file using OpenSSL. Then they would upload the CSR on Verisign's website to get it signed, click on a link in a verification email, send Verisign a stack of cash, download their newly-minted certificate, and finally—blessedly—import it, and the private key, onto the appliance.
At this point a toast was in order, because disaster had been averted for one more year.
There were so many problems with this approach. The alert emails were easily lost and could get inadvertently disabled by infrastructure upgrades. When the time came to renew the certificate, no one could remember the magical openssl incantation that creates the proper CSR and private key in the right file formats. And because renewal was such a hassle, it incentivized having the longest duration certificate possible.
By the early 2000s, Certificate Lifecycle Management became a bigger area of concern in the enterprise world, a Concern Worthy of Capitalization. And protocols emerged to facilitate Certificate Management. Simple Certificate Enrollment Protocol (SCEP) [rfc8894] and, more recently, Enrollment Over Secure Transport (EST) [rfc7030] evolved through the 2000s and 2010s. SCEP and EST use shared secrets or certificates to authenticate and fulfill CSRs. You can think of EST as, roughly, SCEP with TLS. Both protocols are still widely used today, along with some other more obscure protocols.
Here's an example of where EST shines: Let's say you need long-lived certificates injected into IoT devices on your manufacturing line. For this kind of enrollment via EST, you can configure EST on a root CA to accept CSRs that are signed by a different CA that sits on the manufacturing line. Unfortunately, because SCEP and EST work with all sorts of X.509 certificates (e.g., for VPNs, people, WiFi authentication, etc.), they aren't great for the specific challenge of issuing TLS certificates for domain names in the Web PKI. So, SCEP and EST never really solved the original problem presented in this story, and over time people developed all sorts of ad hoc methods for renewing their web certificates.
Fast forward to 2020. A few things have changed:
- No one uses Eudora to read their email anymore.
- The Automatic Certificate Management Environment (ACME) protocol [rfc8555] (used by Let's Encrypt) was created in 2016 and it has taken the Web PKI by storm, offering free TLS certificates for any domain name.
Smallstep is an annual sponsor of Let's Encrypt, and we're donating Let's Encrypt swag to a few lucky winners! See below.
- The use of HTTPS grew from 30% of webpages to 84% in just four years. Automation FTW!
- The 90-day validity period of free TLS certificates has forced everyone to consider automation, and it has improved key hygiene.
- Yet, some enterprise software still has a certificate expiry email reminder feature!
- We started calling system administration "DevOps" for some reason.
So, does this mean we've finally solved the original enterprise TLS certificate problem from the 1990s? Well...it's complicated.
Before we get into the details of ACME in the enterprise, here's a quick overview of how ACME works. ACME is a JSON API that runs mostly over HTTPS. To get a certificate issued by an ACME server, a client must prove that it controls the requested domain name(s). It does this by responding to ACME challenges from the server.
A typical ACME challenge flow looks like this:
- The ACME client generates a Certificate Signing Request (CSR) and a private key. It contacts the ACME server and requests a certificate for the intended domain name.
- To verify that the client owns the domain name, the ACME server responds with one or more challenges. The challenges are just random values. There are three challenge types that the client can use to authenticate its CSR with the CA:
  - http-01: the client places the challenge value at a well-known URL on an HTTP server at a domain named in the certificate request.
  - dns-01: the client creates a DNS TXT record that matches the challenge value, confirming that the client has control over DNS for a domain named in the certificate request.
  - tls-alpn-01: the client adds the challenge value to the initial TLS handshake (using the Application-Layer Protocol Negotiation (ALPN) TLS extension) of a server answering at a domain named in the certificate request.
Once challenges have been met for each DNS name listed on the certificate, the client can retrieve its signed certificate from the server.
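To make the challenge types a little more concrete, here is a minimal Python sketch of how a client derives its http-01 and dns-01 responses. The derivation follows RFC 8555 (key authorization, http-01, dns-01) and RFC 7638 (JWK thumbprint) rather than anything specific to this post. The function names are ours, and the account key and token in the usage note are made-up placeholders. tls-alpn-01 is left out because it requires building a self-signed certificate.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding ACME uses throughout."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required JWK members,
    serialized with sorted keys and no whitespace."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """The server's random token joined to the account key thumbprint."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"


def http01_response(token: str, account_jwk: dict) -> tuple:
    """Return (URL path, body) that the client must serve over HTTP."""
    path = f"/.well-known/acme-challenge/{token}"
    return path, key_authorization(token, account_jwk)


def dns01_record(domain: str, token: str, account_jwk: dict) -> tuple:
    """Return (record name, TXT value) that the client must publish in DNS."""
    digest = hashlib.sha256(
        key_authorization(token, account_jwk).encode("ascii")
    ).digest()
    return f"_acme-challenge.{domain}", b64url(digest)
```

With a placeholder key such as `{"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}` and a token of `"tok123"`, `http01_response` returns the path `/.well-known/acme-challenge/tok123`, and `dns01_record("internal.example.com", ...)` returns the record name `_acme-challenge.internal.example.com`. Nothing here depends on the public internet, which is part of why the same flow works against an internal CA.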
ACME isn't just for use in the Web PKI. In fact, it's perfect for internal CAs. And it's sorely needed. Because, 25 years after the invention of SSL, most enterprise environments still use a mess of ad hoc systems to solve their TLS certificate management problems.
I surveyed the most popular hardware and software load balancers in 2020. All of them offer their own custom APIs for certificate management, and none of them have built-in ACME support for internal CAs:
| Product | ACME support |
|---|---|
| F5 BIG-IP | No native ACME support |
| Citrix ADC | No native ACME support |
| Kemp | No native ACME support |
| Barracuda WAF | Hardcoded to Let's Encrypt |
| Oracle Load Balancer | No native ACME support |
| NGINX Plus | No native ACME support |
| Zevenet | Ships with certbot + some glue code |
| pfSense | Hardcoded to Let's Encrypt |
| Cisco Expressway-E | Hardcoded to Let's Encrypt |
| cPanel | Hardcoded to Let's Encrypt or Sectigo |
I didn't include AWS, Google, and Cloudflare load balancers because those companies provide automated public certificate management using their own CAs.
Coupled with an external server and some glue code, it's possible to use ACME with any of these products. But having to set up and maintain an external server running an ACME client like certbot.sh, just to get automated certificate renewals, is not ideal. ACME works best when the ACME client is built right into the service using the certificate. This minimizes the movement of private keys, allows certificates to be replaced without service interruptions, and keeps maintenance simple.
Now let's take a look at some common ACME clients and see how well they support an internal CA.
An enterprise-grade ACME client needs to have the following bare minimum features:
- Allow the user to supply an internal ACME CA URL.
- Make it easy for the ACME client to trust the internal CA's root certificate.
- Allow for a configurable fallback CA URL, just in case.
- Don't assume a 90 day certificate lifetime. Internal CAs often issue short-lived (24 hour or less) certificates. For automated renewal, renew the certificate when it has reached 2/3 of its lifetime, not 60 days.
- For dns-01 challenges, don't hardcode public DNS server IPs or assume public DNS propagation. Allow the DNS resolver and propagation timer length to be configured. The dns-01 challenge type only requires that the CA and client resolvers share common DNS servers; it doesn't depend on public DNS propagation.
- Finally, don't assume the ACME CA is using the same trust store as your DNS APIs. The ACME CA may have an internally-issued certificate, while DNS APIs are usually part of cloud APIs that use Web PKI certificates.
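The renewal rule in the list above is easy to express in code. Here's a minimal Python sketch, assuming the client already has the certificate's notBefore and notAfter timestamps in hand. The function name and the `fraction` parameter are ours, for illustration only.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(
    not_before: datetime, not_after: datetime, fraction: float = 2 / 3
) -> datetime:
    """Return the moment a certificate should be renewed.

    The renewal point is a fraction of the certificate's own lifetime,
    so the same rule works for 90-day and 24-hour certificates alike.
    """
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    lifetime = not_after - not_before
    return not_before + lifetime * fraction


# A 24-hour certificate issued at midnight UTC is due for renewal at 16:00.
issued = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(renewal_time(issued, issued + timedelta(hours=24)))
```

For a 90-day certificate the same function lands on day 60, which matches the common public-CA practice. A client that hardcodes "60 days" or "30 days before expiry" gets the 90-day case right and the short-lived case wrong.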
Here are some nice-to-have features:
- Support for ECDSA certificate chains. ECDSA chains are common for internal CAs, and Let's Encrypt is moving toward ECDSA for the Web PKI.
- Support ACME External Account Binding (EAB).
- Support ACME verifications on IP addresses [rfc8738]. As an enterprise user of step-ca wrote: "It would be useful to get rid of certificate errors for sites when accessed via their local ip address. In our organization, we have very little standardization of hostnames (mostly due to acquisitions of other companies and systems) so often we just remember the IPs."
Further Reading: The excellent Best Practices for ACME Client Operations (GitHub repo). This document covers a much broader range of ACME use cases, both public and private, filling in some of the gaps between the RFC and the real world.
While doing research around our new ACME registration authority, I set up a private ACME CA and spent a couple of days getting various ACME clients to enroll and renew certificates using my CA and a private DNS zone in Google Cloud DNS. My CA uses an ECDSA chain. Here's what I found:
I've included both native ACME issuers, like Certbot, and some services that support ACME natively, like Caddy.
| ACME client | Findings |
|---|---|
| Kubernetes cert-manager ACME issuer | I spent a day rebuilding cert-manager to get it to trust my CA. Worked great after that. |
| Certbot, using http-01 challenge | Certbot defaults to renewing certificates 30 days before they expire rather than looking at the validity period of the certificate; I had to manually change this to 8 hours. |
| Certbot, using dns-01 challenge | Once I got it to trust my CA, it stopped trusting Google's DNS API and crashed. |
| Lego CLI | Works great |
| Traefik (using Lego library) | Works great |
| Terraform ACME issuer | Renewal depends on running a plan or apply; not a good fit for short-lived certs. |
| Apache mod_md | Does not support ECDSA chains |
| HAProxy | Does not support ECDSA chains |
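The Certbot dns-01 result above is an example of what can happen when one trust store is shared by every outbound connection. Here's a minimal Python sketch of the alternative, using only the standard library `ssl` module: one TLS context per endpoint. The helper names and the file path in the comment are hypothetical, and this is not how Certbot or any other client listed here is implemented.

```python
import ssl
from typing import Optional


def make_tls_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context.

    With ca_bundle set, only the CAs in that PEM file are trusted.
    Without it, the system's default (Web PKI) trust store is loaded.
    """
    return ssl.create_default_context(cafile=ca_bundle)


def build_contexts(internal_root_pem: str) -> dict:
    """Keep a separate context for each kind of endpoint."""
    return {
        # Talks to the internal ACME CA; trusts only the internal root.
        "acme_ca": make_tls_context(internal_root_pem),
        # Talks to a cloud DNS API; trusts the public Web PKI roots.
        "dns_api": make_tls_context(),
    }


# Example (hypothetical path):
# contexts = build_contexts("/etc/acme/internal-root.pem")
# Pass contexts["acme_ca"] to the HTTPS client used for the ACME directory,
# and contexts["dns_api"] to the HTTPS client used for DNS record updates.
```

Keeping the two contexts separate means that adding the internal root never widens or replaces what the client trusts for the DNS API, and vice versa.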
Unfortunately, enterprise support for the ACME protocol, even in ACME clients, is still underdeveloped.
Given all of the ACME adoption in the Web PKI, it seems inevitable that it will be used more internally. But we've got a long way to go before certificate management with ACME in the enterprise is fully supported.
What's your take on the status of internal certificates and the ACME protocol? Share your experience and opinions! It's 10 quick questions and we plan to share the summary results with the community in an upcoming blog post so let your voice be heard.
Carl Tashian (Website, LinkedIn) is an engineer, writer, exec coach, and startup all-rounder. He's currently an Offroad Engineer at Smallstep. He co-founded and built the engineering team at Trove, and he wrote the code that opens your Zipcar. He lives in San Francisco with his wife Siobhan and he loves to play the modular synthesizer 🎛️🎚️