Never DIY Your PKI


J. Hunter Hawke


I never want to build my own PKI ever again if I can help it. A good PKI is essential for most organizations’ security models, and it’s important that your PKI is flexible and scalable for your future needs. But building one from scratch is much easier said than done.

When I built my own PKI, it turned into the biggest yak-shaving task I’ve ever had. By the conclusion of the project, I’d inevitably introduced more complexity and overhead into my cloud environment, but this was the cost of a good security posture—or so I’d thought. To be fair, this was probably the best solution at the time; it just would’ve been nice to have today’s alternatives instead.

[Image: yak shaving. Source: David Revoy, CC]

Background

My first job out of college was in cybersecurity at KCF Technologies, an IoT machine health company. I’d interned there during school and was hired as a Cybersecurity Analyst before stepping up as Information Security Architect.

My team was responsible for maintaining the security of our IoT devices and their network connections. This was an incredible learning experience for me, as the problems we were solving were challenging and required some real creativity.

One morning, my manager told me that our Hardware Engineering department was working on a prototype of our next IoT device. This new iteration was to be sleeker, faster, and more secure than former models. We were to collaborate with Hardware Engineering on modernizing and hardening the overall security of the device, from kernel to casing.

Historically, our devices used password-based authentication across the board. These passwords were cumbersome and difficult to rotate, and they were an outdated mode of authentication for the infrastructure we needed. We’d discussed building out password databases and automating rotation, but the supporting internal processes quickly turned into a spiderweb of complexity. Certificate-based authentication became the best—albeit scariest—alternative, and I was responsible for finding a Public Key Infrastructure (PKI) solution that fit our use case.

Searching for a path forward

Where to start? I dug around for managed PKI vendors. My team set up a few demo calls, and the vendors’ answer was always the same: every certificate was going to cost us $10-$20 a pop, far outside of the project budget. We had thousands of IoT devices in the field, so the cost per certificate had to be an order of magnitude smaller. It quickly became apparent that the managed PKI space was far too costly and that we’d need to build something ourselves. (This was before Certificate Manager existed as a product.)

I also looked at a DIY approach, using OpenSSL and Bash scripts. It took half an hour of research to convince me that was the wrong road. I’m especially impressed by anyone who has successfully built a PKI that way, but I have too much self-respect to cobble one together.

Requirements slowly became clearer, and I needed an open source option that would:

  1. Enable the team to automate certificate issuance and renewal,
  2. Provide multiple strong authentication types,
  3. Log all user/system actions and requests,
  4. Maintain a record of all host certificates, and
  5. Function as a highly available production deployment.

In short, I needed a holistic machine identity platform. This was a lot to ask of an open source project, but to my surprise, I found a handful of projects that looked impressive. Some were even maintained by actual software companies.

I spent a week researching and tinkering with various projects. Many of these options would require lots of duct tape to piece together the features we needed. Then I found step-ca and Smallstep.

The first thing that I noticed about step-ca was its absolute mountain of supporting documentation; it was far more than I could have hoped. step-ca also had a sister project, step CLI, that did all the command mapping and half of the automation for me. No need to write scripts to hit a series of APIs, and no need to use arcane OpenSSL commands to inspect and manage certificates.

Sold. I committed myself to using step-ca to secure our fleet of IoT devices.

More than a brief foray into step-ca

In retrospect, my biggest mistake was not dedicating my first week or two to reading every piece of documentation for the step toolchain. I’d learned about PKI in school, but I’d never designed and administered one. There were so many questions. What hierarchy structure would best accommodate our future needs? How should we store our CA keys? How can we have fail-safe connectivity to our entire fleet of devices? What about access control? Logging? Those were just the questions I knew I had. What else didn’t I know?

I like to learn by doing, so I dug into the software. I stood up an EC2 instance, installed the step toolchain, ran step ca init, and then step-ca ca.json to get my very first CA running—and that was all I needed to do. I switched to a local terminal to request a certificate, and sure enough, step ca certificate returned an X.509 certificate with all the settings I’d requested. That was so easy!

Now I only had a million other questions to answer.

The step CLI quickly became my favorite tool. I was blown away by how straightforward the most common operations were. The commands were intuitive and didn’t require a dictionary to construct.

We immediately replaced OpenSSL with step CLI. It did everything we needed from OpenSSL, and more. It enabled us to automate certificate issuance and renewal, inspect and lint certificates, interface with system trust stores, and utilize a slew of step crypto commands. The step ecosystem became a game-changer for us.

I began to wonder: how hard would it be to set up an OpenVPN server for our devices that used client/server certificate authentication? Fortunately for me, Carl Tashian had just written a blog post announcing X.509 certificate flexibility, which contained an OpenVPN tutorial. I got it working in less than half an hour, with SSO from our identity provider!

By this point I was sold. My proof of concept was successful: I had a centralized authority that could automate certificate issuance for an arbitrary OpenVPN server, and all that our operators had to do to connect was use SSO to request a client certificate. It was obvious that step-ca was light and customizable enough to fashion a holistic solution.

Perhaps we could also use it to secure SSH…

Why had I never heard of SSH certificates before?

[Image: "not helping" comic. Source: XKCD]

Next, I wanted to address SSH authentication to our IoT devices.

SSH traditionally has two methods of authentication: password-based, and SSH keys. Neither is ideal because you still have to rotate your secrets regularly—which is an operational nightmare. And, for us, it was impossible to distribute SSH keys in a manner that didn’t leave thousands of credentials scattered all over the place.

Around this time, I found Carl’s blog post on SSO for SSH. I’d never heard of SSH certificates before, and the concept seemed too good to be true. The idea of issuing short-lived certificates as a means of authentication was an appealing alternative to what we were currently using. All I had to do was generate two SSH key pairs, slap them in the CA, and start using step to sign SSH certificates—or so I’d thought.
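For readers who haven't met SSH certificates before, here is a minimal sketch of the idea using plain OpenSSH `ssh-keygen` rather than step-ca, which is what the project described here actually used. It creates the "two SSH key pairs" (a user CA and a host CA), then signs one short-lived user certificate and one host certificate. Every file name, identity, principal, and lifetime below is a placeholder chosen for illustration.

```shell
#!/bin/sh
# Sketch: SSH certificates with plain OpenSSH tooling (not step-ca).
# All names, principals, and lifetimes are placeholders.
set -eu

command -v ssh-keygen >/dev/null 2>&1 || { echo "ssh-keygen not found; skipping"; exit 0; }

# Scratch directory, recreated on every run so ssh-keygen never
# prompts about overwriting existing files.
WORK_DIR=./ssh-cert-sketch
rm -rf "$WORK_DIR"
mkdir -p "$WORK_DIR"

# The two CA key pairs: one signs user certificates, one signs host certificates.
ssh-keygen -q -t ed25519 -N '' -C 'user-ca' -f "$WORK_DIR/user_ca"
ssh-keygen -q -t ed25519 -N '' -C 'host-ca' -f "$WORK_DIR/host_ca"

# Stand-ins for a user's key and a device's host key.
ssh-keygen -q -t ed25519 -N '' -C 'hunter' -f "$WORK_DIR/id_ed25519"
ssh-keygen -q -t ed25519 -N '' -C 'host' -f "$WORK_DIR/ssh_host_ed25519_key"

# Short-lived user certificate: principal "hunter", valid for 16 hours.
# Writes id_ed25519-cert.pub next to the public key.
ssh-keygen -q -s "$WORK_DIR/user_ca" -I 'hunter' -n 'hunter' -V '+16h' \
  "$WORK_DIR/id_ed25519.pub"

# Host certificate (-h): principal is the name clients connect to.
# Writes ssh_host_ed25519_key-cert.pub next to the public key.
ssh-keygen -q -s "$WORK_DIR/host_ca" -I 'hostname.com' -h -n 'hostname.com' -V '+30d' \
  "$WORK_DIR/ssh_host_ed25519_key.pub"

# Print the user certificate's fields (type, principals, validity window).
ssh-keygen -L -f "$WORK_DIR/id_ed25519-cert.pub"
```

Because the certificate expires on its own, there is no long-lived credential to rotate or revoke on each device. A CA such as step-ca takes over the signing step and decides who is allowed to request a certificate.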

It was so easy to issue X.509 certificates to my endpoints, and I could simply use the same provisioners to issue SSH certificates to our devices as well. I signed host SSH certificates for my endpoints, set up my ssh-agent, requested a user certificate, entered my SSH command, and before I knew it…

hunter > ssh hunter@hostname.com
Permission denied (publickey).

It was about time I hit a hiccup with the project, so I dug further. As it turns out, my hosts weren’t set up to trust users, and my users weren’t set up to trust hosts. Consulting some examples I found, I added a few lines to sshd_config on my hosts and added the public SSH CA key to my known_hosts file. I tried again…

hunter > ssh hunter@hostname.com
Permission denied (publickey).
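The two-way trust mentioned above (hosts trusting user certificates, users trusting host certificates) usually comes down to a few OpenSSH settings. The sketch below writes them into a scratch directory for review instead of touching a live system. The file paths, the host pattern, and the CA public key are placeholders, and this is not the exact configuration used in the project.

```shell
#!/bin/sh
# Sketch: render the SSH certificate trust settings into ./ssh-trust-sketch.
# Paths, host pattern, and key material are placeholders, not real values.
set -eu

OUT_DIR=./ssh-trust-sketch
mkdir -p "$OUT_DIR"

# Placeholder for the host CA's public key line (type, base64 key, comment).
HOST_CA_PUB="${HOST_CA_PUB:-ssh-ed25519 AAAA_PLACEHOLDER_HOST_CA_PUBLIC_KEY host-ca}"

# Host side (sshd_config):
#   TrustedUserCAKeys - accept user certificates signed by the user CA
#   HostKey / HostCertificate - present the host's own signed certificate
cat > "$OUT_DIR/sshd_config.snippet" <<'EOF'
TrustedUserCAKeys /etc/ssh/user_ca.pub
HostKey /etc/ssh/ssh_host_ed25519_key
HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub
EOF

# Client side (known_hosts): trust any host certificate signed by the
# host CA for names matching the pattern list.
printf '@cert-authority %s %s\n' 'hostname.com,*.hostname.com' "$HOST_CA_PUB" \
  > "$OUT_DIR/known_hosts.snippet"

cat "$OUT_DIR/sshd_config.snippet" "$OUT_DIR/known_hosts.snippet"
```

When a login still fails with `Permission denied (publickey).`, three checks tend to narrow it down. `sshd -t` validates the server configuration. `ssh-keygen -L -f` on the certificate shows its principals and validity window. `ssh -v` shows which keys and certificates the client actually offered.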

I found more issues by looking into the detailed SSH and host-side logs. Fixing those revealed additional errors that had been hiding. I proceeded to spend days searching Google, Reddit, and the OpenSSH Cookbook for potential solutions with no luck. So I took to GitHub Discussions, and trusty Carl (thank you, Carl) helped me troubleshoot my final errors and pointed me in the right direction.

I’d wasted a painful amount of time trying to get SSH certificates to work, but I finally had a process that worked consistently and that we could script into our device-provisioning process. My PoC was a huge success, and I got the green light to build the production system that would serve as our machine identity PKI.

Building the damned thing

Now that I had a great PoC working, I thought the hardest part of the project was over. I’d managed to find the perfect software to support the project, and in building my PoC I felt I’d hit every possible complication, error, and misconfiguration. But there were so many more challenges ahead. I won’t reveal the details of how my architecture was structured or what the security model looked like, but here are some examples.

The biggest problem was that step-ca is a single-tenant CA that doesn’t inherently provide high availability (HA). If there were an outage, I’d be stuck with a few thousand devices and users that couldn’t renew or issue certificates—unacceptable.

The problem of HA posed a huge issue for me since, at the time, neither Remote Provisioner Management nor the step-kms-plugin was part of the step-ca platform. That created another challenge: I had to come up with a way to synchronize my CA configuration across nodes and upload a copy of my signing keys to each compute resource in use.

Ultimately, I cobbled together networking and data layers using Terraform to structure the AWS microservices and scripts I used. Check out Smallstep’s case study on KCF Technologies for further details.

In total, the time from my discovery of step-ca to having a completed Terraform module in hand was about six months. It was the most intensive and unusual project I had ever worked on, as step-ca required a decent knowledge of PKI as a prerequisite. I had to learn from scratch how to design, manage, grow, and maintain a custom PKI. I pored over Smallstep’s blogs, docs, and forums to backfill my lack of understanding in the domain. The project required much yak shaving and many failed attempts to cobble together a resilient solution.

Putting the Smallstep module to work

I finally had a Terraform module to fit an MVP of the project requirements: HA, scalable, segmented, secured. After only six months, the Hardware Team finally had a dedicated CA, and they were unblocked from the security part of the project.

But my work wasn’t done. There were still many manual steps involved in operating the CA. Now that the Hardware Team was unblocked, it was time to iterate on my setup for the long term. The requirements were also growing; the Hardware Team wanted two more CAs!

Behind the scenes, I had Lambdas standing up containers and dropping in the configuration files from an S3 bucket. The sensitive info in my configuration was also populated by Lambdas, but the provisioners themselves existed on the base configuration file. Any and all provisioner operations had to be performed directly against a file sitting in an S3 bucket, and someone had to manually press a button to apply changes (if you didn’t want to wait for the Lambdas to detect a change). With only one authority, this was fine. With three, it was a headache.

I created a new Epic: “Build out the quality-of-life pieces for our CAs.” Realistically, this meant: “Finish building the Terraform module for real this time.”

[Image: "infinite problems" comic. Source: XKCD]

This part of the project was painful. It was one of those never-ending tickets that permanently lived in the “In Progress” column of the team’s Kanban board. Smallstep’s engineers and open source community continuously shipped new features that extended step-ca and addressed some of the hoops I’d jumped through to get my design to work. This made it difficult to change the architecture of my running authorities to both adopt new features and backfill the features/automation I still needed. Nevertheless, “perfection” is the enemy of “done,” so it made sense to start with an MVP.

Managing our CAs became a large part of my job, and it was definitely tedious. Nine months in, if I could have burned my Terraform module to the ground and redesigned everything, I would have in a heartbeat. There were new KMS features that could change the way we were signing keys, and we needed other automations that I hadn’t considered. However, I imagine I’d still feel the same way now even if I had; there’s simply too much functionality that has changed and grown in the step-ca project…champagne problems.

Why I never want to build another PKI

The truth is, I never want to build my own PKI ever again because I never again want to deal with the overhead and yak shaving it involves.

I discovered step-ca is a brilliant and elegant solution to massive problems we regularly experience in network and infrastructure security. I highly, highly recommend to all of my colleagues that they at least take time to play with step-ca and learn something useful about PKI and its applications. It is a phenomenal, educational tool, perfect for small, one-off projects. That said, I don’t recommend taking the time and resources to build out step-ca to service an entire production system. It’s just too much.

I joined Smallstep as a Security Solutions Architect a little over a year ago because I love the step ecosystem. It’s my favorite open source project, and I love that step-ca can be used by anyone in any part of the world to apply security best practices. However, I wish Smallstep Certificate Manager had been generally available when I started my PKI journey at KCF Technologies. More than six months of my time and pay went into designing and building out a Certificate Authority our devices could use. I devoted even more to keeping the project alive and updated. Imagine the money and time we could have saved for the organization if we had been able to get up and running with a resilient CA and a dedicated support team in a matter of days instead of months.

I also think of the features that Certificate Manager offers, but that open source doesn’t. A web UI, endpoint management/visibility, centralized logging, user session tracking, webhooks, and email notifications are all features I wish that I’d had for my PKI. Instead, I had to query through loads of garbage in CloudWatch, and Hardware Engineering had to create a telemetry service to monitor their issued certificates.

One of my greatest joys at Smallstep is helping our customers get started with our software and automate certificate issuance all over their infrastructure. Users start out on day one with an unlimited number of highly available certificate authorities and direct access to our team of professionals. I did really enjoy learning about PKI and tinkering with step-ca. I think it is such a fun and powerful tool. But if you’re looking for a production-grade certificate authority, come and talk to us, so we can get you moving in the right direction, beginning on day one.

I recognize my bias, but I recommend managed Certificate Manager or its partner solution SSH Professional for any production workload that needs certificates.

Hunter Hawke (LinkedIn, Twitter) has a background in Security Operations and IoT Security Architecture. Since joining Smallstep as a Security Solutions Architect, his primary focus is helping customers secure their systems with the step toolchain.