Improving ForgeOps Disk Performance and Security on AWS EKS

bradley.tarisznyas · April 12, 2022, 8:48pm

With the current push to Kubernetes, more and more ForgeRock customers have either deployed or are planning to deploy the ForgeRock platform on Kubernetes using ForgeOps (https://github.com/ForgeRock/forgeops).

Of these customers, most choose a managed Kubernetes service such as AWS Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE) or Azure Kubernetes Service (AKS), all of which are supported by ForgeOps.

To support deployment to public cloud Kubernetes services, ForgeOps provides a number of cluster creation scripts to assist customers in setting up their clusters. These scripts are provided as a starting point for customers getting started with Kubernetes and ForgeOps, and can be updated to suit specific customer needs.

One improvement to consider when deploying ForgeOps to AWS EKS is the performance of the underlying disks used for DS Persistent Volumes (PVs).

The ForgeOps documentation has the following to say about limitations of DS in Kubernetes:

DS live data and logs should reside on fast disks.

DS data requires high performance, low latency disks. Use external volumes on solid-state drives (SSDs) for directory data when running in production. Do not use network file systems such as NFS.

For reference:

https://backstage.forgerock.com/docs/forgeops/7.1/rn/limitations.html#ds-limitations

This recommendation isn’t specific to Kubernetes deployments; it applies to DS in general (https://backstage.forgerock.com/docs/ds/7.1/deployment-guide/prerequisites.html#size-io-storage).

Anyone familiar with cloud storage knows that not all disks are created equal: most cloud providers offer a range of block storage options with varying levels of performance and pricing. Volume selection must be carefully considered based on the workload, and for database-type applications IOPS (input/output operations per second) is critical for high performance.

Coming back to AWS EKS and the ForgeOps cluster creation scripts (https://github.com/ForgeRock/forgeops/tree/master/cluster/eks), the cluster-up.sh script uses eksctl to create the cluster. In this script we can also see the setup of the StorageClasses that the deployment will use when creating the PVs for DS:

createStorageClasses() {
    kubectl create -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
EOF

    # Set default storage class to 'fast'
    kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

    # Delete gp2 storage class
    kubectl delete storageclass gp2
}

As we can see, the standard Kubernetes in-tree storage plugin (kubernetes.io/aws-ebs) is being used and, regardless of cluster size, both the ‘standard’ and ‘fast’ storage classes use gp2 volumes.

For consistent high performance of DS, especially CTS (Core Token Service), this is a potential problem.

AWS gp2 EBS volume performance is based on the size of the volume: the baseline is calculated at 3 IOPS per GiB, with a minimum of 100 IOPS (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html).
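As a rough illustration of the gp2 baseline rule, here is a minimal shell sketch. The function name gp2_baseline_iops is hypothetical, and the 16,000 IOPS ceiling is taken from the AWS gp2 documentation rather than from the ForgeOps scripts.

```shell
# Baseline IOPS of a gp2 volume: 3 IOPS per GiB,
# with a floor of 100 and a ceiling of 16,000 (per the AWS gp2 documentation).
# gp2_baseline_iops is a hypothetical helper name used for illustration only.
gp2_baseline_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))
  if (( iops < 100 )); then iops=100; fi
  if (( iops > 16000 )); then iops=16000; fi
  echo "$iops"
}
```

Calling gp2_baseline_iops 100 gives the 300 IOPS baseline discussed in this article, while a small 20 GiB volume falls back to the 100 IOPS floor.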

AWS gp2 volumes are also ‘burstable’, meaning that they can ‘burst’ up to 3,000 IOPS for short periods of time as long as they have sufficient I/O credits. I/O credits accumulate at a rate based on the size of the volume: the larger the volume, the quicker credits accumulate. If a gp2 volume depletes its I/O credit balance, it will only provide the baseline level of performance.

This means that the 100 GB volumes specified in the small deployment (https://github.com/ForgeRock/forgeops/blob/master/kustomize/overlay/small/ds-cts.yaml) will be capable of bursting to 3,000 IOPS, but once their I/O credits are depleted they will be constrained to the baseline of 300 IOPS.

For a 100 GB gp2 volume operating at 3,000 IOPS, a full burst balance is sufficient for approximately 33 minutes at this level of performance before it is depleted and the volume is throttled to its baseline performance (300 IOPS).
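The 33 minute figure can be reproduced with a little arithmetic. The sketch below assumes the full credit bucket of 5.4 million I/O credits that AWS documents for gp2, and uses the burst duration formula from the same documentation (credit balance divided by burst IOPS minus 3 IOPS per GiB). The function name gp2_burst_seconds is hypothetical.

```shell
# Seconds a gp2 volume can sustain 3,000 IOPS starting from a full credit bucket.
# Assumption: bucket size of 5,400,000 credits, as documented by AWS for gp2.
# gp2_burst_seconds is a hypothetical helper name used for illustration only.
gp2_burst_seconds() {
  local size_gib=$1
  local bucket=5400000
  local burst=3000
  local earn_rate=$(( size_gib * 3 ))   # credits earned per second
  if (( earn_rate >= burst )); then
    # Volumes of 1,000 GiB or more have a baseline at or above the burst rate.
    echo "unlimited"
    return 0
  fi
  echo $(( bucket / (burst - earn_rate) ))
}
```

For a 100 GiB volume this works out to 5,400,000 / (3,000 - 300) = 2,000 seconds, or roughly 33 minutes, after which the volume drops to its baseline.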

What does this mean?

One of the most critical components of a ForgeRock deployment when it comes to performance is the CTS, especially when using a ‘stateful’ deployment where tokens are stored in CTS rather than on the client. CTS servers are typically disk bound: in high-throughput deployments they are constantly reading from and writing to disk as tokens are created, modified and deleted.

With the gp2 PVs used by the ForgeOps EKS deployment, this could create a critical bottleneck at peak times: as I/O credits are used up, volume I/O is restricted back to the baseline, heavily impacting the performance of the whole platform and the applications that rely on it.

How do we ensure this doesn’t happen?

There are better options for DS servers than gp2 volumes. AWS offers the newer gp3 volumes, which provide a consistent (non-burstable) baseline of 3,000 IOPS at lower cost (https://aws.amazon.com/ebs/general-purpose/). They can also be configured with up to 16,000 IOPS (for extra cost), and performance isn’t tied to the size of the volume as it is with gp2.

AWS also offers provisioned IOPS volumes (io1 and io2), which can scale up to 64,000 IOPS, catering to very high volume workloads.

As the Kubernetes in-tree storage plugin doesn’t support newer EBS volume types (such as gp3 or io2), we need to configure the AWS EBS CSI driver (https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) in the cluster, to give us access to the newer volume types.

We’ll run through this configuration next.

Configuring gp3 encrypted PVs for DS

As part of this exercise we are also going to be using encrypted EBS volumes for DS. This ensures data at rest is encrypted, and as encrypting backends in DS is currently not supported in the DS Docker images, we’ll achieve this by encrypting the EBS volume.

As mentioned, the ForgeOps EKS cluster scripts use the kubernetes.io/aws-ebs in-tree storage plugin when creating storage classes, which doesn’t support the newer gp3 volumes. If you try to specify ‘gp3’ as the type, you will get an error in cluster creation saying the volumeType is invalid. To make use of gp3 volumes we need to use the AWS EBS CSI driver.

Once the cluster is created, but before deploying any ForgeOps artefacts, we need to enable the CSI driver in the cluster and recreate the storage classes to use gp3 volumes.

1. Create the cluster using the cluster-up.sh script located in <forgeops>/cluster/eks (e.g. ./cluster-up.sh small.yaml)

2. At this point you can verify the storage classes via kubectl, which shows the in-tree storage plugin (kubernetes.io/aws-ebs) and the gp2 type:

kubectl describe storageclass fast

3. Enable the AWS EBS CSI driver as outlined in https://docs.aws.amazon.com/eks/latest/userguide/managing-ebs-csi-self-managed-add-on.html. Follow the section ‘To deploy the Amazon EBS CSI driver to an Amazon EKS cluster’ up to step 3. You can deploy the sample app if you wish (detailed in ‘To deploy a sample application and verify that the CSI driver is working’), but we’ll be deploying ForgeOps and validating that way later in this article.

4. Verify the driver is installed:

kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-ebs-csi-driver,app.kubernetes.io/instance=aws-ebs-csi-driver"
NAME                                  READY   STATUS    RESTARTS   AGE
ebs-csi-controller-56b5f5c7ff-2hrcb   5/5     Running   0          41h
ebs-csi-controller-56b5f5c7ff-jzqq6   5/5     Running   0          41h
ebs-csi-node-25z6q                    3/3     Running   0          23h

5. Delete the ‘standard’ and ‘fast’ storage classes created by the cluster-up.sh script:

kubectl delete storageclass standard
kubectl delete storageclass fast

6. Recreate the storage classes using the AWS EBS CSI driver and gp3 encrypted volumes:

kubectl create -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
EOF

7. Describing the fast storage class now shows:

kubectl describe storageclass fast

Now when we create DS pods with the ‘fast’ storage class, we will get encrypted gp3 volumes.
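If the default 3,000 IOPS of gp3 is not enough, the storage class can also request provisioned performance. The following is a minimal sketch, assuming the EBS CSI driver's iops and throughput StorageClass parameters for gp3. The function name render_gp3_storage_class and the example values 6000 and 250 are hypothetical, and should be checked against the driver version installed in your cluster and the gp3 limits for your volume size.

```shell
# Print a StorageClass manifest for an encrypted gp3 volume via the EBS CSI driver.
# Usage: render_gp3_storage_class NAME [IOPS] [THROUGHPUT_MIBPS]
# render_gp3_storage_class is a hypothetical helper, not part of ForgeOps.
render_gp3_storage_class() {
  local name=$1
  local iops=${2:-}
  local throughput=${3:-}
  cat <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ${name}
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
EOF
  # Optional gp3 tuning; values are quoted because StorageClass parameters are strings.
  if [ -n "$iops" ]; then echo "  iops: \"${iops}\""; fi
  if [ -n "$throughput" ]; then echo "  throughput: \"${throughput}\""; fi
}

# Example (not executed here):
#   render_gp3_storage_class fast 6000 250 | kubectl create -f -
```

Note that cluster-up.sh originally marked ‘fast’ as the default storage class with a kubectl patch command. A recreated class does not carry that annotation, so if your deployment relies on a default class you may need to run that patch command again.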

Now go ahead and deploy the CDM according to: https://backstage.forgerock.com/docs/forgeops/7.1/cdm/overview.html

Once deployed, let’s verify our DS instances are using the new storage classes using:

kubectl get pv
kubectl describe pv <pv_name>

This confirms we have a 20 GB ‘fast’ PV, but how do we know it’s gp3?

Look at the source.VolumeHandle property in the output of the ‘describe pv’ command. This is the AWS volume ID, which we can then look up via the AWS CLI:

aws ec2 describe-volumes --volume-ids <volume_id>

which will show the encrypted gp3 volume with 3000 (non-burstable) IOPS!
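The two lookups can be combined into one step. This is a sketch only: it assumes the PV was provisioned by the EBS CSI driver (so the volume ID is stored in .spec.csi.volumeHandle) and that both kubectl and the AWS CLI are configured for the cluster's account and region. The function name describe_pv_volume is hypothetical.

```shell
# Print the volume type, IOPS and encryption flag of the EBS volume behind a PV.
# Assumption: the PV was created by the EBS CSI driver, so its volume ID
# is available at .spec.csi.volumeHandle.
# describe_pv_volume is a hypothetical helper name used for illustration only.
describe_pv_volume() {
  local pv_name=$1
  local volume_id
  volume_id=$(kubectl get pv "$pv_name" -o jsonpath='{.spec.csi.volumeHandle}') || return 1
  if [ -z "$volume_id" ]; then
    echo "no CSI volume handle found on PV ${pv_name}" >&2
    return 1
  fi
  aws ec2 describe-volumes --volume-ids "$volume_id" \
    --query 'Volumes[0].[VolumeType,Iops,Encrypted]' --output text
}

# Example (not executed here):
#   describe_pv_volume <pv_name>
```

PVs created by the in-tree plugin do not have a CSI volume handle, so the function returns an error for those rather than guessing the volume ID.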

A final word…

ForgeRock DS ships with handy utilities that allow you to benchmark the performance of DS servers (https://backstage.forgerock.com/docs/ds/7.1/getting-started/performance.html). These tools should form the basis of a ‘bottom-up’ approach to performance testing to ensure the storage layer is performing optimally. Combine these tools with dstat or iostat (if you have them in your base DS image) to ensure the underlying disk storage is not a cause of bottlenecks.
