Scalable self-hosted runner system for GitHub Actions
An updated step-by-step guide on how to deploy a system capable of creating self-hosted runners on demand!
Hello everyone! This article is intended for organizations that develop on private repositories and for which the minutes included with GitHub are not enough for their CI/CD needs. It is a compilation of the steps performed in the GitHub office hours workshop videos. You can see the whole playlist here and the companion repository here. However, I wrote this article for those who do not want to watch all three videos, which are a bit lengthy. Just note that the platform being used is Microsoft Azure.
Let’s begin!
Organize deployments
We are going to create a resource group called GithubActionsRunners. This resource group will contain all the Azure services required to deploy our autoscaling solution for GitHub runners.
You should have the Azure CLI installed; otherwise, check here for how to do that. First, log in to your Azure account:
az login
Now, on your shell run the following commands:
az account set --subscription <subscription>
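If you are not sure which subscription ID to pass to the previous command, you can list the subscriptions available to your account first:
az account list --output table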
Create the resource group:
az group create --name GithubActionsRunners --location <region>
Create the Azure Kubernetes cluster
As my GitHub Actions workloads are not very resource intensive, I will stick to the predefined Standard_DS2 node type for my worker pool. With this command we can create an Azure Kubernetes Service (AKS) cluster:
az aks create \
--resource-group GithubActionsRunners \
--name GithubActionsRunnersAKS \
--enable-addons monitoring \
--node-count 3 \
--node-vm-size Standard_DS2 \
--generate-ssh-keys
Connecting to AKS cluster
Enter the following command:
az aks get-credentials --resource-group GithubActionsRunners \
--name GithubActionsRunnersAKS
This will enable you to use kubectl commands on your AKS cluster. Make sure you have kubectl installed; if not, check this out.
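A quick way to confirm that the credentials work is to list the cluster nodes; you should see the three nodes created earlier:
kubectl get nodes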
Provisioning the networking infrastructure
We need to create an Application Gateway, assign it a public IP, and connect it to our AKS cluster. This way, we can expose a webhook server that allows the controller to spin up our future runners on demand.
First, create a public IP service:
az network public-ip create \
--resource-group GithubActionsRunners \
--name APGWPublicIp \
--allocation-method Static \
--sku Standard
Then, create a VNet for the application gateway. This VNet will be separate from the AKS cluster’s VNet to avoid any overlap in IP assignment.
az network vnet create --name appgwVNet \
--resource-group GithubActionsRunners \
--address-prefix 11.0.0.0/8 \
--subnet-name appgwSubnet \
--subnet-prefix 11.1.0.0/16
Now, create the application gateway itself:
az network application-gateway create \
--resource-group GithubActionsRunners \
--name GithubActionsRunnersAPGW \
--location <region> \
--sku Standard_v2 \
--public-ip-address APGWPublicIp \
--vnet-name appgwVNet \
--subnet appgwSubnet
Fetch its ID by running:
APPGW_ID=$(az network application-gateway show --resource-group GithubActionsRunners --name GithubActionsRunnersAPGW --query "id" --output tsv)
and attach it to the AKS cluster to allow ingress:
az aks enable-addons --resource-group GithubActionsRunners \
--name GithubActionsRunnersAKS \
--addons ingress-appgw \
--appgw-id $APPGW_ID
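Optionally, confirm that the add-on was attached by querying the cluster’s add-on profiles. Note that ingressApplicationGateway below is the profile key I would expect for this add-on; adjust the query if your CLI reports it under a different name:
az aks show --resource-group GithubActionsRunners \
--name GithubActionsRunnersAKS \
--query "addonProfiles.ingressApplicationGateway.enabled" \
--output tsv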
Next, run the following commands to gather the information needed to create the VNet peering in both directions between the application gateway and the AKS cluster.
NODERESOURCEGROUP=$(az aks show --name GithubActionsRunnersAKS --resource-group GithubActionsRunners --query "nodeResourceGroup" --output tsv)
AKSVNETNAME=$(az network vnet list --resource-group $NODERESOURCEGROUP --query "[0].name" --output tsv)
AKSVNETID=$(az network vnet show --name $AKSVNETNAME --resource-group $NODERESOURCEGROUP --query "id" --output tsv)
APPGWVNETID=$(az network vnet show --name appgwVNet --resource-group GithubActionsRunners --query "id" --output tsv)
- Application gateway to AKS peering:
az network vnet peering create --name AppGWtoAKSVnetPeering \
--resource-group GithubActionsRunners \
--vnet-name appgwVNet \
--remote-vnet $AKSVNETID \
--allow-vnet-access
- AKS to application gateway peering:
az network vnet peering create --name AKStoAppGWVnetPeering \
--resource-group $NODERESOURCEGROUP \
--vnet-name $AKSVNETNAME \
--remote-vnet $APPGWVNETID \
--allow-vnet-access
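Optionally, verify that the peering on the gateway side was established; peeringState should report Connected:
az network vnet peering show --name AppGWtoAKSVnetPeering \
--resource-group GithubActionsRunners \
--vnet-name appgwVNet \
--query peeringState --output tsv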
Finally, assign a DNS name to the public IP address. Go to the public IP resource in the Azure portal and, under Configuration, set a DNS name label.
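If you prefer the CLI over the portal, the same DNS label can be set directly on the public IP. Here <dns-label> is a placeholder for the name you choose, and the resulting hostname should look like <dns-label>.<region>.cloudapp.azure.com:
az network public-ip update --resource-group GithubActionsRunners \
--name APGWPublicIp \
--dns-name <dns-label>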
Installing cert-manager on AKS cluster
According to the controller’s documentation, the system uses cert-manager for certificate management of the admission webhook. This means we have to install it before configuring the controller on our cluster. Run the following:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.7.1/cert-manager.yaml -n cert-manager
This will install cert-manager under the cert-manager namespace. We should verify that the installation was successful by checking the pods:
kubectl get pods --namespace cert-manager
Wait for a few seconds for the deployments to finish. We should have an output similar to this:
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-5c6866597-zw7kh               1/1     Running   0          2m
cert-manager-cainjector-577f6d9fd7-tr77l   1/1     Running   0          2m
cert-manager-webhook-787858fcdb-nlzsq      1/1     Running   0          2m
Next, create an Issuer to test that the webhook works. Create a YAML file called test-resources.yaml and paste this:
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: test-selfsigned
  namespace: cert-manager-test
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: selfsigned-cert
  namespace: cert-manager-test
spec:
  dnsNames:
    - example.com
  secretName: selfsigned-cert-tls
  issuerRef:
    name: test-selfsigned
Create the test resources:
kubectl apply -f test-resources.yaml
Now check the status of the newly created certificate:
kubectl describe certificate -n cert-manager-test
We should have an output similar to this:
...
Spec:
  Common Name:  example.com
  Issuer Ref:
    Name:  test-selfsigned
  Secret Name:  selfsigned-cert-tls
Status:
  Conditions:
    Last Transition Time:  2019-01-29T17:34:30Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2019-04-29T17:34:29Z
Events:
  Type    Reason      Age   From          Message
  ----    ------      ----  ----          -------
  Normal  CertIssued  4s    cert-manager  Certificate issued successfully
Finally, clean up the test resources:
kubectl delete -f test-resources.yaml
We have verified that cert-manager works!
Create the ingress and production issuer
Now that the certificate manager works, we need to create a real Issuer and an Ingress resource. Create a YAML manifest called ingress-tls.yaml with the following content (replace the hosts with your hostname):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-main
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    cert-manager.io/cluster-issuer: letsencrypt-prod
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts:
        - <your-hostname-defined-in-APGWPublicIp>
      secretName: tlsingress
  rules:
    - host: <your-hostname-defined-in-APGWPublicIp>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-app
                port:
                  number: 80
Now, let’s define a real Issuer by creating the following YAML; call it cluster-issuer-prod.yaml:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <your-email>
    privateKeySecretRef:
      name: tlssprod
    solvers:
      - http01:
          ingress:
            name: ingress-main
Do not apply these resources just yet! Notice that there is a test-app service assigned to the path /? We will deploy a test app to ensure that the ingress is working correctly before moving on to installing the controller. Create a YAML file called test-app.yaml and paste this into it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: test-app
          image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
          ports:
            - containerPort: 80
              protocol: TCP
          env:
            - name: TITLE
              value: "AKS Ingress test is successful!"
---
apiVersion: v1
kind: Service
metadata:
  name: test-app
  namespace: default
spec:
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  selector:
    app: test-app
Deploy the test app to your cluster:
kubectl apply -f test-app.yaml -n default
Then, apply the ingress and the issuer to AKS using:
kubectl apply -f ingress-tls.yaml
kubectl apply -f cluster-issuer-prod.yaml
Now, check the ingress:
kubectl get ingress
You should be able to see your hostname and public IP address. If you open your browser and navigate to your hostname URL, you should see the AKS hello-world welcome page.
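If the browser complains about the certificate, it may simply not have been issued yet. You can check its status through cert-manager; the Certificate created for the ingress is typically named after the TLS secret, tlsingress in our case:
kubectl get certificate
kubectl describe certificate tlsingress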
This means that we are on the right track and we can move on to configure our webhook and controller!
Configuring GitHub App
This step needs to be done prior to installing the actions runner controller on your Kubernetes cluster, because some secrets are needed for the installation. First, we need to install the GitHub CLI using conda:
conda install gh --channel conda-forge
Now, let’s install a GitHub CLI extension for token management called gh-token. We need its dependency, jq, to be installed first:
wget https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
sudo chmod +x ./jq-linux64
mv ./jq-linux64 jq
sudo mv ./jq /usr/bin
With that set up, install the GH Token extension:
gh auth login
gh extension install Link-/gh-token
We also need to install jwt-cli, a command-line tool that helps us work with JSON Web Tokens:
wget https://github.com/mike-engel/jwt-cli/releases/download/4.0.0/jwt-linux.tar.gz
tar xvf jwt-linux.tar.gz
sudo mv jwt /usr/bin/
rm -rf ./jwt-linux.tar.gz
Next, on GitHub, go to your organization’s settings, select Developer settings, then GitHub Apps, and click New GitHub App. Give it the following information:
- Name for the App.
- Homepage URL of your organization.
- Uncheck Webhook just for now. We are going to enable that later.
- Permissions
- Actions: Read-only
- Contents: Read-only
- Metadata: Read-only
- Self-hosted runners: Read and Write
Then select “Create GitHub App”. You will be redirected to the app settings, where you should see an “Install App” tab on the left side. Select it, and press the “Install” button.
After the app is installed, in the “About” section, copy the App ID.
Scroll down and generate a new private key. Save it in a secure location on your computer, then use the App ID and the private key with the gh token command to get the installation ID:
gh token installations -i <App-ID> -k <path-to-private-key>
If you encounter a message like this:
{
"message": "'Expiration time' claim ('exp') is too far in the future",
"documentation_url": "https://docs.github.com/rest"
}
It is probably because of the difference between the GitHub server time and your local time. In that case, you might want to try this command:
sudo date +%T -s $(curl -sI https://api.github.com | grep -Fi "date" | awk '{ print $6 }') -u
and re-run the gh token command again. Get the value from the "id" key in the JSON output.
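Since jq is already installed, you can also extract the installation ID directly. This assumes the command prints a JSON array of installation objects; adjust the filter if yours returns a single object:
gh token installations -i <App-ID> -k <path-to-private-key> | jq '.[].id'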
We are finally ready to install the actions-runner-controller!
Installing actions-runner-controller using Helm
Refer to this documentation on how to install the Helm package manager. Use the following commands to add the actions-runner-controller repository:
helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
helm repo update
We will need to create a values.yaml manifest in which we are going to insert the three secrets obtained (App ID, installation ID, and private key) plus an additional random token as a webhook secret. It is a pretty long file, so instead of pasting it in this article, you can copy the example file from the original repository here. I suggest you press CTRL+F after the file is copied and search for the following keys:
- github_app_id
- github_app_installation_id
- github_app_private_key
- github_webhook_secret_token
Then, insert the corresponding values next to those keys. For the additional github_webhook_secret_token key, I used the Python standard-library module secrets to generate a random token:
import secrets
# webhook_token variable contains the random token
webhook_token = secrets.token_urlsafe(16)
print(webhook_token) # copy the output value
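If you prefer to generate the token from the shell instead, either of these one-liners works (the openssl variant simply prints a random hex string):
python3 -c 'import secrets; print(secrets.token_urlsafe(16))'
openssl rand -hex 16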
Save the manifest. With all the parameters set up, run the following to deploy the controller:
helm upgrade --install -f "path/to/values.yaml" \
--namespace default \
--create-namespace --wait \
actions-runner-controller \
"actions-runner-controller/actions-runner-controller"
Check that the deployments were successful:
kubectl get pods -n default
We should have the following output:
NAME                                                              READY   STATUS    RESTARTS   AGE
actions-runner-controller-546dfcc8b8-wc7mt                        2/2     Running   0          20s
actions-runner-controller-github-webhook-server-6bb9654dc8ckjgk   2/2     Running   0          20s
test-app-65d7765fdd-6rff6                                         1/1     Running   0          16h
Updating the Ingress for our actions runner controller
Recall our ingress manifest called ingress-tls.yaml. Search for the http key. We want to insert the following path, /runners-scaler, right below the existing path / where our test app is:
          - path: /runners-scaler
            pathType: Prefix
            backend:
              service:
                name: actions-runner-controller-github-webhook-server
                port:
                  number: 80
Then, just update the ingress:
kubectl apply -f ingress-tls.yaml
Verify that everything is working correctly:
kubectl describe ingress ingress-main
On the output, search for Events and you should see something similar to this:
Events:
  Type    Reason             Age   From          Message
  ----    ------             ----  ----          -------
  Normal  CreateCertificate  14s   cert-manager  Successfully created Certificate "tlsingress"
Next, use your browser or curl -G with your URL plus the route that was just created, i.e. <your-url>/runners-scaler.
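For example, from the terminal (replace the hostname with the DNS name you configured earlier):
curl -G https://<your-hostname>/runners-scaler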
It should return a message saying Webhook server is running. That means everything is OK. Eureka!
Adding the Webhook URL to the GitHub APP
Go back to the app settings of your GitHub App and re-enable the webhook (remember we unchecked it earlier). In the “Webhook URL” textbox, insert your URL with the /runners-scaler route. Right after that, there is a textbox called “Webhook secret”. Paste the token from your values.yaml file, the one generated with the previous Python snippet.
Save the changes. Finally, on the “Permissions & Events” tab, scroll down to “Subscribe to events” and check:
- Workflow job
- Workflow dispatch
- Workflow Run
Save your changes again. On the “Advanced” tab, you should see recent webhook deliveries with successful responses, which means that everything is good!
Spinning up the runners
There are two ways the runners can be created: one is using deployments of kind Runner and the other is using RunnerDeployment. In this tutorial, we are using RunnerDeployment, as it allows us to manage sets of runners; the Runner kind only manages them individually.
RunnerDeployment
We need to create a YAML manifest for the deployment; let’s call it basic-runners.yaml. Start by including the RunnerDeployment kind:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: azure-runners
  namespace: default
spec:
  replicas: 0
  template:
    spec:
      organization: <your-organization>
      labels:
        - label1
        - label2
      env:
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"
Without autoscaling, we can manually define the number of runners to create by setting the replicas key inside the spec key.
HorizontalRunnerAutoscaler
Autoscaling is implemented by backing a RunnerDeployment resource with a HorizontalRunnerAutoscaler kind. With this, every time there is a webhook event, runners can be created automatically and then destroyed when finished. We need to append the specification for the autoscaler to the basic-runners.yaml manifest:
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: azure-runners-deployment-autoscaler
  namespace: default
spec:
  scaleTargetRef:
    name: azure-runners
  minReplicas: 0
  maxReplicas: 6
  scaleUpTriggers:
    - githubEvent: {}
      duration: "1m"
and apply the resources:
kubectl apply -f basic-runners.yaml
With the above example, the webhook server scales azure-runners by one replica, and destroys that runner one minute after it has finished its job.
By now you should be able to automatically create runners whenever a GitHub workflow is queued on your Actions tab. Only the jobs that match the labels we set in basic-runners.yaml will be executed by our runners (you can see in the YAML file that there are two labels, label1 and label2).
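To watch the scaling in action, you can list the controller’s custom resources while a workflow is queued; the resource names below are the ones defined by the actions-runner-controller CRDs:
kubectl get runnerdeployments,horizontalrunnerautoscalers -n default
kubectl get runners -n default --watch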
Note: You can also create customized runners if you need certain tools or dependencies installed before doing any job. But, that is a topic for another article.
If you made it through here, thank you for reading my post. I know there are lots of steps, but it is really worth considering if you need a scalable self-hosted runner solution for GitHub Actions. Check out my other articles if you like.
See you on the next one!