
GitHub-hosted runners don't meet our requirements. Configure self-hosted runners at scale.


The Scenario

Your organization needs more than GitHub-hosted runners can provide:

Requirements:
- GPU access for ML model training
- Access to private network resources
- Larger machines (64GB RAM, 32 cores)
- Custom software pre-installed
- Compliance: builds must run in our data center
- Cost: reduce the current $50k/month spend on GitHub-hosted runners

The Challenge

Design and implement a self-hosted runner infrastructure that’s secure, scalable, and cost-effective while meeting all requirements.

Wrong Approach

A junior engineer might just install runners on a few VMs, use the same runner for all jobs, run runners as root, or skip security hardening. These approaches create security vulnerabilities, don't scale, and make maintenance difficult.

Right Approach

A senior engineer implements ephemeral runners using Kubernetes or auto-scaling groups, isolates workloads, implements proper security controls, and automates runner lifecycle management.

Step 1: Choose the Right Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Self-Hosted Runner Architecture               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Option 1: Kubernetes (Actions Runner Controller)               │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Kubernetes Cluster                                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │  │
│  │  │ Runner Pod  │  │ Runner Pod  │  │ Runner Pod  │       │  │
│  │  │ (ephemeral) │  │ (ephemeral) │  │ (ephemeral) │       │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘       │  │
│  │           ▲                                               │  │
│  │           │ Scales based on pending jobs                 │  │
│  │  ┌────────┴────────┐                                     │  │
│  │  │ ARC Controller  │                                     │  │
│  │  └─────────────────┘                                     │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                  │
│  Option 2: VM Auto-Scaling (AWS/GCP/Azure)                      │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Auto Scaling Group                                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │  │
│  │  │ Runner VM   │  │ Runner VM   │  │ Runner VM   │       │  │
│  │  │ (ephemeral) │  │ (ephemeral) │  │ (ephemeral) │       │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘       │  │
│  │           ▲                                               │  │
│  │           │ Webhook triggers scaling                     │  │
│  │  ┌────────┴────────┐                                     │  │
│  │  │ Webhook Service │                                     │  │
│  │  └─────────────────┘                                     │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
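The steps below build out Option 1. ARC (Actions Runner Controller) suits containerized, bursty workloads and keeps ongoing maintenance low; VM auto-scaling remains the better fit for jobs that need a full VM, such as nested virtualization or toolchains that can't run in containers.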

Step 2: Deploy Actions Runner Controller (ARC)

# Install ARC (the summerwind controller) using Helm
# Prerequisite: cert-manager must already be installed in the cluster
# helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
# helm install arc actions-runner-controller/actions-runner-controller \
#   -n arc-system --create-namespace
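#
# ARC needs credentials to register runners. A PAT-based setup is sketched
# below (a GitHub App is the supported alternative); for organization-level
# runners the token needs the admin:org scope:
# kubectl create secret generic controller-manager -n arc-system \
#   --from-literal=github_token=<PAT>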

# runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runners
  namespace: arc-runners
spec:
  replicas: 2  # Starting count; omit this field when the autoscaler below manages scaling
  template:
    spec:
      organization: my-org
      labels:
        - self-hosted
        - linux
        - x64
      # Ephemeral - new runner for each job
      ephemeral: true
      # Resource requests
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      # Docker-in-Docker for container builds (note: dockerd needs a
      # privileged container, which weakens pod isolation)
      dockerdWithinRunnerContainer: true
      # Custom image with pre-installed tools
      image: ghcr.io/my-org/custom-runner:latest
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
---
# Horizontal Runner Autoscaler
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: org-runners-autoscaler
  namespace: arc-runners
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: org-runners
  minReplicas: 1
  maxReplicas: 20
  # Webhook-driven scale-up; requires ARC's webhook server to receive
  # workflow_job events from GitHub
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "2m"
      amount: 1
  scaleDownDelaySecondsAfterScaleOut: 300
  # Pull-based fallback: scale on the busy-runner ratio
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"
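Once the deployment is live, jobs opt in through labels. A minimal workflow targeting these runners (the job body and build command are placeholders):

# Example workflow targeting the runners above
name: Build

on:
  push:
    branches: [main]

jobs:
  build:
    # Must match the labels declared in the RunnerDeployment
    runs-on: [self-hosted, linux, x64]
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: make build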

Step 3: Create Custom Runner Image

# Dockerfile for custom runner
# Pin a specific tag in production rather than :latest
FROM ghcr.io/actions/actions-runner:latest

# Install as root for system packages
USER root

# Install common dependencies
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    git \
    gnupg \
    jq \
    lsb-release \
    unzip \
    docker.io \
    python3 \
    python3-pip \
    nodejs \
    npm \
    && rm -rf /var/lib/apt/lists/*

# Install specific tools
RUN curl -LO "https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/

RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install \
    && rm -rf aws awscliv2.zip

# Install Terraform
RUN wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg \
    && echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/hashicorp.list \
    && apt-get update && apt-get install -y terraform

# Switch back to runner user
USER runner

# Create the work directory (true pre-caching would clone action repos into it)
RUN mkdir -p /home/runner/_work/_actions
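The image must be rebuilt and pushed whenever the Dockerfile changes. One way to automate that, sketched with an assumed path (runner/Dockerfile) and the ghcr.io tag used above:

# .github/workflows/build-runner-image.yaml (path and trigger are assumptions)
name: Build Runner Image

on:
  push:
    paths:
      - 'runner/Dockerfile'

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      # Authenticate to GitHub Container Registry
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # Build the runner image and push the tag referenced by the RunnerDeployment
      - uses: docker/build-push-action@v5
        with:
          context: runner
          push: true
          tags: ghcr.io/my-org/custom-runner:latest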

Step 4: Runner Groups for Access Control

# Create runner groups in GitHub Organization Settings
# Settings > Actions > Runner groups

# Using API to manage runner groups
name: Setup Runner Groups

on:
  workflow_dispatch:

jobs:
  setup:
    runs-on: ubuntu-latest
    steps:
      - name: Create runner groups
        env:
          GH_TOKEN: ${{ secrets.ADMIN_TOKEN }}
        run: |
          # Create production runner group (-F sends typed values such as
          # booleans; array fields use the field[] syntax)
          gh api -X POST /orgs/my-org/actions/runner-groups \
            -f name="production-runners" \
            -f visibility="selected" \
            -F "selected_repository_ids[]=123" \
            -F "selected_repository_ids[]=456" \
            -F allows_public_repositories=false

          # Create general runner group
          gh api -X POST /orgs/my-org/actions/runner-groups \
            -f name="general-runners" \
            -f visibility="all" \
            -F allows_public_repositories=false
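Repositories granted access to a group can then target it directly; runs-on also accepts an object form with group and labels keys:

# In a workflow of a repository that belongs to the production-runners group
jobs:
  deploy:
    runs-on:
      group: production-runners
      labels: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v4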

Step 5: GPU Runner Configuration

# gpu-runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runners
  namespace: arc-runners
spec:
  replicas: 0  # Scale from 0
  template:
    spec:
      organization: my-org
      group: ml-runners
      labels:
        - self-hosted
        - linux
        - gpu
        - nvidia
      ephemeral: true
      # GPU node selector
      nodeSelector:
        nvidia.com/gpu: "true"
      # Request GPU resources
      resources:
        requests:
          nvidia.com/gpu: 1
          cpu: "8"
          memory: "32Gi"
        limits:
          nvidia.com/gpu: 1
          cpu: "16"
          memory: "64Gi"
      # Custom GPU-enabled image
      image: ghcr.io/my-org/gpu-runner:latest
      # Volume for model cache
      volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
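---
# The model-cache PVC referenced above: a minimal sketch. ReadWriteMany
# assumes an NFS-style storage class so multiple runner pods can share the
# cache; the storage size is an assumption.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: arc-runners
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi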

# Workflow using the GPU runners (separate file under .github/workflows/)
name: Train Model

on:
  push:
    paths:
      - 'ml/**'

jobs:
  train:
    runs-on: [self-hosted, gpu, nvidia]
    steps:
      - uses: actions/checkout@v4

      - name: Check GPU
        run: nvidia-smi

      - name: Train model
        run: python ml/train.py --use-gpu

Step 6: Security Hardening

# Secure runner deployment
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: secure-runners
spec:
  template:
    spec:
      organization: my-org
      labels:
        - self-hosted
        - secure
      # Always ephemeral for security
      ephemeral: true
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      # Container security
      containerSecurityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: false  # Runner needs to write
        capabilities:
          drop:
            - ALL
      # Network policy (applied separately, below): limit egress to the
      # endpoints runners actually need

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-network-policy
  namespace: arc-runners
spec:
  podSelector:
    matchLabels:
      app: runner
  policyTypes:
    - Egress
    - Ingress
  egress:
    # Allow HTTPS egress; in production, tighten the CIDR to GitHub's
    # published IP ranges (see https://api.github.com/meta)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
  ingress: []  # No inbound traffic needed
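Namespace-level guardrails complement the pod spec. A Pod Security Admission label is one option, with the caveat that Docker-in-Docker runners are privileged and will not pass the baseline or restricted levels:

# Sketch: enforce a Pod Security Standard on the runner namespace.
# Runners using dockerdWithinRunnerContainer would be rejected at this
# level; keep them in a separate namespace or switch to rootless or
# Kaniko-style builds.
apiVersion: v1
kind: Namespace
metadata:
  name: arc-runners
  labels:
    pod-security.kubernetes.io/enforce: baseline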

Step 7: Monitoring and Alerting

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-metrics
  namespace: arc-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: actions-runner-controller
  endpoints:
    - port: metrics
      interval: 30s
---
# Alert rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arc-alerts
spec:
  groups:
    - name: arc
      rules:
        # Metric names below are illustrative; confirm the exact names your
        # ARC version exposes on its /metrics endpoint
        - alert: RunnerQueueBacklog
          expr: github_actions_runner_job_queue_length > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Runner job queue is backing up"

        - alert: NoAvailableRunners
          expr: github_actions_runner_available == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "No runners available"

Runner Type Comparison

| Aspect         | GitHub-Hosted  | Self-Hosted (VM)    | Self-Hosted (K8s)   |
|----------------|----------------|---------------------|---------------------|
| Cost           | $0.008/min     | ~$0.002/min         | ~$0.001/min         |
| Setup          | None           | Medium              | Complex             |
| Maintenance    | None           | Medium              | Low (with ARC)      |
| Customization  | Limited        | Full                | Full                |
| Scaling        | Automatic      | Manual/Webhook      | Automatic           |
| Security       | GitHub managed | Your responsibility | Your responsibility |
| Network access | Public only    | Private + Public    | Private + Public    |
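A rough sanity check using the scenario's numbers: $50k/month at $0.008/min is about 6.25M job-minutes; the same volume at ~$0.001/min on Kubernetes comes to roughly $6.25k/month. That ~8x saving is before cluster costs and the engineering time ARC maintenance requires, so treat the per-minute figures as directional rather than a quote.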

Practice Question

Why should self-hosted runners be configured as ephemeral for security-sensitive workloads?