Interviews / DevOps & Cloud Infrastructure / GitHub-hosted runners don't meet our requirements. Configure self-hosted runners at scale.
GitHub-hosted runners don't meet our requirements. Configure self-hosted runners at scale.
The Scenario
Your organization needs more than GitHub-hosted runners can provide:
Requirements:
- GPU access for ML model training
- Access to private network resources
- Larger machines (64GB RAM, 32 cores)
- Custom software pre-installed
- Compliance: builds must run in our data center
- Cost: $50k/month on GitHub-hosted runners
The Challenge
Design and implement a self-hosted runner infrastructure that’s secure, scalable, and cost-effective while meeting all requirements.
Wrong Approach
A junior engineer might install runners on a few static VMs, reuse the same runner for every job, run the runner process as root, or skip security hardening entirely. These shortcuts create security vulnerabilities, don't scale, and make maintenance difficult.
Right Approach
A senior engineer runs ephemeral runners on Kubernetes or cloud auto-scaling groups, isolates workloads from one another, applies proper security controls, and automates the runner lifecycle end to end.
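From a workflow's point of view, none of this infrastructure is special: jobs opt into the fleet purely through `runs-on` labels. A minimal sketch (the label set here is an assumption and must match whatever labels the runner deployment registers):

```yaml
name: Build
on: [push]
jobs:
  build:
    # Routes the job to any idle self-hosted runner carrying all three labels
    runs-on: [self-hosted, linux, x64]
    steps:
      - uses: actions/checkout@v4
      - run: make build
```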
Step 1: Choose the Right Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Self-Hosted Runner Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Option 1: Kubernetes (Actions Runner Controller) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Kubernetes Cluster │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Runner Pod │ │ Runner Pod │ │ Runner Pod │ │ │
│ │ │ (ephemeral) │ │ (ephemeral) │ │ (ephemeral) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ ▲ │ │
│ │ │ Scales based on pending jobs │ │
│ │ ┌────────┴────────┐ │ │
│ │ │ ARC Controller │ │ │
│ │ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Option 2: VM Auto-Scaling (AWS/GCP/Azure) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Auto Scaling Group │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Runner VM │ │ Runner VM │ │ Runner VM │ │ │
│ │ │ (ephemeral) │ │ (ephemeral) │ │ (ephemeral) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ ▲ │ │
│ │ │ Webhook triggers scaling │ │
│ │ ┌────────┴────────┐ │ │
│ │ │ Webhook Service │ │ │
│ │ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 2: Deploy Actions Runner Controller (ARC)
# Install ARC using Helm
# helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
# helm install arc actions-runner-controller/actions-runner-controller -n arc-system
# runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runners
  namespace: arc-runners
spec:
  replicas: 2  # Minimum runners
  template:
    spec:
      organization: my-org
      labels:
        - self-hosted
        - linux
        - x64
      # Ephemeral - a fresh runner for each job
      ephemeral: true
      # Resource requests and limits
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      # Docker-in-Docker for container builds
      dockerdWithinRunnerContainer: true
      # Custom image with pre-installed tools
      image: ghcr.io/my-org/custom-runner:latest
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
---
# Horizontal Runner Autoscaler
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: org-runners-autoscaler
  namespace: arc-runners
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: org-runners
  minReplicas: 1
  maxReplicas: 20
  scaleUpTriggers:
    - duration: "2m"
      amount: 1
  scaleDownDelaySecondsAfterScaleOut: 300
  # Scale based on the share of busy runners
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"

Step 3: Create Custom Runner Image
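The runner image referenced above (`ghcr.io/my-org/custom-runner`) is built from the Dockerfile below. Rather than building it by hand, it can be rebuilt in CI whenever the Dockerfile changes; a hedged sketch using the standard docker/login-action and docker/build-push-action (the `runner/` context path is an assumption):

```yaml
name: Build runner image
on:
  push:
    paths: ['runner/Dockerfile']
jobs:
  image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: runner
          push: true
          tags: ghcr.io/my-org/custom-runner:latest
```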
# Dockerfile for custom runner
FROM ghcr.io/actions/actions-runner:latest

# Install as root for system packages
USER root

# Install common dependencies
RUN apt-get update && apt-get install -y \
      curl \
      wget \
      git \
      jq \
      unzip \
      docker.io \
      python3 \
      python3-pip \
      nodejs \
      npm \
    && rm -rf /var/lib/apt/lists/*

# Install kubectl
RUN curl -LO "https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/

# Install the AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install \
    && rm -rf aws awscliv2.zip

# Install Terraform
RUN wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg \
    && echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/hashicorp.list \
    && apt-get update && apt-get install -y terraform

# Switch back to the unprivileged runner user
USER runner

# Pre-create the runner work directory
RUN mkdir -p /home/runner/_work/_actions

Step 4: Runner Groups for Access Control
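Runner groups are what stop arbitrary repositories from scheduling jobs onto sensitive pools. Once the groups below exist, a job can target a specific group with the object form of `runs-on`; a brief sketch (the group and label names are assumptions matching this setup):

```yaml
jobs:
  deploy:
    runs-on:
      group: production-runners
      labels: [self-hosted, linux]
    steps:
      - run: ./deploy.sh
```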
# Runner groups can be created in the UI
# (Organization Settings > Actions > Runner groups)
# or managed via the API, as in this workflow:
name: Setup Runner Groups
on:
  workflow_dispatch:
jobs:
  setup:
    runs-on: ubuntu-latest
    steps:
      - name: Create runner groups
        env:
          GH_TOKEN: ${{ secrets.ADMIN_TOKEN }}
        run: |
          # Production runner group, limited to selected repositories
          gh api -X POST /orgs/my-org/actions/runner-groups \
            -f name="production-runners" \
            -f visibility="selected" \
            -F "selected_repository_ids[]=123" \
            -F "selected_repository_ids[]=456" \
            -F allows_public_repositories=false
          # General runner group, visible to all repositories
          gh api -X POST /orgs/my-org/actions/runner-groups \
            -f name="general-runners" \
            -f visibility="all" \
            -F allows_public_repositories=false

Step 5: GPU Runner Configuration
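The GPU deployment below mounts a `model-cache` volume backed by a PersistentVolumeClaim named `model-cache-pvc`, which is referenced but not defined there. A minimal sketch of that claim (the access mode, default storage class, and 100Gi size are assumptions):

```yaml
# model-cache-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: arc-runners
spec:
  # ReadWriteMany lets concurrent runner pods share the cache
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi
```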
# gpu-runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runners
  namespace: arc-runners
spec:
  replicas: 0  # Scale from zero
  template:
    spec:
      organization: my-org
      group: ml-runners
      labels:
        - self-hosted
        - linux
        - gpu
        - nvidia
      ephemeral: true
      # Schedule only onto GPU nodes
      nodeSelector:
        nvidia.com/gpu: "true"
      # Request GPU resources
      resources:
        requests:
          nvidia.com/gpu: 1
          cpu: "8"
          memory: "32Gi"
        limits:
          nvidia.com/gpu: 1
          cpu: "16"
          memory: "64Gi"
      # Custom GPU-enabled image
      image: ghcr.io/my-org/gpu-runner:latest
      # Volume for the model cache
      volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

# Workflow using the GPU runners
name: Train Model
on:
  push:
    paths:
      - 'ml/**'
jobs:
  train:
    runs-on: [self-hosted, gpu, nvidia]
    steps:
      - uses: actions/checkout@v4
      - name: Check GPU
        run: nvidia-smi
      - name: Train model
        run: python ml/train.py --use-gpu

Step 6: Security Hardening
# Secure runner deployment
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: secure-runners
spec:
  template:
    spec:
      organization: my-org
      labels:
        - self-hosted
        - secure
      # Always ephemeral for security
      ephemeral: true
      # Pod security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      # Container security
      containerSecurityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: false  # The runner needs a writable filesystem
        capabilities:
          drop:
            - ALL

# network-policy.yaml - applied separately; limits runner traffic
# to only the endpoints jobs actually need
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-network-policy
  namespace: arc-runners
spec:
  podSelector:
    matchLabels:
      app: runner
  policyTypes:
    - Egress
    - Ingress
  egress:
    # Allow HTTPS (in production, tighten the CIDR to GitHub's published IP ranges)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
  ingress: []  # No inbound traffic needed

Step 7: Monitoring and Alerting
# ServiceMonitor so Prometheus scrapes ARC controller metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-metrics
  namespace: arc-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: actions-runner-controller
  endpoints:
    - port: metrics
      interval: 30s
---
# Grafana Dashboard (JSON model)
# Alert rules (metric names vary by ARC version; check what your deployment exports)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arc-alerts
spec:
  groups:
    - name: arc
      rules:
        - alert: RunnerQueueBacklog
          expr: github_actions_runner_job_queue_length > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Runner job queue is backing up"
        - alert: NoAvailableRunners
          expr: github_actions_runner_available == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "No runners available"

Runner Type Comparison
| Aspect | GitHub-Hosted | Self-Hosted (VM) | Self-Hosted (K8s) |
|---|---|---|---|
| Cost | $0.008/min | ~$0.002/min | ~$0.001/min |
| Setup | None | Medium | Complex |
| Maintenance | None | Medium | Low (with ARC) |
| Customization | Limited | Full | Full |
| Scaling | Automatic | Manual/Webhook | Automatic |
| Security | GitHub managed | Your responsibility | Your responsibility |
| Network access | Public only | Private + Public | Private + Public |
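The per-minute rates above are rough, but plugging the scenario's $50k/month GitHub-hosted bill into them gives a sense of scale. A sketch in integer arithmetic (it ignores idle capacity, egress, and operations time, which real self-hosted costs include):

```shell
# $50,000/month at $0.008/min implies this many billable minutes:
minutes=$((50000 * 1000 / 8))      # = 50000 / 0.008, kept in integer math
echo "GitHub-hosted volume: ${minutes} min/month"

# The same volume at the table's ~\$0.001/min self-hosted (K8s) estimate:
selfhosted=$((minutes / 1000))     # = minutes * 0.001
echo "Self-hosted (K8s) compute: ~\$${selfhosted}/month"
```

Even this crude estimate shows an order-of-magnitude gap, which is why the table's maintenance and security columns, not raw compute price, usually decide the choice.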
Practice Question
Why should self-hosted runners be configured as ephemeral for security-sensitive workloads?