Interviews / DevOps & Cloud Infrastructure / GitHub-hosted runners don't meet our requirements. Configure self-hosted runners at scale.
GitHub-hosted runners don't meet our requirements. Configure self-hosted runners at scale.
The Scenario
Your organization needs more than GitHub-hosted runners can provide:
Requirements:
- GPU access for ML model training
- Access to private network resources
- Larger machines (64GB RAM, 32 cores)
- Custom software pre-installed
- Compliance: builds must run in our data center
- Cost: $50k/month on GitHub-hosted runners
The Challenge
Design and implement a self-hosted runner infrastructure that’s secure, scalable, and cost-effective while meeting all requirements.
Wrong Approach
A junior engineer might install runners on a few static VMs, reuse the same runner for every job, run the runner process as root, or skip security hardening entirely. These shortcuts create security vulnerabilities, don't scale, and make maintenance difficult.
Right Approach
A senior engineer runs ephemeral runners on Kubernetes or cloud auto-scaling groups, isolates workloads from one another, applies proper security controls, and automates the runner lifecycle end to end.
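From a workflow's point of view, none of this infrastructure is special: jobs opt into the fleet purely through `runs-on` labels. A minimal sketch (the label set here is an assumption and must match whatever labels the runner deployment registers):

```yaml
name: Build
on: [push]
jobs:
  build:
    # Routes the job to any idle self-hosted runner carrying all three labels
    runs-on: [self-hosted, linux, x64]
    steps:
      - uses: actions/checkout@v4
      - run: make build
```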
Step 1: Choose the Right Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Self-Hosted Runner Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Option 1: Kubernetes (Actions Runner Controller) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Kubernetes Cluster │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Runner Pod │ │ Runner Pod │ │ Runner Pod │ │ │
│ │ │ (ephemeral) │ │ (ephemeral) │ │ (ephemeral) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ ▲ │ │
│ │ │ Scales based on pending jobs │ │
│ │ ┌────────┴────────┐ │ │
│ │ │ ARC Controller │ │ │
│ │ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Option 2: VM Auto-Scaling (AWS/GCP/Azure) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Auto Scaling Group │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Runner VM │ │ Runner VM │ │ Runner VM │ │ │
│ │ │ (ephemeral) │ │ (ephemeral) │ │ (ephemeral) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ ▲ │ │
│ │ │ Webhook triggers scaling │ │
│ │ ┌────────┴────────┐ │ │
│ │ │ Webhook Service │ │ │
│ │ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 2: Deploy Actions Runner Controller (ARC)
# Install ARC using Helm
# helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
# helm install arc actions-runner-controller/actions-runner-controller -n arc-system
# runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runners
  namespace: arc-runners
spec:
  replicas: 2  # Minimum runners
  template:
    spec:
      organization: my-org
      labels:
        - self-hosted
        - linux
        - x64
      # Ephemeral - a fresh runner for each job
      ephemeral: true
      # Resource requests and limits
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      # Docker-in-Docker for container builds
      dockerdWithinRunnerContainer: true
      # Custom image with pre-installed tools
      image: ghcr.io/my-org/custom-runner:latest
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
---
# Horizontal Runner Autoscaler
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: org-runners-autoscaler
  namespace: arc-runners
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: org-runners
  minReplicas: 1
  maxReplicas: 20
  scaleUpTriggers:
    - duration: "2m"
      amount: 1
  scaleDownDelaySecondsAfterScaleOut: 300
  # Scale based on the share of busy runners
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"

Step 3: Create Custom Runner Image
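The runner image referenced above (`ghcr.io/my-org/custom-runner`) is built from the Dockerfile below. Rather than building it by hand, it can be rebuilt in CI whenever the Dockerfile changes; a hedged sketch using the standard docker/login-action and docker/build-push-action (the `runner/` context path is an assumption):

```yaml
name: Build runner image
on:
  push:
    paths: ['runner/Dockerfile']
jobs:
  image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: runner
          push: true
          tags: ghcr.io/my-org/custom-runner:latest
```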
# Dockerfile for custom runner
FROM ghcr.io/actions/actions-runner:latest

# Install as root for system packages
USER root

# Install common dependencies
RUN apt-get update && apt-get install -y \
      curl \
      wget \
      git \
      jq \
      unzip \
      docker.io \
      python3 \
      python3-pip \
      nodejs \
      npm \
    && rm -rf /var/lib/apt/lists/*

# Install kubectl
RUN curl -LO "https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/

# Install the AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install \
    && rm -rf aws awscliv2.zip

# Install Terraform
RUN wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg \
    && echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/hashicorp.list \
    && apt-get update && apt-get install -y terraform

# Switch back to the unprivileged runner user
USER runner

# Pre-create the runner work directory
RUN mkdir -p /home/runner/_work/_actions

Step 4: Runner Groups for Access Control
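Runner groups are what stop arbitrary repositories from scheduling jobs onto sensitive pools. Once the groups below exist, a job can target a specific group with the object form of `runs-on`; a brief sketch (the group and label names are assumptions matching this setup):

```yaml
jobs:
  deploy:
    runs-on:
      group: production-runners
      labels: [self-hosted, linux]
    steps:
      - run: ./deploy.sh
```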
# Runner groups can be created in the UI
# (Organization Settings > Actions > Runner groups)
# or managed via the API, as in this workflow:
name: Setup Runner Groups
on:
  workflow_dispatch:
jobs:
  setup:
    runs-on: ubuntu-latest
    steps:
      - name: Create runner groups
        env:
          GH_TOKEN: ${{ secrets.ADMIN_TOKEN }}
        run: |
          # Production runner group, limited to selected repositories
          gh api -X POST /orgs/my-org/actions/runner-groups \
            -f name="production-runners" \
            -f visibility="selected" \
            -F "selected_repository_ids[]=123" \
            -F "selected_repository_ids[]=456" \
            -F allows_public_repositories=false
          # General runner group, visible to all repositories
          gh api -X POST /orgs/my-org/actions/runner-groups \
            -f name="general-runners" \
            -f visibility="all" \
            -F allows_public_repositories=false

Step 5: GPU Runner Configuration
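The GPU deployment below mounts a `model-cache` volume backed by a PersistentVolumeClaim named `model-cache-pvc`, which is referenced but not defined there. A minimal sketch of that claim (the access mode, default storage class, and 100Gi size are assumptions):

```yaml
# model-cache-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: arc-runners
spec:
  # ReadWriteMany lets concurrent runner pods share the cache
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi
```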
# gpu-runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runners
  namespace: arc-runners
spec:
  replicas: 0  # Scale from zero
  template:
    spec:
      organization: my-org
      group: ml-runners
      labels:
        - self-hosted
        - linux
        - gpu
        - nvidia
      ephemeral: true
      # Schedule only onto GPU nodes
      nodeSelector:
        nvidia.com/gpu: "true"
      # Request GPU resources
      resources:
        requests:
          nvidia.com/gpu: 1
          cpu: "8"
          memory: "32Gi"
        limits:
          nvidia.com/gpu: 1
          cpu: "16"
          memory: "64Gi"
      # Custom GPU-enabled image
      image: ghcr.io/my-org/gpu-runner:latest
      # Volume for the model cache
      volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

# Workflow using the GPU runners
name: Train Model
on:
  push:
    paths:
      - 'ml/**'
jobs:
  train:
    runs-on: [self-hosted, gpu, nvidia]
    steps:
      - uses: actions/checkout@v4
      - name: Check GPU
        run: nvidia-smi
      - name: Train model
        run: python ml/train.py --use-gpu

Step 6: Security Hardening
# Secure runner deployment
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: secure-runners
spec:
  template:
    spec:
      organization: my-org
      labels:
        - self-hosted
        - secure
      # Always ephemeral for security
      ephemeral: true
      # Pod security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      # Container security
      containerSecurityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: false  # The runner needs a writable filesystem
        capabilities:
          drop:
            - ALL

# network-policy.yaml - applied separately; limits runner traffic
# to only the endpoints jobs actually need
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-network-policy
  namespace: arc-runners
spec:
  podSelector:
    matchLabels:
      app: runner
  policyTypes:
    - Egress
    - Ingress
  egress:
    # Allow HTTPS (in production, tighten the CIDR to GitHub's published IP ranges)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
  ingress: []  # No inbound traffic needed

Step 7: Monitoring and Alerting
# ServiceMonitor so Prometheus scrapes ARC controller metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-metrics
  namespace: arc-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: actions-runner-controller
  endpoints:
    - port: metrics
      interval: 30s
---
# Grafana Dashboard (JSON model)
# Alert rules (metric names vary by ARC version; check what your deployment exports)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arc-alerts
spec:
  groups:
    - name: arc
      rules:
        - alert: RunnerQueueBacklog
          expr: github_actions_runner_job_queue_length > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Runner job queue is backing up"
        - alert: NoAvailableRunners
          expr: github_actions_runner_available == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "No runners available"

Runner Type Comparison
| Aspect | GitHub-Hosted | Self-Hosted (VM) | Self-Hosted (K8s) |
|---|---|---|---|
| Cost | $0.008/min | ~$0.002/min | ~$0.001/min |
| Setup | None | Medium | Complex |
| Maintenance | None | Medium | Low (with ARC) |
| Customization | Limited | Full | Full |
| Scaling | Automatic | Manual/Webhook | Automatic |
| Security | GitHub managed | Your responsibility | Your responsibility |
| Network access | Public only | Private + Public | Private + Public |
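The per-minute rates above are rough, but plugging the scenario's $50k/month GitHub-hosted bill into them gives a sense of scale. A sketch in integer arithmetic (it ignores idle capacity, egress, and operations time, which real self-hosted costs include):

```shell
# $50,000/month at $0.008/min implies this many billable minutes:
minutes=$((50000 * 1000 / 8))      # = 50000 / 0.008, kept in integer math
echo "GitHub-hosted volume: ${minutes} min/month"

# The same volume at the table's ~\$0.001/min self-hosted (K8s) estimate:
selfhosted=$((minutes / 1000))     # = minutes * 0.001
echo "Self-hosted (K8s) compute: ~\$${selfhosted}/month"
```

Even this crude estimate shows an order-of-magnitude gap, which is why the table's maintenance and security columns, not raw compute price, usually decide the choice.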
Practice Question
Why should self-hosted runners be configured as ephemeral for security-sensitive workloads?