Enterprise Runners¶

Overview¶

Base Coat leverages enterprise runners from the shared-build-agents pool to enable scalable, high-performance CI/CD workflows. Enterprise runners provide specialized hardware capabilities, network isolation, and advanced security features beyond what GitHub-hosted runners offer.

Available Runner Pools¶

The following enterprise runner pools are available:

shared-build-agents - Primary pool for general-purpose CI/CD tasks
shared-build-agents-gpu - GPU-accelerated runners for ML/AI workloads
shared-build-agents-large - High-memory, multi-CPU runners for intensive operations

Runner Labels and Capabilities¶

Enterprise runners expose capabilities through labels that allow precise runner selection in workflow YAML.

Operating System Labels¶

Label	OS	Version	Notes
`ubuntu-latest`	Linux	Ubuntu 22.04 LTS	Standard Ubuntu runner
`ubuntu-22.04`	Linux	Ubuntu 22.04	Pinned version
`ubuntu-20.04`	Linux	Ubuntu 20.04	Legacy support
`windows-latest`	Windows	Windows Server 2022	Standard Windows runner
`windows-2022`	Windows	Windows Server 2022	Pinned version
`macos-latest`	macOS	macOS 12+	Intel-based
`macos-arm64`	macOS	macOS 12+	Apple Silicon (M1/M2)

Architecture-Specific Labels¶

Label	Architecture	Use Cases
`x86_64`	Intel/AMD 64-bit	Standard Linux/Windows workloads
`arm64`	ARM 64-bit	Embedded systems, mobile, Apple Silicon
`armv7`	ARM 32-bit	Legacy ARM systems

GPU-Enabled Runners¶

GPU runners are available for machine learning, image processing, and compute-intensive workloads.

Label	GPU	VRAM	Notes
`gpu-nvidia-a100`	NVIDIA A100	80GB	High-performance AI training
`gpu-nvidia-a10`	NVIDIA A10	24GB	General ML inference
`gpu-nvidia-t4`	NVIDIA T4	16GB	Cost-effective inference

Large Runner Labels¶

Label	CPU Cores	Memory	Storage	Notes
`large-x86`	16+	64GB	500GB+	Intensive builds, large test suites
`large-memory`	8	256GB	1TB+	Memory-intensive workloads
`large-storage`	8	64GB	2TB+	Large artifact handling

Combined Label Examples¶

Runners may have multiple labels assigned. Common combinations include:

shared-build-agents
ubuntu-latest
x86_64
self-hosted
linux

shared-build-agents-gpu
ubuntu-latest
gpu-nvidia-a100
x86_64
self-hosted
linux

shared-build-agents-large
ubuntu-latest
large-x86
self-hosted
linux

Runner Selection in Workflows¶

Basic Runner Selection¶

Specify a single runner or multiple labels in your workflow:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run build

Multiple Label Requirements¶

Use an array to require multiple labels (AND logic):

jobs:
  gpu-inference:
    runs-on:
      - self-hosted
      - linux
      - gpu-nvidia-a100
    steps:
      - uses: actions/checkout@v4
      - run: python inference.py

Fallback Runner Selection¶

Use a strategy matrix to try runners in order of preference:

jobs:
  build:
    strategy:
      matrix:
        runner:
          - large-x86
          - ubuntu-latest
    runs-on:
      - ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: npm run build

Enterprise Runner with Specific Architecture¶

Target ARM64 architecture on macOS:

jobs:
  build-apple-silicon:
    runs-on:
      - macos-arm64
      - self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: npm run build

Self-Hosted vs GitHub-Hosted Runners¶

GitHub-Hosted Runners¶

GitHub-hosted runners are ephemeral, managed by GitHub, and included with GitHub Actions.

Pros:

No infrastructure management required
Automatic OS patching and updates
Immediate availability, no queue
Clean environment for each job
Included in plan (2,000 minutes/month free on public repos)

Cons:

Limited to 20 GB of storage
7 GB of RAM (standard), 14 GB (large)
Slower for large monorepos or extensive artifact handling
Cannot access private networks or on-premises resources
Limited customization options

Self-Hosted Enterprise Runners¶

Self-hosted runners are provisioned from the enterprise runner pool and managed by the infrastructure team.

Pros:

Customizable hardware (CPU, memory, storage, GPU)
Access to private networks and on-premises resources
Persistent caching reduces build times
Cost-effective for high-volume CI/CD
Support for specialized dependencies (CUDA, ML frameworks)
Network isolation and security controls

Cons:

Requires infrastructure maintenance
Potential queue during peak usage
Security responsibility shifts to team
Must manage runner software updates
Idle runner capacity costs

Decision Matrix¶

Requirement	GitHub-Hosted	Self-Hosted
Basic Node/Python builds	✅ Preferred	⏸ Overkill
GPU workloads	❌ Unavailable	✅ Required
>100GB artifacts	❌ Limited	✅ Supported
Private network access	❌ No	✅ Yes
High-volume CI/CD	❌ Expensive	✅ Cost-effective
Regulatory compliance	❌ Shared tenancy	✅ Isolated
Persistent cache	❌ Not available	✅ Supported

Security Considerations¶

Network Isolation¶

Enterprise runners operate in isolated network segments with restricted egress rules.

Default Outbound Access¶

✅ GitHub API and repository access
✅ npm, PyPI, and other public package registries
❌ Direct internet access restricted
❌ Private network access requires VPN tunnel

Requesting Network Access¶

Contact the infrastructure team to add network routes for specific services:

Request format:
- Destination CIDR or hostname
- Port and protocol (TCP/UDP)
- Justification and use case
- Expected data volume

Secrets Management¶

Secrets are encrypted at rest and in transit. Handle with care:

Best Practices¶

Use GitHub Secrets for sensitive data (tokens, credentials)
Never log secrets in workflow output
Rotate secrets regularly (quarterly minimum)
Use separate secrets for each environment (dev, staging, prod)
Audit secret access through GitHub's audit logs

Example: Safe Secret Handling¶

jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Deploy with credentials
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
        run: |
          # Token is available only in this step
          ./scripts/deploy.sh
          # Token is not logged by default
          echo "Deployment complete"

Access Control¶

Runner access is governed by GitHub team membership and repository permissions.

Permission Levels¶

Repository Collaborators - Can trigger workflows (all runners available)
Organization Members - Can access shared runners if granted permission
External Contributors - Workflows require approval before running on enterprise runners

Approval Workflow¶

Contributor opens pull request
GitHub Actions workflow requires approval (first-time contributors)
Repository maintainer reviews PR and approves workflow run
Workflow proceeds on appropriate runner

Audit Logging¶

All runner activity is logged and auditable:

# View runner usage in GitHub

gh api repos/owner/repo/actions/runs \
  --paginate \
  --jq '.workflow_runs[] | {name, status, runner_name, created_at}'

Capacity Planning and Autoscaling¶

Current Capacity¶

Pool	Total Runners	Typical Queue Time	Peak Utilization
shared-build-agents	16	< 1 min	80-90%
shared-build-agents-gpu	4	5-10 min	85-95%
shared-build-agents-large	8	2-5 min	75-85%

Autoscaling Policy¶

The enterprise runner infrastructure auto-scales based on queue depth and wait times.

Scaling Triggers¶

Queue depth > 5 jobs for > 5 minutes → scale up
Average wait time > 10 minutes → scale up
CPU utilization > 90% for > 10 minutes → scale up
Idle runners > 30 minutes → scale down

Scaling Limits¶

Maximum concurrent runners: 32 per pool
Scale-up increment: 2-4 runners per event
Scale-down window: 30-60 minutes idle time

Monitoring Queue Depth¶

Check runner availability before scheduling resource-intensive jobs:

jobs:
  check-capacity:
    runs-on: ubuntu-latest
    steps:
      - name: Check GPU runner availability
        run: |
          curl -s <https://runner-api.internal/capacity/gpu> | jq .
          # Response: {"available": 2, "total": 4, "queue_length": 3}

Capacity Requests¶

For sustained capacity increases, submit a capacity request:

Subject: Enterprise Runner Capacity Request - [PROJECT]

Current Usage:
- Pool: shared-build-agents-gpu
- Daily runs: 50-100
- Average queue time: 15-20 minutes
- Peak concurrent jobs: 8-10

Requested Capacity:
- Additional runners: 4x NVIDIA A100
- Justification: Parallel training jobs for ML pipeline
- Expected ROI: 30% faster model iteration

Timeline: Needed by [DATE]

Troubleshooting Common Runner Issues¶

Runner Not Available¶

Symptom: Workflow stuck in "Waiting for a runner..." state.

Diagnosis¶

Check runner labels match job requirements:

gh api repos/owner/repo/actions/runs/[run_id] \
  --jq '.jobs[] | {name, status, labels: .runs_on}'

Verify runner pool status:

curl -s <https://runner-api.internal/status> | jq '.pools'

Check for labeling issues:

# ❌ Wrong - array of strings treated as OR

runs-on:
  - self-hosted
  - gpu-nvidia-a100

# ✅ Correct - array requires all labels (AND)

runs-on:
  - self-hosted
  - gpu-nvidia-a100

Resolution¶

Wait for capacity or request additional runners
Use fallback runner: runs-on: ubuntu-latest
Schedule job during off-peak hours
Split large jobs into smaller parallel tasks

Slow Runner Performance¶

Symptom: Job takes significantly longer than previous runs.

Diagnosis steps

Check runner CPU/memory during job:

# On the runner (via SSH)

top -b -n 1 | head -20
free -h
df -h /var/lib/docker

Compare with baseline metrics:

gh run view [run_id] --json jobs --jq '.jobs[] | {name, conclusion, durationMinutes}'

Check for I/O contention:

iostat -x 1 5  # Run during workflow execution

Resolution:

Switch to larger runner: large-x86 instead of standard
Optimize job to use less resources (parallel steps, caching)
Schedule during off-peak to avoid contention
Check for runaway processes: ps aux --sort=-%cpu

Network Connectivity Issues¶

Symptom: Download/upload failures, package registry timeouts.

Diagnosis steps

Test basic connectivity:

curl -v <https://registry.npmjs.org>
ping 8.8.8.8
traceroute github.com

Check network policies:

# From runner, attempt restricted destinations

curl -v <https://internal-service.local>

Review firewall logs (infrastructure team):

sudo journalctl -u firewall | tail -50

Resolution:

Retry with exponential backoff in workflow
Use CDN/mirror for package registry if available
Request network route approval for private services
Check for DNS resolution issues: nslookup package-name.com

Out of Disk Space¶

Symptom: Job fails with no space left on device.

Diagnosis steps

Check disk usage:

df -h /
du -sh /* | sort -hr | head -10

Identify large files:

find / -type f -size +1G 2>/dev/null
docker system df

Review artifact retention:

ls -lah ~/.cache ~/.m2 /var/lib/docker/containers

Resolution:

Clean workspace between jobs: rm -rf *
Use Docker layer caching: actions/setup-buildx-action@v3
Compress artifacts before upload
Reduce artifact retention: retention-days: 5
Request larger runner pool: large-storage

Integration with Base Coat CI Workflows¶

Standard Base Coat Workflow¶

Base Coat CI uses a matrix strategy to test across multiple runners:

name: Base Coat CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  test:
    strategy:
      matrix:
        runner:
          - ubuntu-latest
          - ubuntu-22.04
        node-version:
          - 18
          - 20
    runs-on:
      - ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - run: npm run test
      - uses: codecov/codecov-action@v3

GPU-Accelerated Builds¶

For ML model training and inference:

jobs:
  train-model:
    runs-on:
      - self-hosted
      - gpu-nvidia-a100
    container:
      image: nvidia/cuda:12.2.0-devel-ubuntu22.04
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements-gpu.txt
      - name: Train model
        run: python train.py --epochs 100 --batch-size 256
      - name: Upload checkpoint
        uses: actions/upload-artifact@v3
        with:
          name: model-checkpoint
          path: checkpoints/

Large-Scale Build¶

For monorepos and intensive compilation:

jobs:
  build-monorepo:
    runs-on:
      - self-hosted
      - large-x86
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for size calculation
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run build:all
      - name: Upload build artifacts
        uses: actions/upload-artifact@v3
        with:
          name: build-output
          path: dist/
          retention-days: 30

Conditional Runner Selection¶

Route jobs to appropriate runners based on conditions:

jobs:
  adaptive-test:
    runs-on: ${{ github.event_name == 'pull_request' && 'ubuntu-latest' || 'self-hosted' }}
    steps:
      - uses: actions/checkout@v4
      - run: npm run test

Cross-Platform Builds¶

Test across multiple OS targets:

jobs:
  build-all-platforms:
    strategy:
      matrix:
        include:
          - os: ubuntu-latest
            artifact: app-linux
          - os: windows-2022
            artifact: app-windows.exe
          - os: macos-arm64
            artifact: app-macos
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build
      - uses: actions/upload-artifact@v3
        with:
          name: ${{ matrix.artifact }}
          path: dist/

Contact and Support¶

For issues, questions, or capacity requests:

Infrastructure Team: #infrastructure-support (Slack)
GitHub Actions Docs: https://docs.github.com/en/actions
Enterprise Runner Status: https://runner-status.internal
Capacity Planning: Submit request via infrastructure portal