Enterprise Kubernetes Operational Patterns¶
This document provides best practices for operating Kubernetes workloads in enterprise environments, with emphasis on Azure Kubernetes Service (AKS).
Cluster Architecture¶
Node Pools¶
Use multiple node pools for workload isolation:
yaml
apiVersion: Microsoft.ContainerService/managedClusters
kind: AKS
metadata:
name: prod-cluster
spec:
agentPoolProfiles:
- name: system
count: 3
vmSize: Standard_DS2_v2
osType: Linux
mode: System # System pods only
- name: workload
count: 6
vmSize: Standard_D4s_v3
osType: Linux
mode: User # User workloads
- name: gpu
count: 2
vmSize: Standard_NC6s_v3 # GPU VMs
osType: Linux
mode: User
Pod Security Standards¶
Enforce pod security policies:
- Restricted: Minimal privileges; read-only filesystem
- Baseline: Allow common configurations
- Unrestricted: Legacy workloads (deprecated)
Network Policies¶
Control traffic between pods:
yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-ingress
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
role: frontend
Observability¶
Prometheus and Grafana¶
Collect metrics and visualize:
- CPU/Memory usage: Per pod, per node
- Request latency: P50, P95, P99
- Error rates: By service and endpoint
- Custom metrics: Business events
Container Insights¶
Native AKS monitoring:
- Cluster health: Node, pod, container metrics
- Performance: Compare baseline vs. current
- Logs: Container stdout/stderr in Log Analytics
- Alerts: Thresholds for resource utilization
Security¶
RBAC¶
Control who can access what:
`ash
Create role for developers¶
kubectl create role developer --verb=get,list,watch --resource=pods kubectl create rolebinding dev-binding --role=developer --user=developer@company.com `
Network Policies¶
Restrict inter-pod communication by default:
yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
spec:
podSelector: {}
policyTypes:
- Ingress
Resource Management¶
Resource Requests and Limits¶
Define resource requirements for each pod:
yaml
apiVersion: v1
kind: Pod
metadata:
name: app-pod
spec:
containers:
- name: app
image: myapp:1.0
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
Horizontal Pod Autoscaler¶
Scale pods based on metrics:
yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Vertical Pod Autoscaler¶
Right-size resource requests based on usage.
Storage¶
StatefulSets¶
For applications requiring stable identity:
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: db-statefulset
spec:
serviceName: db-service
replicas: 3
selector:
matchLabels:
app: db
template:
metadata:
labels:
app: db
spec:
containers:
- name: db
image: postgres:14
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 100Gi
Persistent Volumes¶
Use Azure Disk or Azure Files:
yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: app-pvc
spec:
accessModes:
- ReadWriteOnce
storageClassName: managed-premium
resources:
requests:
storage: 50Gi
Deployment Strategies¶
Rolling Update¶
Gradually replace old pods with new:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # No downtime
selector:
matchLabels:
app: app
template:
metadata:
labels:
app: app
spec:
containers:
- name: app
image: myapp:v2
Blue-Green Deployment¶
Run two identical environments; switch traffic:
- Blue: Current production version
- Green: New version (tested in parallel)
- Switch: Point load balancer to green; monitor
Backup and Disaster Recovery¶
Velero¶
Backup Kubernetes cluster state:
`ash
Install Velero¶
velero install --provider azure --bucket mybackups --secret-file credentials-azureblobstorage
Create backup¶
velero backup create my-backup
Restore from backup¶
velero restore create --from-backup my-backup `
Cluster Recovery¶
Steps to recover from cluster failure:
- Create new AKS cluster in same region
- Restore applications from backup (Velero)
- Point DNS to new cluster
- Validate data and functionality
Base Coat Assets¶
Related agents & skills:
- Agent: \gents/azure-kubernetes.agent.md\ — AKS cluster design and operations
- Agent: \gents/sre-engineer.agent.md\ — Reliability patterns and monitoring
- Skill: \skills/kubernetes-operators/\ — Automation and orchestration
- Instruction: \instructions/kubernetes-security-hardening.instructions.md\
Next Steps¶
- Design: Plan node pools, pod security, and network policies
- Deploy: Create AKS cluster with monitoring enabled
- Secure: Implement RBAC and network policies
- Monitor: Set up Prometheus/Grafana and Container Insights
- Backup: Configure Velero for cluster recovery