Why Your CI/CD Looks Fine But Still Breaks on Fridays: 3 Hidden Kubernetes Mistakes
Three real configuration mistakes in CI/CD and Kubernetes that look correct but cause production incidents. Practical fixes with code examples for Moldova, Romania and EU companies.
WebDirect Team

Your CI/CD pipeline is set up. Tests pass. Linters are quiet. Deployments are automated. And yet on Friday evening something breaks, clients write in, and you're handling an incident instead of dinner. Sound familiar? It's not that your DevOps is 'bad.' It's a few non-obvious configuration mistakes that look perfectly fine on paper but quietly destroy production stability.

If Friday is more dangerous than Tuesday, you have a systemic problem, not bad timing. A reliable pipeline doesn't know what day of the week it is. In this article, we break down three real mistakes we regularly find when auditing infrastructure for companies in Moldova, Romania, and across the EU, with code, explanation, and concrete fixes.

Mistake #1. No Resource Limits: Your Pods Consume Everything in Sight

When a developer creates a Kubernetes deployment without
resources.requests and resources.limits, the scheduler doesn't know how many resources the pod needs. It places it 'wherever it fits.' Under peak load, one service can consume all available memory on a node, triggering a 'noisy neighbour' effect: other pods get OOMKilled or throttled for no visible reason. The cost of this mistake: cascading failure, with multiple services unavailable simultaneously.

The wrong configuration is a deployment with no resources specified, which forces the scheduler to guess. The correct approach is to explicitly set both requests (guaranteed minimum) and limits (maximum allowed). For example: requests of 256Mi memory and 250m CPU, with limits of 512Mi memory and 500m CPU.

Practical fix: run kubectl top pods in staging under real load. Because kubectl top shows only a point-in-time snapshot, sample it repeatedly (or use your metrics system) to get percentiles. Use the p95 consumption as requests, and p99 + 30% as limits. Set up a LimitRange on the namespace so that pods without explicit limits get sensible defaults, not 'everything available.'

Mistake #2. RollingUpdate Without readinessProbe: Deployment 'Succeeds,' Service is Unavailable

By default, Kubernetes considers a pod 'ready' as soon as the container starts. For a simple Hello World, that's fine. For a real application that needs 20–60 seconds to initialize, it's a disaster. RollingUpdate sends traffic to the new pod before it's ready to accept it. Users get errors. CI/CD reports 'deployment successful' because technically the pods started. The cost of this mistake: 2–10 minutes of downtime per deployment × N deployments per day.

The readinessProbe endpoint should return 200 only when the app is genuinely ready: connected to DB, cache warmed, configs loaded. Meanwhile, livenessProbe checks if the process is still alive and restarts the container if needed.
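To make the split between the two probes concrete, here is a minimal application-side sketch in Python. It is an illustration under stated assumptions, not a prescribed implementation: the paths /healthz and /ready, the port 8080, the step names, and the AppState, probe_status, make_handler and make_server helpers are all hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Tracks which startup steps have finished (step names are hypothetical)."""

    REQUIRED = ("db_connected", "cache_warmed", "config_loaded")

    def __init__(self):
        self._done = set()
        self._lock = threading.Lock()

    def mark_done(self, step):
        if step not in self.REQUIRED:
            raise ValueError(f"unknown startup step: {step}")
        with self._lock:
            self._done.add(step)

    def is_ready(self):
        with self._lock:
            return self._done.issuperset(self.REQUIRED)


def probe_status(path, state):
    """Return the HTTP status code for a probe request."""
    if path == "/healthz":  # liveness: the process is up and answering
        return 200
    if path == "/ready":  # readiness: only once every startup step is done
        return 200 if state.is_ready() else 503
    return 404


def make_handler(state):
    """Build a request handler class bound to the given AppState."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(self.path, state))
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep probe traffic out of the application logs

    return ProbeHandler


def make_server(state, port=8080):
    """Create (but do not start) the probe server; call serve_forever() on it."""
    return HTTPServer(("0.0.0.0", port), make_handler(state))
```

In this sketch the application would call mark_done() as each startup step completes, with the readinessProbe pointed at /ready and the livenessProbe at /healthz, so a slow start keeps the pod out of rotation without triggering a restart.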
A common mistake is using the same endpoint for both probes: this causes healthy pods to restart simply because they initialize slowly.

Practical fix: add automatic rollback to CI, so that if pods don't become Ready within N seconds, the deployment is rolled back (for example, by running kubectl rollout status with a --timeout and, on failure, kubectl rollout undo). Configure separate probes with appropriate delays. Most applications benefit from an initialDelaySeconds of 10–30 seconds before readiness checks begin.

Mistake #3. `image:latest` in Production: Every Deployment is Unpredictable

The latest tag is not a version. It means 'give me whatever is current at the time of pull.' Every time Kubernetes recreates a pod (node failure, autoscaling, manual restart), it may pull a different image than was used in the last deployment. When upstream updates the base image and your dependency stops working with the new OpenSSL version, you get production incidents from code you didn't change. The cost of this mistake: unpredictable regressions plus security vulnerabilities from uncontrolled base image updates.

The comparison is clear. image: myapp:latest means different pods may run different code (dangerous). A semver tag like image: myapp:v1.4 can be overwritten (risky). But image: myapp:1.4.2-abc1234, combining semver with the Git SHA, is reproducible and traceable (good). For the best guarantee, use a digest: image: myapp@sha256:a1b2c3... ensures the exact same image every time.

Practical fix: in the CI pipeline, tag images with the Git SHA (TAG=$(git rev-parse --short HEAD)). Pin the exact version in deployment.yaml. Set imagePullPolicy: IfNotPresent. Finally, configure an Admission Controller or OPA policy to automatically reject deployments using the latest tag.
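As a rough illustration of the rule such a policy would enforce, the Python sketch below classifies image references. It is not an OPA or admission-controller configuration; the function names image_tag, is_allowed and rejected_images are hypothetical, and the parsing is deliberately simplified. It could, for instance, run as an extra CI check on rendered manifests.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it is pinned by digest.

    A reference with neither tag nor digest resolves to the implicit "latest".
    """
    if "@" in image:  # e.g. myapp@sha256:a1b2c3...
        return None
    last_segment = image.rsplit("/", 1)[-1]  # skip any registry host:port part
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"


def is_allowed(image: str) -> bool:
    """Allow digests and explicit non-latest tags; reject everything else."""
    tag = image_tag(image)
    return tag is None or tag != "latest"


def rejected_images(pod_spec: dict) -> list:
    """List the images in a parsed pod spec that the rule would reject."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [c["image"] for c in containers if not is_allowed(c["image"])]
```

Note that an image with no tag at all is treated as latest, because that is what the container runtime will pull.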
Such a policy takes about an hour to set up and permanently eliminates an entire class of incidents.

Beyond the Big Three: Silent Infrastructure Killers

Your infrastructure may also suffer from: missing PodDisruptionBudget (node upgrades kill all replicas simultaneously), secrets in ConfigMap instead of Kubernetes Secrets (exposed via kubectl describe), no HorizontalPodAutoscaler (pay for idle resources or crash under traffic spikes), no Docker layer caching in CI (15-minute builds instead of 3 minutes), and missing network policies (any compromised pod talks to any service in the cluster).

30-Minute Self-Diagnosis Checklist

Run these checks in your cluster. To find pods with at least one container that has no resource limits:

kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | "\(.metadata.namespace)/\(.metadata.name)"'

Check for pods missing readinessProbe. Search your manifests and running pods for images using the :latest tag, for example with grep ':latest'.

Verify: all pods have resources.requests and resources.limits, separate readinessProbe and livenessProbe for each service, no latest tag in production, PodDisruptionBudget for critical services, and automated rollback if pods don't become Ready within N seconds.

FAQ: Common Questions

Can I set the same values for requests and limits? Technically yes: this gives the Guaranteed QoS class. But if the application has legitimate consumption spikes, its CPU will be throttled (and it will be OOM-killed if it exceeds the memory limit) even when the cluster has sufficient resources. For spiky workloads, use different values.

How often should resource limits be reviewed? After every significant application change and quarterly for the full infrastructure. Use Vertical Pod Autoscaler in recommendation-only mode (updateMode: "Off") for automatic suggestions.

What if a legacy service has no health endpoint? Use an exec probe via a command inside the container, or a tcpSocket probe to check port availability. Either is better than nothing.

Is this relevant for small businesses in Moldova and Romania? Absolutely. These mistakes are critical even for clusters with 3–5 services.
The cost of one hour of downtime for an e-commerce business in Moldova or Romania significantly exceeds the cost of configuring things correctly from the start.

The Bottom Line

These three mistakes aren't exotic edge cases. We encounter them regularly when auditing infrastructure for companies in Moldova, Romania, and across the EU, regardless of team size or stack. The good news: each one can be fixed in a few hours. If you'd like an audit of your infrastructure and CI/CD pipeline, at WebDirect we offer a free IT Health Check with a concrete report on your specific system, no generic advice. Let us analyze your environment and provide a clear, honest assessment with a concrete plan for improvement.
