
DevOps might actually be slowing you down


DevOps is widely assumed to boost the productivity of teams and the value they deliver. In my experience, however, many DevOps implementations are impractical and slow, and the average DevOps setup is more likely to frustrate developers than to help them.

The lack of a standardized approach to DevOps #

Despite the popularity of DevOps, there is still no clear consensus on what exactly it should entail or how it should be practiced. There are only two things that most people agree upon:

  • Continuous Integration/Continuous Deployment (CI/CD)
  • Infrastructure as Code

The developer experience #

On paper, DevOps should be great for developers. CI/CD should automatically verify code changes and run tests, freeing up time spent setting up development environments and configuring dependencies.

Slow image builds with CI #

Unfortunately, developers often face a cumbersome CI/CD process. For instance, Docker in Docker (dind), a popular way to build images inside CI jobs, can be painfully slow: before building an image, the CI usually has to spin up a fresh dind instance, which takes several seconds on GitLab for every single job. Some CIs lack well-established setups for this, which can result in lengthy build times.
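For illustration, a typical GitLab CI job built this way has to declare a dind service that must boot before any `docker build` can start (the job name and registry are hypothetical; the variables follow GitLab's documented dind-over-TLS setup):

```yaml
# Hypothetical GitLab CI job: every run waits for the dind service to boot.
build-image:
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_HOST: tcp://docker:2376
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker build -t registry.example.com/app:$CI_COMMIT_SHA .
    - docker push registry.example.com/app:$CI_COMMIT_SHA
```

Every pipeline pays the dind startup cost again, because the service container is thrown away when the job ends.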

Furthermore, Docker layer cache is not available out of the box in most CI tools, including GitHub Actions, GitLab, Tekton, and Drone. Recent versions of Docker and Kaniko can push layer cache to the registry, but this is still slow for larger images, such as those built on large machine learning frameworks. In many cases, building without a cache is actually faster than pulling the cache for each build. Some CI tools offer ways to save artifacts between jobs, but this is also slow and difficult to set up. Even upgrading a dependency can take longer, since you spend time pulling inline cache only to discover you did not need it.
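Pushing layer cache to the registry, as described above, looks roughly like this with a recent Docker BuildKit (image and cache refs are placeholders); note that every build still has to pull cache manifests and blobs over the network:

```shell
# Export registry cache alongside the image (refs are placeholders).
docker buildx build \
  --cache-from type=registry,ref=registry.example.com/app:buildcache \
  --cache-to type=registry,ref=registry.example.com/app:buildcache,mode=max \
  --tag registry.example.com/app:latest \
  --push .
```

With `mode=max`, intermediate layers are exported too, which makes the cache more useful but also larger and slower to push and pull.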

The only way to achieve fast builds is to use a local layer cache. This can be done with a long-lived buildkitd instance exposed over TCP. By ensuring that all Dockerfiles use RUN with cache mounts, developers can reduce build times from several minutes to seconds.
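A sketch of what that looks like: cache mounts keep package-manager downloads on the buildkitd host across builds (the base image and package manager here are just examples):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
# The pip download cache lives on the buildkitd host and
# survives across builds, unlike ordinary image layers in CI.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
COPY . .
```

CI jobs and developers can then point at the shared daemon instead of a throwaway dind instance, for example via `docker buildx create --driver remote tcp://buildkitd.internal:1234` (address hypothetical).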

Slow test suites with CI #

Developers often sit waiting for slow test suites to complete after rebuilding the containers, even for single-line changes. The CI usually runs the entire test suite instead of the specific unit test the developer is interested in, so the results may not be available until several minutes after the container is built. This slow feedback loop stalls the developer’s workflow.

Difficulty using the CI #

Not only is your typical feedback loop painfully slow, it is also difficult to use. Developers must commit their fix, push it, and then navigate the UI to find the CI run they have just launched. If the run fails, it can fail at any stage before the test stage, and once the test bubble finally begins running, developers must click through to locate their test output, which takes yet more time.

The result is a CI that developers do not want to use when developing. Since the CI pipeline is designed explicitly for the CI runner and requires push/pull credentials, it is not always possible to run it locally. Developers may resort to using docker build locally and manually running tests instead. If they do not want to use Docker, they may choose to install all dependencies traditionally and hope that the tests pass in the CI.
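The local workaround mentioned above typically amounts to something like this (the image tag, test runner, and test file are hypothetical):

```shell
# Build locally and run only the one test you care about,
# instead of waiting for the full CI pipeline to reach the test stage.
docker build -t app-dev .
docker run --rm app-dev pytest tests/test_payment.py -x
```

This gives a feedback loop of seconds, but it also means the CI pipeline is no longer the thing developers actually exercise day to day.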

CD via CI is a Security Vulnerability #

You might argue that my criticisms miss the point: the primary goal of CI/CD testing is not to replace manual testing, but to ensure that the production container passes the tests before it is deployed by CD. However, some companies, including GitLab and GitHub, have previously recommended running kubectl or other privileged tools directly from the CI. This poses significant security risks and is less reliable than claimed, especially when deploying YAML to Kubernetes, and doubly so when you consider that your average Docker-in-Docker-compatible runner runs as a privileged container. The CI may know nothing about the cluster it deploys to or its current state. This leads to my next point about Infrastructure as Code.

Infrastructure as Code via CI is Unreliable #

Until very recently, if you followed the “best practices” advised by the major CI vendors, you may have ended up with a pile of YAML files in Git that omit your production secrets, which you store in CI environment variables instead. You then rely on the CI to build your infrastructure from code, which is inflexible and fragile: you have to manually verify that your code definition matches your actual infrastructure, with no feedback loop between them.

Your CI is a Unique Snowflake of Shell Scripts Embedded in YAML #

Let’s be honest, the typical CI consists of a long YAML file with many shell commands executed in various Docker containers. You may not be certain whether you’re running in a fully bash-compatible shell due to the use of Alpine-based containers. You also heavily rely on environment variables, and your shell scripts may differ from those of others. For instance, how do you version your code? Do you package your application for a specific distro, or do you provide it as a native library? Every project has its own approach, and it can be confusing to figure out how they all fit together. You may dread having to read the extensive shell+YAML soup in the repository to deploy your quick fix, which you’ve already tested without waiting for the CI on your feature branch to finish.

Best Practices for CI/CD and Infrastructure as Code #

To ensure a robust and secure CI/CD process, consider the following practices:

  • Use Buildkitd on a machine with ample disk space for caching. Employ cache mounts in your Dockerfile for npm, go, pip, apt, and so on.
  • Define your Kubernetes manifests using Kustomize and use overlays for production, staging, and local development.
  • Employ skaffold or tilt for rapid feedback during local development, avoiding manual environment setup.
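The Kustomize suggestion above usually ends up as a shared base plus one overlay per environment, roughly like this (paths and comments are illustrative):

```
# Illustrative repo layout; each overlay patches the shared base.
k8s/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── production/kustomization.yaml   # replicas, resource limits
    ├── staging/kustomization.yaml
    └── local/kustomization.yaml        # used by skaffold/tilt

# Render or apply one environment with, e.g.:
#   kubectl apply -k k8s/overlays/staging
```

The same base serves every environment, so local development, staging, and production cannot silently drift apart in their manifests.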

Use Flux or ArgoCD to reconcile IaC into the cluster intelligently and securely, while avoiding “CIOps”.
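With Flux, for example, reconciliation is declared in-cluster rather than pushed from a CI job; a minimal sketch (the names, path, and interval are placeholders):

```yaml
# Hypothetical Flux Kustomization: the in-cluster controller pulls from
# Git and reconciles the cluster toward it; no CI job needs kubectl
# credentials, and drift is corrected continuously.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: app-repo
  path: ./k8s/overlays/production
  prune: true
```

Because the controller owns the feedback loop between Git and the cluster, the CI is reduced to what it is good at: building and testing artifacts.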