
Efficient Kubernetes Log Aggregation with Vector


The problem with ElasticSearch and Loki #

While Kubernetes has become the de facto standard for container orchestration, its distributed nature introduces a problem of its own: the long-term storage of logs for containers which are no longer running in the cluster.

Typical solutions which one may reach for are ElasticSearch (or its sister project OpenSearch), Grafana Loki, AWS CloudWatch or similar.

While these solutions do solve the problem in a way, they introduce a whole new set of problems. They are a maintenance burden, requiring carefully specced nodes to run on, and can become prohibitively expensive for clusters which generate a large amount of logs. In my experience Loki is unable to deal with the high cardinality logs generated by Kubernetes, and OOMs under extremely simple queries. Stripping away high cardinality labels like container id, pod names or pod template hashes may work, but is a lot of work. OpenSearch isn’t really meant for logs and takes an entire datacentre for what I would consider a quite reasonable amount of logs, while everything else is generally too unstable or expensive. Many of the new logging solutions aim to support storing logs in S3, which by all measures and metrics is terribly suited for the constant appending and grepping needed for logs.

Cluster operators may find that these solutions make up a significant fraction of the entire cluster cost, and aren’t as reliable as they claim to be under resource constraints. Moreover, one often finds they cannot store as many logs as they would like to without spending an arm and a leg.

The Cheap, Grepable and “Free” Logs of Old #

Traditionally (before systemd), logs in Linux-based systems are stored in the /var/log directory.

To manage log files, old-school systems use a utility called logrotate, which is typically configured to run daily.

By default, logrotate compresses rotated log files using the gzip compression algorithm, which reduces the size of the log files and saves disk space. Compressed log files are typically named using the pattern <log-filename>.<date>.gz, where <log-filename> is the name of the log file, <date> is the date on which the log file was rotated, and .gz indicates that the file has been compressed using gzip.
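
As a rough sketch, a per-service drop-in under /etc/logrotate.d along these lines produces that behaviour; the path and retention count are illustrative, and dateext is what gives the date-stamped names:

# rotate daily, keep 30 rotations, gzip old files and date-stamp them
/var/log/nginx.log {
    daily
    rotate 30
    missingok
    notifempty
    compress
    dateext
    dateformat .%Y%m%d
}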

For example, if the nginx.log file is rotated on January 1, 2022, the rotated and compressed log file would be named nginx.log.20220101.gz. This convention makes it easy to identify and manage log files by date, and compressed log files can be easily sorted and searched using command-line tools such as zgrep and zless.
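
For instance, pulling every request from a given client out of a month of rotated logs is a one-liner (the IP and file names here are purely illustrative):

zgrep '203.0.113.7' /var/log/nginx.log.202201*.gz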

This is what I set out to achieve with Kubernetes: I wanted all my logs available in a structured directory tree, gzipped and rotated by day.

The goal is to have logs stored on a persistent volume by namespace and “application instance”:

ingress-nginx
├── ingress-nginx-20230415.log.gz
└── ingress-nginx-external-20230415.log.gz
kube-system
├── calico-kube-controllers-20230415.log.gz
├── calico-node-20230415.log.gz
├── kube-apiserver-20230415.log.gz
├── kube-controller-manager-20230415.log.gz
└── snapshot-controller-20230415.log.gz

Vector #

Vector is a lightweight and efficient log forwarding tool written in Rust that is gaining popularity in the Kubernetes ecosystem. It can grab logs from Kubernetes nodes and forward them on to pretty much anything, including Loki, OpenSearch and more.

In this article, we will discuss how to use two separate configurations of vector, which will result in neatly sorted logs written directly to a Persistent Volume Claim (PVC) without the use of any other aggregation system.

If you don’t have Flux installed already, I recommend you install it so you can install the Vector Helm chart with the HelmRelease custom resource instead of running helm directly. Nevertheless, you may just take the values section and install it directly with the helm cli if you like.
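
If you go the helm cli route, the equivalent looks roughly like this; the values file names are placeholders for the values sections shown below:

# once per cluster: add the vector chart repository
helm repo add vector https://helm.vector.dev
helm repo update
# agent-values.yaml / aggregator-values.yaml hold the values sections from the HelmReleases below
helm install vector-agent vector/vector --version 0.24.1 -f agent-values.yaml
helm install vector-aggregator vector/vector --version 0.24.1 -f aggregator-values.yaml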

We first install vector in “agent” mode. This deploys a DaemonSet so that each node runs a vector agent which grabs local log files and forwards them, along with their metadata, to the vector-aggregator service, which we’ll set up in a moment:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: vector-charts
spec:
  url: https://helm.vector.dev
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: vector-agent
spec:
  chart:
    spec:
      chart: vector
      version: 0.24.1
      sourceRef:
        kind: HelmRepository
        name: vector-charts
        namespace: flux-system
  values:
    role: Agent
    customConfig:
      data_dir: /vector-data-dir
      sources:
        kubernetes_logs:
          type: kubernetes_logs
      sinks:
        vector:
          type: vector
          inputs: [kubernetes_logs]
          address: "vector-aggregator:6000"

The config couldn’t really be simpler: vector abstracts all the details of grabbing Kubernetes logs away from us and just does it when the kubernetes_logs source is specified.

Our single sink will be the aggregator, which receives all the logs and writes them to disk. While in theory all agents could mount an NFS volume and write the files directly themselves, we’d run into performance and interleaving issues if we want logs for the same deployment to go into the same file.

Vector as an Aggregator and Writer #

The config for the aggregator is a little bit longer but still relatively compact:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: vector-aggregator
spec:
  chart:
    spec:
      chart: vector
      version: 0.24.1
      sourceRef:
        kind: HelmRepository
        name: vector-charts
        namespace: flux-system
  values:
    role: Aggregator
    customConfig:
      data_dir: /vector-data-dir
      sources:
        vector:
          address: 0.0.0.0:6000
          type: vector
          version: "2"
      transforms:
        sort:
          type: remap
          inputs: [vector]
          source: |-
            if exists(.kubernetes.pod_labels."app.kubernetes.io/instance") {
              .filename = .kubernetes.pod_labels."app.kubernetes.io/instance"
            } else if exists(.kubernetes.pod_labels.app) {
              .filename = .kubernetes.pod_labels.app
            } else if exists(.kubernetes.container_name) {
              .filename = .kubernetes.container_name
            } else {
              .filename = "unlabeled"
            }

            if exists(.kubernetes.pod_namespace) {
              .folder = .kubernetes.pod_namespace
            } else {
              .folder = "unlabeled"
            }

            .pod = .kubernetes.pod_name
            .container = .kubernetes.container_name            
      sinks:
        files:
          type: file
          inputs: [sort]
          encoding:
            codec: json
            only_fields:
              - timestamp
              - message
              - stream
              - pod
              - container
          path: /var/log/k8s/{{ "{{" }} .folder {{ "}}" }}/{{ "{{" }} .filename {{ "}}" }}-%Y%m%d.log
    extraVolumes:
      - name: logging-pvc
        persistentVolumeClaim:
          claimName: logging-pvc
    extraVolumeMounts:
      - name: logging-pvc
        mountPath: /var/log/k8s

We use “vector” as the source, telling the aggregator to listen on port 6000 for logs sent in by the agents running on each node.

The transforms section is where the magic happens. We create an additional filename field equal to the app.kubernetes.io/instance label if it exists. If it doesn’t exist we then try the app label and finally fall back to container_name.

We then set the folder field to the namespace of the pod which logged the line.

For some very short-lived containers (like jobs which complete in seconds) vector is unable to fetch the metadata associated with the log line, so the kubernetes fields are missing. In that case we set both folder and filename to unlabeled so we don’t lose the log line. This problem isn’t limited to vector and happens with fluent-bit as well.

Finally we configure the sink, which writes to a PVC mounted into the aggregator. The manifest above assumes a separately created logging-pvc which can just be mounted on /var/log/k8s in the vector-aggregator pod.
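
A minimal PVC along these lines would do; the size and storage class are placeholders, so adjust them for your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logging-pvc
spec:
  # ReadWriteOnce is enough if the aggregator and the archiver CronJob below
  # land on the same node; otherwise use a ReadWriteMany-capable storage class
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi  # placeholder, size for your log retention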

We then configure the log path to use the folder and filename (namespace and app instance) via vector’s template syntax. Note that in our example we had to escape the templating to work around helm.

Lastly we use only_fields to reduce the size of the log. By default each log line is massive, as it includes pod labels, container hash, replicaset id and more. As we already sort logs into files by namespace and app instance, the only valuable labels are the pod name and container name, which let you filter down log lines a bit more. Of course we also keep the timestamp and the actual log line message.

You can actually configure vector to log straight to gzip or zstd files, but note that this introduces delays and a risk of log file corruption in case of restarts or interruptions. Furthermore it’s often useful to tail -f today’s file in real time, which you can’t do as easily with buffered and compressed files.
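
Following an application live then looks something like this from anywhere the PVC is mounted (the path is illustrative):

tail -f /var/log/k8s/default/blog-$(date +%Y%m%d).log | jq -r .message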

Let’s have a look at the logs that the aggregator is writing. The blog deployment runs three replicas:

▶ k -n default get deploy blog
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
blog   3/3     3            3           219d

And all of their logs are written to the same log file:

% tail -1 blog-20230909.log | jq
{
  "container": "blog",
  "message": "10.244.174.47 - - [09/Sep/2023:18:49:21 +0100] \"GET /favicon-32x32.png HTTP/1.1\" 200 1457 \"https://sko.ai/blog/how-to-run-ha-mosquitto/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36\" \"162.158.239.26\"",
  "pod": "blog-7d75d75449-llpmc",
  "stream": "stdout",
  "timestamp": "2023-09-09T17:49:21.836351874Z"
}
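
Since each line is a self-contained JSON object, jq can also filter on the retained fields, for example to pull a single pod’s messages out of the shared file:

jq -r 'select(.pod == "blog-7d75d75449-llpmc") | .message' blog-20230909.log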

Archiving #

As the vector aggregator creates new files daily, we’ll want to set up some archiving in order to compress old log files and eventually delete them. The following CronJob comes to the rescue:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: archiver
  annotations:
    kustomize.toolkit.fluxcd.io/substitute: disabled
spec:
  successfulJobsHistoryLimit: 3
  suspend: false
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  # 5 minutes after midnight
  schedule: '5 0 * * *'
  timeZone: "Etc/UTC"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: archive
            image: public.ecr.aws/docker/library/busybox:1.36.1@sha256:3fbc632167424a6d997e74f52b878d7cc478225cffac6bc977eedfe51c7f4e79
            command: [sh, -c]
            # Delete all files older than 30 days
            # Compress all files older than 1 day
            # Symlink latest foo-$date.log to foo.log
            args:
            - >
              find /var/logs/k8s -type f -mtime '+30' -delete -print &&
              find /var/logs/k8s -type f -mtime '+0' -name '*.log' -exec gzip {} \; &&
              find /var/logs/k8s -type f -name "*$(date +%Y%m%d).log" | while read latest; do
                target="${latest%-*}.log";
                ln -s -f $(basename "$latest") "$target";
              done              
            resources:
              requests:
                cpu: 10m
                memory: 10Mi
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
            - name: logging-pvc
              mountPath: /var/logs/k8s
          restartPolicy: Never
          volumes:
          - name: logging-pvc
            persistentVolumeClaim:
              claimName: logging-pvc

The cron will:

  • Delete files older than 30 days
  • Compress all but today’s logs
  • Symlink foo-$today.log to foo.log for convenience

Note that we can’t use the real logrotate here, as we rely on vector’s date-based templating. At present vector does not recreate the file if it’s moved and keeps writing to the old file handle. While you could use copytruncate, it’s not recommended, as you’ll likely lose data and have broken JSON lines at the time of rotation.
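
With this layout in place the whole history stays greppable with the usual tooling; something like the following searches both the compressed archives and today’s plain file (the path and search term are illustrative):

# zgrep passes uncompressed files through, so this covers .log and .log.gz alike
zgrep -h 'favicon' /var/log/k8s/default/blog-*.log* | jq -r .message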

Conclusion #

That’s it! Assuming /var/log/k8s on the vector aggregator side is persistent, your logs will be retained in an easily searchable and viewable fashion, allowing you to use all the cli tools you are used to. No clunky web frontend or fat processing nodes necessary.

You can find all the manifests for this post in my GitHub repository.