Monitoring GitLab on Kubernetes

TL;DR

If you want decent monitoring of GitLab deployed via GitLab’s helm charts, skip the built-in Prometheus config and go straight to InfluxDB. Make sure you enable InfluxDB’s UDP interface and replace the autogenerated default retention policy with a 1h policy. To reuse GitLab’s Grafana dashboards, extract the dashboard node from each .json file with jq and change the datasource name to match yours.

Distributed systems have pros and cons

A few weeks ago, I migrated our self-hosted GitLab Omnibus install to a Kubernetes deployment using GitLab’s helm charts. Since moving everyone over to using this instance for their day-to-day activities, the whole system has fallen over once or twice. The beauty of putting GitLab on Kubernetes is that horizontal autoscaling will spin up additional app servers as load increases. The downside is that there are a lot more pieces that can fail compared to an Omnibus install.

After the 2nd outage, I realized I needed better ways to get data about the state of the system. As it was, I had to guess what was going wrong based on the observed end behavior. Well, GitLab’s chart sets up Prometheus monitoring out of the box. Let’s see what we can do with that.

Skip Prometheus, go straight to InfluxDB

At first glance, not much. Prometheus was happily scraping all the GitLab pods and storing the data in time series, but there were no dashboards or charts. The only available option was Prometheus’s built-in Expression Browser, which is oriented toward ad-hoc queries. I took the suggestion from both Prometheus’s and GitLab’s docs and deployed Grafana to handle data visualization.
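For what it’s worth, deploying Grafana was a one-liner. This is only a sketch, assuming the old stable Helm repository and Helm 2’s --name flag; the namespace is just an example, so adjust for whatever chart source and Helm version you’re running:

# install Grafana alongside the rest of the monitoring pieces
helm install --name grafana --namespace monitoring stable/grafana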

I can’t be the only person wanting to monitor GitLab installs, right? GitLab’s documentation points you to https://gitlab.com/gitlab-org/grafana-dashboards, which has a dashboard for monitoring Omnibus installs. GitLab chart deployments aren’t quite the same as Omnibus, but they’re kinda close. Not close enough, it turns out. That dashboard provides only a high-level overview of process uptime, and it doesn’t seem to work with how Prometheus is labeling the data.

The other folder in that repo is a pile of dashboards used by GitLab.com to monitor their public infrastructure. If it’s good enough for them, maybe it’ll be good enough for me. At first, I couldn’t get any of the dashboards to import. Reading the README, I learned they’re formatted as full API requests rather than bare dashboard definitions. A quick pass through jq to extract the dashboard node and I can import them into Grafana! Woohoo!
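The jq pass is a one-liner per file. Assuming the API-request wrapper nests the actual dashboard under a top-level dashboard key (check the files in your checkout), something like this produces JSON that Grafana’s import dialog will accept:

# strip the API-request wrapper, leaving just the dashboard definition
for f in *.json; do
  jq '.dashboard' "$f" > "importable-$f"
done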

Uh, all of the queries are failing. Peeking at a few, it seems GitLab.com has evolved from using Prometheus to InfluxDB, and the dashboards have a mix of the two. Most of the more detailed dashboards seem to use InfluxDB, so let’s go that route. After deploying InfluxDB via Helm (turning on the UDP API and creating a database along the way) and telling GitLab to send data there (https://docs.gitlab.com/ee/administration/monitoring/performance/gitlab_configuration.html), I can see measurements arriving in InfluxDB. Add InfluxDB as a datasource in Grafana, change the datasource name in the dashboards, and a few of the charts start to work! But not all of them.
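On the InfluxDB side, the two pieces that matter are the target database and the UDP listener. This is a sketch: the [[udp]] block is standard InfluxDB 1.x configuration, but the hostname and database name here are just examples, so use whatever your chart values and GitLab configuration actually point at:

# create the database GitLab will write into (InfluxDB 1.x CLI)
influx -host influxdb.monitoring.svc -execute 'CREATE DATABASE gitlab'

# the UDP listener lives in influxdb.conf (or the equivalent chart value):
#   [[udp]]
#     enabled      = true
#     bind-address = ":8089"
#     database     = "gitlab"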

Diving further into these queries, it seems GitLab.com uses InfluxDB’s retention policies and continuous queries rather extensively, and the dashboards depend on the latter for a variety of things. How am I going to recreate all of that? Poking around GitLab’s repos, I stumbled across https://gitlab.com/gitlab-org/influxdb-management, which seems to set all of this up. My first attempt at running it throws an error:

rake aborted!
InfluxDB::QueryError: retention policy duration must be greater than the shard duration
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:91:in `handle_successful_response'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:15:in `block in get'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:53:in `connect_with_retry'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:11:in `get'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/query/core.rb:100:in `execute'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/query/retention_policy.rb:26:in `alter_retention_policy'
/Rakefile:25:in `block in <top (required)>'
/usr/local/bundle/gems/rake-12.3.1/exe/rake:27:in `<top (required)>'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `<main>'
Tasks: TOP => default => policies
(See full trace by running task with --trace)

Huh? I didn’t set up any retention policies? Searching for the error message, I find https://gitlab.com/gitlab-org/influxdb-management/issues/3. Well, at least it’s a known issue. InfluxDB’s default configuration autogenerates a default retention policy when a database is created. That retention policy keeps data forever, which implies a shard duration of 168h. Shard durations cannot be changed, so I deleted that retention policy and created a new one with a 1h retention, since that’s what the GitLab scripts try to do.
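A sketch of the InfluxQL for that swap, assuming the 1.x CLI, a database named gitlab, and the default autogen policy name (check the influxdb-management Rakefile for the exact policy name it expects):

# drop the autogenerated infinite-retention policy; its 168h shard duration is what trips the error
influx -execute 'DROP RETENTION POLICY "autogen" ON "gitlab"'

# recreate it with a 1h retention (which gets a matching 1h shard duration) and keep it as the default
influx -execute 'CREATE RETENTION POLICY "autogen" ON "gitlab" DURATION 1h REPLICATION 1 DEFAULT'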

Finally, I can deploy at least a few of these dashboards (after modifying the datasource name) and they just work. Figuring out which ones are actually useful is an exercise still ahead of me.