Renewing DHCP while keeping a NetworkManager connection up

TL;DR

ps ax | grep dhclient | grep <interface>
kill <pid-of-dhclient>
nmcli device reapply <interface>

Details

I recently changed the DNS setup on my home network and wanted to move all the clients over to the new nameserver before shutting down the old one. My DHCP server had already been updated to hand out the new DNS nameserver, so all I needed to do was refresh the clients.

Because of a few other changes in my network, I had a multi-hour file transfer running between two machines. The common advice for picking up new DHCP settings is to just bring the interface down and back up:

nmcli con down <id>
nmcli con up <id>

Doing so would interrupt the file transfer, which I really, really wanted to avoid. Searching the web turned up only a chorus of “just restart the interface.” Not good enough.

Another approach is to restart dhclient directly:

sudo dhclient -r <interface>
sudo dhclient <interface>

While that works for some system configurations, it doesn’t when NetworkManager is involved. NetworkManager starts a copy of dhclient per interface rather than relying on a single system-wide instance launched from an init script or systemd unit. Even if you tell dhclient which interface to release, it won’t end up stopping the instances NetworkManager launched. So what do we do?

Stopping dhclient is the easy part:

  1. Look up the pid of the dhclient instance for the interface:
    ps ax | grep dhclient | grep <interface>
  2. Kill it:
    kill <pid>
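
If you’d rather script those two steps, here’s a minimal sketch, assuming pkill is available and the interface is named eth0 (both assumptions; adjust for your system):

IFACE=eth0   # hypothetical interface name
# pkill -f matches against the full command line, where the
# NetworkManager-launched dhclient instance names its interface.
pkill -f "dhclient.*${IFACE}"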

Even though dhclient has stopped, the interface is still up: it has an IP, the routes are still valid, and the old DNS configuration is still in use. The only thing that changes with dhclient stopped is that DHCP lease renewals won’t happen. All good so far. What about getting dhclient restarted? We could manually relaunch dhclient with the same arguments NetworkManager used, but that wouldn’t recreate the parent/child relationship NetworkManager had with the original process. We need some way to tell NetworkManager to restart dhclient for us. (Aside: I was rather aghast that NetworkManager doesn’t monitor its dhclient instances and restart them automatically. What happens if dhclient terminates for any other reason?)

Buried deep in the nmcli man page is the nmcli device reapply command. NetworkManager already knows what the intended configuration is, and that hasn’t changed; all it needs to do is reapply the configuration to the device. Since the device is configured for DHCP, a new dhclient instance is started and immediately requests updated lease information from the DHCP server. Since the leased IP and routes haven’t changed and the interface already has them configured, only the DNS nameserver is updated. IPv4 connectivity, and the interface as a whole, is never interrupted.
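
Putting it all together, with a hypothetical interface named eth0:

# Ask NetworkManager to reapply the device's configuration, which
# respawns dhclient and triggers an immediate lease refresh.
nmcli device reapply eth0
# Confirm the new nameserver took effect; where resolver config
# lives varies by distro.
grep nameserver /etc/resolv.conf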

Monitoring GitLab on Kubernetes

TL;DR

If you want decent monitoring of GitLab deployed via GitLab’s helm charts, skip the built-in Prometheus setup and go straight to InfluxDB. Make sure you enable InfluxDB’s UDP interface, and replace the autogenerated default retention policy with a 1h policy. Reuse GitLab’s own Grafana dashboards by extracting the dashboard node from each .json file with jq and changing the datasource name to match yours.

Distributed systems have pros and cons

A few weeks ago, I migrated our self-hosted GitLab Omnibus install to a Kubernetes deployment using GitLab’s helm charts. Since moving everyone over to this instance for their day-to-day work, the whole system has fallen over once or twice. The beauty of putting GitLab on Kubernetes is that horizontal autoscaling will spin up additional app servers as load increases. The downside is that there are a lot more pieces that can fail compared to an Omnibus install.

After the second outage, I realized I needed better ways to get data about the state of the system. As it was, I had to guess at what was going wrong based on the observed end behavior. Well, GitLab’s chart sets up Prometheus monitoring out of the box. Let’s see what we can do with that.

Skip Prometheus, go straight to InfluxDB

At first glance, not much. Prometheus was happily scraping all the GitLab pods and storing the data in time series, but there were no dashboards or charts. The only option available out of the box was Prometheus’s built-in Expression Browser, which is oriented toward ad-hoc queries. I took the suggestion from both Prometheus’s and GitLab’s docs and deployed Grafana to handle data visualization.
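
A stock chart install is all that takes; a rough sketch using the Helm 2-era syntax of the time (the release name is arbitrary):

# Helm 2-era syntax; the stable repo hosted a grafana chart back then.
helm install --name grafana stable/grafana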

I can’t be the only person wanting to monitor GitLab installs, right? GitLab’s documentation points you to https://gitlab.com/gitlab-org/grafana-dashboards, which has a dashboard for monitoring Omnibus installs. GitLab chart deployments aren’t quite the same as Omnibus, but they’re kinda close. Not close enough: that dashboard provides only a high-level overview of process uptime, and it doesn’t seem to work with how Prometheus is labeling the data anyway.

The other folder in that repo is a pile of dashboards GitLab.com uses to monitor its public infrastructure. If it’s good enough for them, maybe it’ll be good enough for me. At first, I couldn’t get any of the dashboards to import. Reading the README, I learned they’re formatted as Grafana API requests rather than bare dashboards. A quick pass through jq to extract the dashboard node and I can import them into Grafana! Woohoo!
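
Grafana’s dashboard API wraps each dashboard in a top-level dashboard key, so the extraction is a one-liner per file. A sketch (the output filenames are my invention):

# Each file is a Grafana API payload: {"dashboard": {...}, ...}.
# Keep just the dashboard object so Grafana's import dialog accepts it.
for f in *.json; do
  jq '.dashboard' "$f" > "importable-${f}"
done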

Uh, all of the queries are failing. Peeking at a few, it seems GitLab.com’s monitoring has evolved over time and the dashboards are a mix of Prometheus and InfluxDB queries. Most of the more detailed dashboards use InfluxDB, so let’s go that route. After deploying InfluxDB via Helm (turning on the UDP API and creating a database along the way) and telling GitLab to send data there (https://docs.gitlab.com/ee/administration/monitoring/performance/gitlab_configuration.html), I can see measurements arriving in InfluxDB. Add InfluxDB as a datasource in Grafana, change the datasource name in the dashboards, and a few of the charts start to work! But not all of them.
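
For reference, the InfluxDB 1.x side amounts to the following; the database name gitlab and the port are assumptions to adjust for your setup:

# In influxdb.conf (or the chart's values), enable the UDP listener:
#   [[udp]]
#     enabled = true
#     bind-address = ":8089"   # InfluxDB's default UDP port
#     database = "gitlab"      # measurements land in this database
# Then create the database the listener writes into:
influx -execute 'CREATE DATABASE gitlab'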

Diving further into these queries, it seems GitLab.com uses InfluxDB’s retention policies and continuous queries rather extensively, and these dashboards depend on the latter for a variety of things. How am I going to recreate those? Poking around GitLab’s repos, I stumbled across https://gitlab.com/gitlab-org/influxdb-management, which seems to set all of this up. My first attempt at running it throws an error:

rake aborted!
InfluxDB::QueryError: retention policy duration must be greater than the shard duration
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:91:in `handle_successful_response'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:15:in `block in get'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:53:in `connect_with_retry'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/client/http.rb:11:in `get'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/query/core.rb:100:in `execute'
/usr/local/bundle/gems/influxdb-0.3.5/lib/influxdb/query/retention_policy.rb:26:in `alter_retention_policy'
/Rakefile:25:in `block in <top (required)>'
/usr/local/bundle/gems/rake-12.3.1/exe/rake:27:in `<top (required)>'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `<main>'
Tasks: TOP => default => policies
(See full trace by running task with --trace)

Huh? I didn’t set up any retention policies. Searching for the error message, I find https://gitlab.com/gitlab-org/influxdb-management/issues/3. Well, at least it’s a known issue. Out of the box, InfluxDB autogenerates a default retention policy when a database is created. That policy keeps data forever, which implies a shard group duration of 168h. Shard durations can’t be changed on an existing policy, so altering the policy down to 1h fails; instead, I deleted the autogenerated policy and created a new one with a 1h duration, since that’s what these GitLab scripts try to do.
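
In InfluxQL, the fix looks roughly like this, assuming the autogenerated policy kept its default name autogen and the database is named gitlab:

# Drop the keep-forever policy with its 168h shard groups...
influx -execute 'DROP RETENTION POLICY "autogen" ON "gitlab"'
# ...and recreate a default policy with the 1h duration the
# influxdb-management scripts expect.
influx -execute 'CREATE RETENTION POLICY "default" ON "gitlab" DURATION 1h REPLICATION 1 DEFAULT'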

Finally, I can deploy at least a few of these dashboards (after modifying the datasource name) and they just work.  Which ones are actually useful is still an exercise ahead of me.