For me, it all kicked off when I started a new job working from home - within the first couple of months, I started to get really annoyed with how shitty my home internet was. Multiple times a day it would go down completely for 2-3 minutes at a time, which sucks when you're in the middle of a video chat, trying to deploy, or just talking in Slack.
To get to the bottom of it, I did what I would do at work - collect as many metrics as possible and try to identify the root cause. If you don't want to read a long essay about how it's all set up, click here to see a snapshot of the dashboard.
At the time, my network looked like:
If I had to do it over again, I would have bought a Unifi USG instead of the EdgeRouter, but I bought it a long time ago before I knew about the magic of Unifi (more on this later).
The EdgeRouter connects to the DG1670, which is set to bridge mode, and the switch and Unifi AP connect to the two remaining EdgeRouter ports, each in its own subnet.
Back when I worked at DigitalOcean, we did almost all of our metrics collection with Prometheus, so I decided to write or use some collectors to get what I needed. Similarly, we used Grafana to show off those metrics, so I spun up an instance of Grafana Cloud and put an instance of nginx with basic auth in front of Prometheus, port forwarded so that I could access it from the internet.
This was the most annoying - it turns out that there aren't a ton of people that care enough to write Prometheus exporters for their $70 cable modem, so I wrote one of my own: https://github.com/nickvanw/dg1670a_exporter
I ran that on my local Linux machine, then added it to be scraped by the local Prometheus:
    scrape_configs:
      ...
      - job_name: 'modem'
        static_configs:
          - targets: ['localhost:9191']
      ...
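For anyone curious what an exporter like this boils down to: it reads numbers off the modem, renders them in the Prometheus text exposition format, and serves them over HTTP at /metrics. Here is a minimal sketch in Python using only the standard library. It is not how dg1670a_exporter is written; the metric name `modem_downstream_power_dbmv`, the `channel` label and the hard-coded readings are all made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_readings():
    """Placeholder for scraping the modem's status page.

    A real exporter would fetch and parse the modem's HTML here; these
    are hard-coded, hypothetical readings keyed by channel.
    """
    return {"1": 2.5, "2": -0.3}


def render_metrics(readings):
    """Render readings in the Prometheus text exposition format."""
    name = "modem_downstream_power_dbmv"  # hypothetical metric name
    lines = [
        f"# HELP {name} Downstream power level per channel.",
        f"# TYPE {name} gauge",
    ]
    for channel, value in sorted(readings.items()):
        lines.append(f'{name}{{channel="{channel}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current readings."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(fetch_readings()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9191):
    """Serve /metrics on the port used in the scrape config above."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the server and blocks; Prometheus then scrapes `localhost:9191/metrics` on its usual interval.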
Unfortunately, the outcome of all of this work was that I replaced this modem with a new one, so I don't have an example of what the metrics look like. The new modem exposes almost the same set, and it looks like this:
This information can be really useful: if any numbers are outside the normal range, you should probably call your ISP and have them send someone out. It really can make a difference.
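A check like that is easy to automate once the numbers are being collected. The sketch below is only an illustration: the reading names are hypothetical, and the ranges are rule-of-thumb values I am assuming, not something taken from this post or from any modem's documentation, so swap in whatever your ISP considers normal.

```python
# Rule-of-thumb ranges, NOT taken from the post or any vendor spec;
# replace them with whatever your ISP or modem documentation recommends.
ASSUMED_RANGES = {
    "downstream_power_dbmv": (-7.0, 7.0),
    "downstream_snr_db": (33.0, float("inf")),
    "upstream_power_dbmv": (35.0, 50.0),
}


def out_of_range(readings, ranges=ASSUMED_RANGES):
    """Return the sorted names of readings that fall outside their range."""
    bad = []
    for name, value in readings.items():
        low, high = ranges[name]
        if not low <= value <= high:
            bad.append(name)
    return sorted(bad)
```

An empty list means everything looks fine; anything else is a list of readings worth mentioning when you call the ISP.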
This was much easier - I originally had planned to scrape the EdgeRouter web API, but that uses websockets and was just too much work. Instead, I enabled SNMP in the UI:
Then I used the Prometheus SNMP Exporter. SNMP is weird and requires some configuration; you can grab the config that I use here. This will collect the basic network metrics defined in RFC 2863 for each interface.
Run the exporter with that config:
Configure that exporter in Prometheus:
    scrape_configs:
      ...
      - job_name: 'snmp'
        static_configs:
          - targets:
            - $ROUTER_URL
        metrics_path: /snmp
        params:
          module: [default]
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9116
      ...
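The relabel_configs part is the bit that confuses people: the "target" in the config is the router, but the thing Prometheus actually talks to is the SNMP exporter, which is told which device to query through a URL parameter. Here is a toy model of the three rules in Python. It is not Prometheus code; `apply_relabeling` and `scrape_url` are hypothetical helpers, and `192.168.1.1` is just a stand-in for $ROUTER_URL.

```python
def apply_relabeling(target, exporter_addr="127.0.0.1:9116"):
    """Mimic the three relabel rules from the config for one target."""
    labels = {"__address__": target}
    # Rule 1: copy the target address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: use that same value as the instance label on the metrics.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at the SNMP exporter instead.
    labels["__address__"] = exporter_addr
    return labels


def scrape_url(labels, metrics_path="/snmp", module="default"):
    """Build the URL that ends up being requested for this target."""
    return (
        f"http://{labels['__address__']}{metrics_path}"
        f"?module={module}&target={labels['__param_target']}"
    )


labels = apply_relabeling("192.168.1.1")
print(scrape_url(labels))
print(labels["instance"])
```

So the request goes to the exporter on port 9116, while the resulting metrics still carry the router's address as their `instance` label.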
This will get you raw network metrics for each port of your EdgeRouter:
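"Raw" here means counters: RFC 2863 exposes things like total octets in and out per interface, and a throughput graph comes from the difference between two samples. Prometheus's rate() does this for you (it treats any decrease as a counter reset rather than a wrap), but the arithmetic is roughly the following sketch, where `throughput_bps` is a hypothetical helper:

```python
def throughput_bps(prev_octets, curr_octets, seconds, counter_bits=64):
    """Average bits per second between two samples of an octet counter.

    Handles a single counter wrap (the counter passing its maximum and
    starting again from zero) between the two samples.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped around once since the previous sample.
        delta += 2 ** counter_bits
    return delta * 8 / seconds
```

For example, `throughput_bps(1000, 2000, 10)` is 1000 octets over 10 seconds, or 800 bits per second. Pass `counter_bits=32` for the older 32-bit counters, which wrap quickly on a busy link.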
At some point I want to collect CPU and Memory metrics too, but with network offload, it's never been a problem.
Ubiquiti is coming out with UNMS, which is supposed to be centralized management software for EdgeRouter equipment, but right now it's in alpha and basically non-deployable: they give you a huge bash script that sets up a million containers with Docker Compose, which is not something I'm really interested in running.
This is the fun part - instead of using SNMP to collect information from my Unifi AP, I used @mdlayher's unifi_exporter, which scrapes from the Unifi controller. I run that on a server somewhere using @jessfraz's Unifi Controller Container, which is super clean and works perfectly. I have traefik front the Web UI, which uses Let's Encrypt to generate a valid certificate and proxies back to the controller.
This exporter is configured with a simple yaml file:
    listen:
      address: :9130
      metricspath: /metrics
    unifi:
      address: $CONTROLLER_URL
      username: $CONTROLLER_USERNAME
      password: $CONTROLLER_PASS
      site:
      insecure: false
      timeout: 5s
Run unifi_exporter -config.file config.yaml, add it to Prometheus (same as the rest, just on a different port) and bask in the metrics:
- signal strength and noise for every device
- throughput for each AP
- number of clients connected per radio
There are too many to screenshot, so check out the second, third and fourth rows in this snapshot.
Overall Connectivity (SmokePing)
When I was working from home, I would basically keep a ping window open on my laptop all the time - when my internet went down, I would tab over to it and watch all of my pings time out until it came back. Maddening.
I wanted to get these graphs into the same place as everything else, so I set up an Influx server on a DigitalOcean Droplet and tried to run infping to send metrics from fping. Unfortunately, that codebase is a bit old, and Influx changed some stuff from 0.9, so I forked it from another fork and updated it to work the way I wanted: https://github.com/nickvanw/infping. It's since been forked by other people, and I've brought some of their changes back to make it work over TLS and some other stuff.
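The core of a tool like this is small: read fping's periodic summary lines, pull out the loss and latency numbers, and write them to Influx as line protocol. Here is a rough Python sketch of the parsing half. It assumes the summary lines look like the sample in the comment, and the measurement, tag and field names are hypothetical, not necessarily what infping itself writes.

```python
import re

# Assumed shape of one fping summary line, e.g.:
#   8.8.8.8 : xmt/rcv/%loss = 10/10/0%, min/avg/max = 1.23/1.45/2.01
# The min/avg/max part is absent when every ping was lost.
SUMMARY = re.compile(
    r"^(?P<host>\S+)\s*: xmt/rcv/%loss = "
    r"(?P<sent>\d+)/(?P<recv>\d+)/(?P<loss>\d+)%"
    r"(?:, min/avg/max = (?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+))?"
)


def to_line_protocol(line, measurement="ping"):
    """Convert one fping summary line to an InfluxDB line protocol point.

    Returns None for lines that do not match. The measurement, tag and
    field names are hypothetical, not necessarily what infping writes.
    """
    m = SUMMARY.match(line.strip())
    if m is None:
        return None
    # Integer fields carry an "i" suffix in line protocol.
    fields = [f"loss={int(m['loss'])}i"]
    if m["avg"] is not None:
        fields += [f"{k}={float(m[k])}" for k in ("min", "avg", "max")]
    return f"{measurement},host={m['host']} {','.join(fields)}"
```

Leaving the latency fields out entirely on 100% loss (rather than writing zeros) keeps outages from dragging the average latency graph down to zero.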
Woo! Check this out:
I've probably written too much already! Part 2 is going to cover how I found out what was wrong with my network, and some of the improvements I made, like adding a second AP in my apartment, and buying a different cable modem because of some hardware / software flaws in the Arris equipment.
If you're curious about anything else in here, feel free to e-mail me at me [at] nvw.io or reach out on Twitter (@nickvanwig).