blip zip blog

so I stand a chance of remembering things

A brief summary of how I plan to get ntfy into a “production-ready” state so it can be exposed to the internet.

  • Access control
    • Deny by default
    • User(s) and/or access token(s) for:
      • Read-only use cases (i.e., mobile client)
      • Read-write use cases (i.e., alertmanager, alert scripts)
  • Fail2ban (rough sketch after this list)
    • Configuration to look for failed login attempts in /var/log/nginx/access.log
    • Block by IP – 24 hours
  • Websockets over HTTPS
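
For the fail2ban item, something along these lines is roughly what I have in mind. It's only a sketch: the jail and filter names, the maxretry value and the failregex are my own guesses rather than anything ntfy-specific, and the regex assumes nginx's default combined log format.

# Filter: match 401/403 responses in nginx's access log
cat <<'EOF' | sudo tee /etc/fail2ban/filter.d/ntfy.conf > /dev/null
[Definition]
failregex = ^<HOST> -.*"(GET|POST|PUT) [^"]*" (401|403)
EOF

# Jail: 24-hour ban after a handful of failures
cat <<'EOF' | sudo tee /etc/fail2ban/jail.d/ntfy.local > /dev/null
[ntfy]
enabled  = true
port     = http,https
filter   = ntfy
logpath  = /var/log/nginx/access.log
maxretry = 5
bantime  = 86400
EOF

sudo systemctl reload fail2ban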

I'll tackle these in order: ACL is required, fail2ban is necessary but not strictly required for day 1, and websockets are a nice-to-have.

At the moment, all my notifications (from my monitoring stack, and SSH login notifications) are sent via email, handled by a Power Automate flow. This works well enough, but I like the idea of handling everything in-house.

ntfy.sh looks like a really solid option. I've done the most basic of basic setups (installed via their package manager) on my production server and exposed it on this domain via my existing nginx reverse proxy. I haven't done enough reading to understand what the default config exposes from an administrative perspective, so for now I've added an IP allowlist to only permit traffic from my home IP.

There's a whole bunch of other config options to sort out, but right now, it looks like this:

nginx

server {
	server_name ntfy.blip.zip;
	allow ;
	deny all;
	location / {
		proxy_pass http://192.168.115.2:8081;
		proxy_set_header Host $host;
		proxy_set_header X-Real-IP $remote_addr;
		proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
		proxy_set_header X-Forwarded-Proto $scheme;
	}
	location = /robots.txt { return 200 "User-agent: *\nDisallow: /\n"; }
}
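
When I get to the websockets item, I believe that location block will also need the usual upgrade headers adding, something like the below (untested as yet):

		proxy_http_version 1.1;
		proxy_set_header Upgrade $http_upgrade;
		proxy_set_header Connection "upgrade";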

ntfy

# ntfy server config file
# Please refer to the documentation at https://ntfy.sh/docs/config/ for details.
base-url: "https://ntfy.blip.zip"

listen-http: ":8081"

# Debian/RPM package users:
#   Use /var/cache/ntfy/cache.db as cache file to avoid permission issues. The package
#   creates this folder for you.
#
# Check your permissions:
#   If you are running ntfy with systemd, make sure this cache file is owned by the
#   ntfy user and group by running: chown ntfy.ntfy <filename>.
#
cache-file: "/var/cache/ntfy/cache.db"

# If set, access to the ntfy server and API can be controlled on a granular level using
# the 'ntfy user' and 'ntfy access' commands. See the --help pages for details, or check the docs.
#
# - auth-file is the SQLite user/access database; it is created automatically if it doesn't already exist
# - auth-default-access defines the default/fallback access if no access control entry is found; it can be
#   set to "read-write" (default), "read-only", "write-only" or "deny-all".
# - auth-startup-queries allows you to run commands when the database is initialized, e.g. to enable
#   WAL mode. This is similar to cache-startup-queries. See above for details.
#
# Debian/RPM package users:
#   Use /var/lib/ntfy/user.db as user database to avoid permission issues. The package
#   creates this folder for you.
#
# Check your permissions:
#   If you are running ntfy with systemd, make sure this user database file is owned by the
#   ntfy user and group by running: chown ntfy.ntfy <filename>.
#
auth-file: "/var/lib/ntfy/user.db"
auth-default-access: "read-write"
# auth-startup-queries:

# If set, the X-Forwarded-For header is used to determine the visitor IP address
# instead of the remote address of the connection.
#
# WARNING: If you are behind a proxy, you must set this, otherwise all visitors are rate limited
#          as if they are one.
#
behind-proxy: true
enable-metrics: false
metrics-listen-http: ":9101"
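
The rough shape of the ACL step, once I flip auth-default-access over to "deny-all": a read-only user for the phone, and a write-only user (plus an access token) for alertmanager and the alert scripts. Usernames and the topic name below are placeholders, but the commands are the ntfy user / ntfy access / ntfy token ones from the docs:

# run on the server; sudo so the CLI can read /etc/ntfy/server.yml and write user.db
sudo ntfy user add phone                      # read-only consumer (prompts for a password)
sudo ntfy access phone alerts read-only

sudo ntfy user add publisher                  # alertmanager and alert scripts
sudo ntfy access publisher alerts write-only
sudo ntfy token add publisher                 # token to use in place of the password

# quick publish test with the token (value redacted):
# curl -H "Authorization: Bearer tk_XXXX" -d "test" https://ntfy.blip.zip/alerts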

In the writefreely section of this post I went out of my way to bind to the wireguard interface used to link my VPS gateway to the “production” machine.

Obviously it just needed to be bound to 0.0.0.0, and it can now be reached over all the interfaces:

[server]
port                 = 8080
bind                 = 0.0.0.0
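
A quick way to confirm it really is listening on everything now:

sudo ss -tlnp | grep ':8080'      # should show 0.0.0.0:8080 rather than just the wireguard address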

I might write this up in full as a useful instruction guide at some stage, but for now I'm documenting the eventual approach and the things that I tripped over along the way.

It's Prometheus, AlertManager, BlackboxExporter and Grafana, with nginx as a reverse proxy. Port 443 is only accessible over the LAN and wireguard networks. It's not exposed to the internet, so there's no authn in place at this point.

NodeExporter is running on all of the monitored nodes, and BlackboxExporter is used to ping the hosts to see if they're alive (should NodeExporter be down).

AlertManager is configured to push alerts via webhook. At the moment, this is going to a MS Power Automate flow.

I've looked at ntfy.sh and will probably spin up a test instance soon.

Trips

  • I've written up much of this config on the fly previously, and I spent far too much time fixing problems I'd created ages ago.
  • Trying to run Prometheus and the associated services as a user other than nobody was frustrating. I'd planned to keep the config hidden from other users on the monitoring box, but because Prometheus runs as nobody, I've had to settle for chmod 644 (-rw-r--r--).
# Create config directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /etc/blackbox
sudo mkdir -p /etc/alertmanager

# Create prometheus rules file
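# (\$ in the alert annotations is escaped so the unquoted heredoc doesn't expand it)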
cat <<EOF | sudo tee /etc/prometheus/rules.yml > /dev/null
groups:
  - name: host-down
    rules:
      - alert: Host down
        for: 1m
        expr: up{job="node_exporter"} == 0 or probe_success{job="blackbox"} == 0
        labels:
          severity: critical
        annotations:
          title: Host is down
          description: The host cannot be contacted
  - name: system-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for the past 2 minutes on {{ $labels.instance }}."

      - alert: HighDriveSpaceUsage
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High disk space usage detected on {{ $labels.instance }}"
          description: "Disk usage is above 80% for the past 2 minutes on {{ $labels.instance }} (mountpoint: {{ $labels.mountpoint }})."

      - alert: HighCPUTemperature
        expr: node_hwmon_temp_celsius > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU temperature detected on {{ $labels.instance }}"
          description: "CPU core temperature is above 50°C for the past 2 minutes on {{ $labels.instance }}."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected on {{ $labels.instance }}"
          description: "Memory usage is above 80% for the past 2 minutes on {{ $labels.instance }}."

EOF

# Create prometheus config file
cat <<EOF | sudo tee /etc/prometheus/prometheus.yml > /dev/null
global:
  evaluation_interval: 1m
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - alertmanager:9093
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: "node_exporter"
    static_configs:
      - targets:
        - prometheus:9090
        - nuc1.internal:9100
        - nuc2.internal:9100
        - nuc3.internal:9100
        - gateway.blip.zip:9100
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
        - nuc1.internal
        - nuc2.internal
        - nuc3.internal
        - gateway.blip.zip
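    # standard blackbox pattern: the target becomes the ?target= param and the instance
    # label, and the actual scrape goes to the exporter itself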
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
EOF
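
# Optional sanity check: run the config (and the rules.yml it references) through
# promtool, using the same image so nothing extra needs installing.
sudo docker run --rm -v /etc/prometheus:/etc/prometheus:ro --entrypoint promtool \
  prom/prometheus:latest check config /etc/prometheus/prometheus.yml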

# Create blackbox config file
cat <<EOF | sudo tee /etc/blackbox/blackbox.yml > /dev/null
modules:
  tcp_connect:
    prober: tcp
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
EOF

# Read in value for webhook URL
read -p "Enter webhook URL for alertmanager: " webhook_url_alertmanager

# Create alertmanager config file
cat <<EOF | sudo tee /etc/alertmanager/alertmanager.yml > /dev/null
global:
  resolve_timeout: 1m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 24h
  receiver: 'powerautomate'

receivers:
- name: 'powerautomate'
  webhook_configs:
  - url: '$webhook_url_alertmanager'
EOF
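
# Optional: amtool from the alertmanager image can validate this config too.
sudo docker run --rm -v /etc/alertmanager:/etc/alertmanager:ro --entrypoint amtool \
  prom/alertmanager:latest check-config /etc/alertmanager/alertmanager.yml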


sudo chmod 755 /etc/prometheus
sudo chmod 644 /etc/prometheus/*

sudo chmod 755 /etc/blackbox
sudo chmod 644 /etc/blackbox/*

sudo chmod 755 /etc/alertmanager 
sudo chmod 644 /etc/alertmanager/*


# Create nginx config. We're assuming that certs already exist in /etc/nginx/certs.
cat <<EOF | sudo tee /etc/nginx/nginx.conf > /dev/null
events { 
	worker_connections 1024;
}
http {
	
	server {
		listen 443 ssl;
		server_name prometheus.blip.zip;
		ssl_certificate_key     /etc/nginx/certs/key.pem;
		ssl_certificate         /etc/nginx/certs/cert.pem;
		ssl_protocols           TLSv1.2 TLSv1.3;

		location / {
				proxy_pass http://prometheus:9090;
		}
	}

	server {
		listen 443 ssl;
		server_name alertmanager.blip.zip;
		ssl_certificate_key     /etc/nginx/certs/key.pem;
		ssl_certificate         /etc/nginx/certs/cert.pem;
		ssl_protocols           TLSv1.2 TLSv1.3;

		location / {
				proxy_pass http://alertmanager:9093;
		}
	}

	server {
		listen 443 ssl;
		server_name grafana.blip.zip;
		ssl_certificate_key     /etc/nginx/certs/key.pem;
		ssl_certificate         /etc/nginx/certs/cert.pem;
		ssl_protocols           TLSv1.2 TLSv1.3;

		location / {
				proxy_pass http://grafana:3000;
				proxy_set_header Host \$host;
				proxy_set_header X-Real-IP \$remote_addr;
				proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
				proxy_set_header X-Forwarded-Proto \$scheme;
		}
	}
}
EOF
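
# Optional: have nginx parse the config before the stack comes up
# (this assumes the certs mentioned above are already in /etc/nginx/certs).
sudo docker run --rm -v /etc/nginx:/etc/nginx:ro nginx:latest nginx -t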

# Create docker-compose file for the whole monitoring stack
sudo mkdir -p /etc/docker-scripts/monitoring
cat <<EOF | sudo tee /etc/docker-scripts/monitoring/docker-compose.yml > /dev/null
---
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    volumes:
      - prometheus_data:/prometheus
      - /etc/prometheus:/etc/prometheus:ro
    ports:
      - '9090'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - '3000'
    networks:
      - monitoring
    depends_on:
      - prometheus

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: always
    volumes:
      - /etc/blackbox:/etc/blackbox:ro
    command:
      - '--config.file=/etc/blackbox/blackbox.yml'
    ports:
      - '9115'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: always
    volumes:
      - alertmanager_data:/alertmanager
      - /etc/alertmanager:/etc/alertmanager:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - '9093'
    networks:
      - monitoring

  nginx:
    image: nginx:latest
    container_name: nginx
    restart: always
    volumes:
      - /etc/nginx/:/etc/nginx/:ro
    ports:
      - '443:443'
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: "10.60.21.0/24"
          gateway: "10.60.21.1"
EOF
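
With all of that in place, bringing the stack up is just a compose up from the directory created above, plus a quick poke at one of the proxied endpoints (assuming the *.blip.zip names resolve to this box; swap in docker-compose on older installs):

cd /etc/docker-scripts/monitoring
sudo docker compose up -d
sudo docker compose ps                      # everything should show as up

# Prometheus has a readiness endpoint we can hit through the nginx proxy
curl -sk https://prometheus.blip.zip/-/ready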

It turns out the changes to the Porkbun DNS API URL were updated in the pfSense acme package, but they don't seem to have made their way into the version that's being downloaded by my firewall.

I've given up trying to work out how pfSense package inheritance works, and resorted to just updating the script in the acme package myself.

chmod 775 /usr/local/pkg/acme/dnsapi/dns_porkbun.sh
sed -i 's/https:\/\/porkbun.com\//https:\/\/api.porkbun.com\//g' /usr/local/pkg/acme/dnsapi/dns_porkbun.sh
chmod 555 /usr/local/pkg/acme/dnsapi/dns_porkbun.sh
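
A quick grep to confirm the substitution landed:

grep -n 'api\.porkbun\.com' /usr/local/pkg/acme/dnsapi/dns_porkbun.sh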

All works a treat now. It might break if the package gets reinstalled or anything, but it'll do for now.

Considering using pfSense as a place to automate cert issuance and push the certs out from there.

There's an acme.sh package for pfSense, which looks like a zero-maintenance option for keeping certs in place by using the Porkbun DNS API.

I've used acme.sh with Porkbun before, so I'm confident in that respect. However, when trying to test it out against the Let's Encrypt staging environment, it's bombing out.

Let's have a look at the logs:

[Mon Jan 20 21:08:03 UTC 2025] POST
[Mon Jan 20 21:08:03 UTC 2025] _post_url='https://porkbun.com/api/json/v3/dns/retrieve/_acme-challenge.pfsense.blip.zip'
[Mon Jan 20 21:08:03 UTC 2025] body='{"apikey":"...","secretapikey":"..."}'
[Mon Jan 20 21:08:03 UTC 2025] _postContentType
[Mon Jan 20 21:08:03 UTC 2025] Http already initialized.
[Mon Jan 20 21:08:03 UTC 2025] _CURL='curl --silent --dump-header /tmp/acme/pfsense-blip-zip-prod/http.header  -L  -g '
[Mon Jan 20 21:08:04 UTC 2025] _ret='0'
[Mon Jan 20 21:08:07 UTC 2025] response='<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
'

The API hostname was updated to api.porkbun.com some time ago, and the old one was only kept running until 2024-12-01.

The Porkbun DNS script in acme.sh was updated to use the correct domain back in October, but the upstream changes haven't made their way into the pfsense package yet.

If I find time tomorrow, I'll have a look for the source and see if there are plans to merge it in.

I've never transferred a domain before. After working out how to use Porkbun's DNS for external domains, I'm fairly sure it should go through without a hitch.

Let's see how this goes...

Had another stab at it, and made sure that the Porkbun DNS was actually working before I updated the nameservers. Everything's behaving itself nicely now, so I'll do a bit more reading about domain transfers and make sure I have all my ducks in a row.

A series of cockups

My main domain that I use for personal stuff is up for renewal, and the price has gone up dramatically. As the WHOIS data shows, blip.zip is registered with Porkbun, and their transfer price is fairly reasonable.

Porkbun allow domains that are registered elsewhere to use their nameservers. Before I transferred, I planned to update the NS to use Porkbun first.

I added the domain, duplicated all my DNS records from my existing provider, double-checked the lot, and updated the nameservers for the domain.

Cockup #1: I didn't check that the new nameservers were returning records for the domain before I made the change.

After waiting half an hour for stuff to start working, and no records being resolved by the new nameservers, I chickened out (not wanting to risk mail not being received etc.) and hastily reverted the change.

Cockup #2: I didn't check that the records that existed prior to the NS change had been retained. Spoiler: they had not.

After watching TV for a couple of hours, I sent a test email to see if it would be received properly. It wasn't. The domain provider had just reinstated its defaults (i.e., nowt). Thankfully I'd kept an export of the raw data, so restoring it only took a few seconds.

So why was nothing being returned by Porkbun's DNS?

I'd set up all the records properly, and the NS values were all correct. What I'd failed to do was go through the options in the console and tick the “Enable DNS” button.

Live and learn and get things wrong and probably get them wrong again and hopefully learn again.