Version updates and more information in the README about SFO.

Version 1.5.1.
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog]() and this project adheres to [Semantic Versioning](), at least as far as an infrastructure project can.
## [Unreleased]
## [1.5.1] - 2021-04-11
### Added
- Add a DaemonSet to disable transparent huge pages on stable nodes (because they run databases)
- Log shipping from SFO to GKE via [](
### Changed
- Update Boxomon to v0.4.0
- Update []( to enable static inline image caching and easy upstream updates with a submodule.
- Tweaks to SQL DB deployment and tooling in [palpantlab-sql-dbs](./palpantlab-sql-dbs)
- Version bumps and config tweaks for: NGINX ingress
- Update GitLab to v13.9.1
- Update kubernetes-ui/dashboard to v2.0.4
- Update Thanos to v0.18.0
### Removed
## [1.5.0] - 2020-06-10
Lastly, I have split up the single mono-repo into individual repos to support si
- HAProxy, all instances
- Most of the 9s in my previous uptime. But they will be back, and better than ever!
I currently run two Kubernetes clusters: one GKE cluster in Google Cloud, and one on-prem.
## palpantlab-gke-west1-b-01
This cluster is a Google Kubernetes Engine (GKE) cluster with several adjacent pieces of infrastructure, including Google CloudSQL databases and Google Cloud Storage for object storage.
Kubernetes Version: `1.18.16-gke.502`
Master IP: ``
All of this actually started with me hosting a private GitLab instance, https://
Prior to the move to GKE there were periods of significant instability and high risk of data loss. However, data is now stored on GCE Persistent Disks, with LFS artifacts and daily backups (via Kubernetes CronJob) stored to GCS with object versioning enabled, greatly reducing the risk of data loss.
GitLab cloudnative helm chart: `gitlab-4.9.1`
GitLab version: `v13.7.1`
#### GitLab CI Runner
In addition to GitLab itself, shared GitLab runners have been created for use with any project hosted on the instance. The shared runners use the Kubernetes executor to launch a build pod for each triggered build for projects that have a `.gitlab-ci.yml` file in the root of their repository. By default these Kubernetes executors now use Kaniko as their base image, avoiding the need for Docker-in-Docker (DinD), which is not compatible with GKE on COS. The GitLab runner continuously polls the GitLab server over the public internet (using a preshared token for authorization) for jobs, and executes those jobs by creating Kubernetes pods and `exec`ing into them to run the jobs' commands.
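As an illustration, a minimal `.gitlab-ci.yml` for such a project might look like the following. The job name is hypothetical; the Kaniko image and `CI_*` variables are standard GitLab/Kaniko conventions, and the tags match the GKE shared runner described below.

```yaml
# Hypothetical build job; the runner tags select the GKE shared runner.
build-image:
  tags: [gke, k8s]
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
        --context "$CI_PROJECT_DIR"
        --dockerfile "$CI_PROJECT_DIR/Dockerfile"
        --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```

Because Kaniko builds the image entirely in userspace, no privileged DinD sidecar is needed in the build pod.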
The GKE runner is labeled `gke`, `k8s`, `west1-b`.
For more information on GitLab CI, see [here]().
GitLab runner version: `gitlab/gitlab-runner:alpine-v13.9.0`
#### Personal Websites
I deploy a handful of static websites reliably on the cluster, providing reliability via replication and load balancing (via NGINX and Kubernetes) and low latency (courtesy mostly of Google Cloud's networking). These include my personal website, []( (a Ghost-based stateful blog), and my brother's, [](, a Jekyll-based static blog.
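The pattern for each site is roughly the following sketch (names and image are placeholders, not the real manifests): several replicas behind a single Service, which the NGINX ingress then fronts.

```yaml
# Hypothetical static-site Deployment: 3 replicas for redundancy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-site
spec:
  replicas: 3
  selector:
    matchLabels: {app: static-site}
  template:
    metadata:
      labels: {app: static-site}
    spec:
      containers:
        - name: web
          image: nginx:1.19-alpine   # assumed: site content baked into the image
          ports:
            - containerPort: 80
---
# Service that load-balances across the replicas.
apiVersion: v1
kind: Service
metadata:
  name: static-site
spec:
  selector: {app: static-site}
  ports:
    - port: 80
```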
Kubernetes dashboard version: `v2.0.4`
GitLab deploys a nicely-configured monitoring stack as part of its recent Helm charts. By default this comes with an alertmanager, a Grafana server, and a configured Prometheus server. The Grafana pool uses a GCP CloudSQL instance as a backend, so it is possible to make live updates to dashboards and have them saved for all instances in the Deployment. It also supports LDAP, Google OAuth, and GitLab OAuth for accounts. I added a public, authenticated ingress for the Prometheus server at []( The default alertmanager deployment did not configure HA, so I added this and now run a cluster of 3 HA alertmanagers as a StatefulSet. Lastly, I integrated [Thanos]( into this system - Thanos provides long-term storage and historical queries by backing up blocks from the Prometheus server's disk directly to GCS, after which several other components allow queries to be answered from those GCS blocks. This is highly scalable - GCS is cheap and reliable and provides convenient export to other storage systems (in this case, Dropbox); the block-querying components (Thanos Store and Thanos Query) are expensive, but horizontally scalable to a high degree. With this, I am able to keep only the last week or two of data on the Prometheus instance itself, but can still make [long-range queries](,%22now%22,%22Thanos%22,%7B%22expr%22:%22avg(,%22context%22:%22explore%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D) (requires login).
stable/prometheus Chart: `10.4.0`
Thanos Store, Query, Bucket, and Sidecar: ``
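For reference, the sidecar wiring can be sketched roughly like this. The container name, mount paths, and objstore file below are assumptions for illustration, not the actual manifest.

```yaml
# Hypothetical sidecar container in the Prometheus pod.
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.18.0
  args:
    - sidecar
    - --tsdb.path=/prometheus                      # volume shared with Prometheus
    - --prometheus.url=http://localhost:9090       # same-pod Prometheus
    - --objstore.config-file=/etc/thanos/gcs.yaml  # points at the GCS bucket
```

The sidecar uploads completed TSDB blocks to the bucket, and Thanos Store/Query answer historical queries from those blocks.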
I also already had uptime monitoring checks configured via Stackdriver, a separate Google product. Stackdriver is great for this: the uptime checks are frequent and geographically distributed, they track latencies, and the alerting system is very powerful. Unfortunately, metric retention in Stackdriver is limited to only 6 weeks (42 days!) - too short to see changes over time - and the dashboarding and query language are also limited. To get around that, I deployed [stable/stackdriver-exporter]( to export these uptime check metrics from Stackdriver to Prometheus, where I have more control over retention and aggregation. stackdriver-exporter also supports exporting arbitrary Stackdriver GCP metrics (like log-derived metrics, compute metrics, SQL server metrics, etc.) - however, API usage grows rapidly if you export these metrics on a regular basis, and Google Cloud charges a high price for metric-read APIs beyond a threshold, so for now I am only exporting the uptime check metrics to limit API usage.
I recently added [Loki](), a scalable log-storage and search solution, to this cluster. It consumes all logs from the Kubernetes cluster via a daemon, `promtail`, that runs on every node and ships logs directly to the Loki server. This server compresses, indexes, and stores these logs on a local disk (though object-storage solutions are also available). This appears to be very space-efficient: several months of logs take up only a few gigabytes. Logs stored in Loki are queryable natively via Grafana as a data source.
Getting logs from the SFO cluster was more difficult. There is now a public HTTPS endpoint for Loki with basic auth, and with this I ship logs up to the cloud from SFO as well, so that I can view logs for all containers in both clusters via the same interface.
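A sketch of the `promtail` client configuration used for shipping to a remote basic-auth endpoint; the hostname and credential path below are assumed placeholders, and the scrape configuration is omitted.

```yaml
# Hypothetical promtail `clients` section on the SFO nodes.
clients:
  - url: https://loki.example.com/loki/api/v1/push
    basic_auth:
      username: promtail
      password_file: /etc/promtail/password
```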
#### Backups
I run several types of CronJob in GKE to make sure I don't lose data easily:
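As a sketch, one such backup job could look like the following; the name, image, schedule, and bucket are all assumed for illustration.

```yaml
# Hypothetical nightly database backup piped straight to GCS.
apiVersion: batch/v1beta1   # the CronJob API group in Kubernetes 1.18
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 9 * * *"     # daily at 09:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: my-backup-image:latest   # assumed: has pg_dump + gsutil
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip |
                  gsutil cp - gs://my-backup-bucket/db-$(date +%F).sql.gz
```

With object versioning enabled on the bucket, even an overwritten or deleted backup object remains recoverable.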
A single-node Kubernetes master running "on-prem". This single-node "cluster" ha
One major limitation of the "on-prem" compute is data access - GCP charges a premium for reading data out of the cloud, so any task or service that needed to download data from the cloud to the SFO cluster would pay for that egress: about $0.12/GB. This unfortunately prevents me from running services like the Thanos Compactor on the cluster, because they would pull so much data.
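To make that pricing concrete, a back-of-envelope calculation; the monthly volume used here is an assumed figure for illustration only.

```python
# Rough egress-cost estimate at GCP's ~$0.12/GB internet egress rate.
EGRESS_USD_PER_GB = 0.12

def egress_cost(gb: float) -> float:
    """Return the dollar cost of pulling `gb` gigabytes out of GCP."""
    return gb * EGRESS_USD_PER_GB

# A compactor re-reading ~500 GB of blocks per month would cost ~$60/month,
# every month - more than the rest of the cluster's marginal cost.
print(f"${egress_cost(500):.2f}")
```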
### Kubernetes Nodes
Ubuntu 20.04 running on an Intel i9-9900K, custom build.
##### Hardware
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
Memory: 2x16GiB DDR4 2400MHz
Disk: 2x Samsung SSD 970 EVO 500GB (nvme0n1, nvme1n1); Samsung SSD 850 EVO 1TB (sdb) + Samsung SSD 860 EVO 2TB (sda) in RAID1.
```
sda 8:0 0 1.8T 0 disk
└─md126 9:126 0 931.5G 0 raid1
  ├─md126p1 259:2 0 100M 0 part /boot/efi
  ├─md126p2 259:3 0 16M 0 part
  ├─md126p3 259:4 0 735.6G 0 part
  ├─md126p4 259:5 0 499M 0 part
  └─md126p5 259:6 0 195.3G 0 part /
sdb 8:16 0 931.5G 0 disk
└─md126 9:126 0 931.5G 0 raid1
  ├─md126p1 259:2 0 100M 0 part /boot/efi
  ├─md126p2 259:3 0 16M 0 part
  ├─md126p3 259:4 0 735.6G 0 part
  ├─md126p4 259:5 0 499M 0 part
  └─md126p5 259:6 0 195.3G 0 part /
nvme0n1 259:0 0 465.8G 0 disk /home
nvme1n1 259:1 0 465.8G 0 disk
```
```
Label: none  uuid: af5e3ee6-40c6-4dc0-82f3-5f6a025f842c
    Total devices 2 FS bytes used 83.62GiB
    devid 1 size 465.76GiB used 86.03GiB path /dev/nvme0n1
    devid 2 size 465.76GiB used 86.03GiB path /dev/nvme1n1
```
Drivers: 460.39, stable
```
$ nvidia-smi
Sat Apr 10 22:45:38 2021
| NVIDIA-SMI 460.39      Driver Version: 460.39      CUDA Version: 11.2    |
| GPU  Name       Persistence-M| Bus-Id       Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|                              |                     |              MIG M.  |
|   0  GeForce RTX 208...  On  | 00000000:01:00.0 Off|                 N/A  |
| 54%   67C   P2  124W / 130W  |  683MiB /  7979MiB  |    97%      Default  |
|                              |                     |                 N/A  |
```
### Kubernetes infrastructure
This one-node Kubernetes cluster uses [microk8s](), which is easily installed on Ubuntu via `snap`. Enabled on this cluster are the following microk8s [addons](): dns, gpu, helm, metrics-server, rbac, storage.
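Setting this up is roughly the following pair of commands; the snap channel here is an assumption based on the cluster's Kubernetes version.

```
# Assumed installation/enable commands for this single-node cluster.
sudo snap install microk8s --classic --channel=1.18/stable
sudo microk8s enable dns gpu helm metrics-server rbac storage
```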
Kubernetes version: `v1.18.17`
### Kubernetes-hosted services
#### HTTP reverse-proxy and LetsEncrypt
The DS920+ provides simple HTTP reverse-proxy functionality with LetsEncrypt certificate renewal and HTTPS termination. I use this to serve a few websites that generally have reasonable uptime, but poor latency.
CloudSQL proxy version: ``
#### GitLab CI Runner
The GitLab runner that interacts with `palpantlab-sfo` is tagged `k8s`, `sfo`, and `gpu`. It uses nvidia-device-drivers to expose the NVIDIA GPU directly to automated builds, should that be necessary (e.g. to run tests that use tensorflow-gpu). That runner is maintained with the rest of the GitLab deployment, in [palpantlab-gitlab](
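A job opts into this runner via its tags; a hypothetical example (the image and test command are assumed, not taken from a real project here):

```yaml
# Hypothetical CI job pinned to the SFO GPU runner.
gpu-test:
  tags: [k8s, sfo, gpu]
  image: tensorflow/tensorflow:2.4.1-gpu   # assumed image
  script:
    - python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```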
GitLab runner version: `gitlab/gitlab-runner:alpine-v13.9.0`
#### transmission-web
I deployed the open-source project [haugene/docker-transmission-openvpn]( as a Kubernetes Deployment and Service on the on-prem cluster. It has access to the NAS as well as to the filesystem on the local node, and it has sufficient privilege to open an OpenVPN tunnel, so all torrent traffic goes through NordVPN. The deployment of this webapp is managed at [palpantlab-tranmission]( In the most recent change I added a public portal to the web UI. Previously the service was only accessible to people with 1) Kubernetes cluster credentials and 2) VPN credentials (which was only me). Now the service is public, but protected with [pusher/oauth2_proxy]( and can only be accessed with a Google email that belongs to the appropriate group. That web UI is accessible at []( (but of course the only thing the page shows is a "Sign in with Google" button).
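The proxy sits in front of the web UI roughly as follows. The group, upstream, and addresses are assumed placeholders; restricting by Google group additionally requires admin-email and service-account flags, omitted here.

```yaml
# Hypothetical oauth2_proxy container args in front of the transmission UI.
args:
  - --provider=google
  - --email-domain=*
  - --google-group=transmission-users@example.com
  - --upstream=http://transmission-web:9091
  - --http-address=
```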
The Prometheus process uses a slice of disk via a LocalPersistentVolume, which i
stable/prometheus Chart: `10.4.0`
Thanos Sidecar: ``
#### Folding@Home
I deployed [Folding@Home](, a distributed computing project, to run on the local cluster and utilize the RTX 2080 during the day, under [folding-at-home](