Updates to websites, adding Prometheus to SFO, and README.

parent eea7c9ac
images/** filter=lfs diff=lfs merge=lfs -text
@@ -7,37 +7,41 @@
- [Added](#added)
- [Changed](#changed)
- [Removed](#removed)
- [1.3.0 - 2020-01-04](#130-2020-01-04)
- [1.4.0 - 2020-01-11](#140-2020-01-11)
- [Added](#added-1)
- [Changes](#changes)
- [Changed](#changed-1)
- [Removed](#removed-1)
- [1.2.0 - 2019-11-13](#120-2019-11-13)
- [1.3.0 - 2020-01-04](#130-2020-01-04)
- [Added](#added-2)
- [Changes](#changes-1)
- [Changes](#changes)
- [Removed](#removed-2)
- [1.1.1 - 2019-09-24](#111-2019-09-24)
- [1.2.0 - 2019-11-13](#120-2019-11-13)
- [Added](#added-3)
- [Changes](#changes-2)
- [Changes](#changes-1)
- [Removed](#removed-3)
- [1.1.0 - 2019-08-23](#110-2019-08-23)
- [1.1.1 - 2019-09-24](#111-2019-09-24)
- [Added](#added-4)
- [Changes](#changes-3)
- [Changes](#changes-2)
- [Removed](#removed-4)
- [1.0.0 - 2019-07-01](#100-2019-07-01)
- [1.1.0 - 2019-08-23](#110-2019-08-23)
- [Added](#added-5)
- [Changes](#changes-4)
- [Changes](#changes-3)
- [Removed](#removed-5)
- [0.2.0 - 2018-03-05](#020-2018-03-05)
- [1.0.0 - 2019-07-01](#100-2019-07-01)
- [Added](#added-6)
- [Changes](#changes-5)
- [Changes](#changes-4)
- [Removed](#removed-6)
- [0.1.1 - 2017-12-05](#011-2017-12-05)
- [0.2.0 - 2018-03-05](#020-2018-03-05)
- [Added](#added-7)
- [Changes](#changes-5)
- [Removed](#removed-7)
- [0.1.1 - 2017-12-05](#011-2017-12-05)
- [Added](#added-8)
- [Changes](#changes-6)
- [0.1.0 - 2017-11-25](#010-2017-11-25)
- [Added](#added-8)
- [Added](#added-9)
- [Changes](#changes-7)
- [Removed](#removed-7)
- [Removed](#removed-8)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -49,13 +53,21 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
## [Unreleased]
### Added
### Changed
### Removed
## [1.4.0] - 2020-01-11
### Added
- [Cronjobs](https://gitlab.palpant.us/justin/palpantlab-sql-dbs/blob/master/deploy/kubectl-apply/gke/backup-cronjob.yaml) to run hourly backups for CloudSQL databases
- Added a basic secret backup-to-GCS script, since I've been keeping a lot of useful secrets locally.
- Added a CronJob to run `thanos bucket verify` and ensure data is being uploaded correctly by the sidecar and compactor.
- Deployed Prometheus on ubuntu-node-01 with upload to Thanos store.
### Changed
- Overdue README updates
- ubuntu-node-01.sfo.palpant.us GPU upgrade to NVIDIA RTX 2080 Super.
- Prevented thanos-compact from running in parallel
- Increased replicas for the personal websites (downtime due to preemption was too high with two replicas) and tweaked their build specs with some of what I've learned since I last touched them
- Cut back Thanos Store replicas and increased their resources.
### Removed
## [1.3.0] - 2020-01-04
@@ -230,8 +242,9 @@ Lastly, I have split up the single mono-repo into individual repos to support si
- HAProxy, all instances
- Most of the 9s in my previous uptime. But they will be back, and better than ever!
[Unreleased]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.3.0...HEAD
[1.2.0]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.2.0...v1.3.0
[Unreleased]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.4.0...HEAD
[1.4.0]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.3.0...v1.4.0
[1.3.0]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.2.0...v1.3.0
[1.2.0]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.1.1...v1.2.0
[1.1.1]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.1.0...v1.1.1
[1.1.0]: https://gitlab.palpant.us/justin/palpantlab-infra/compare/v1.0.0...v1.1.0
@@ -3,6 +3,8 @@
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*
- [What Is It?](#what-is-it)
- [How it's going](#how-its-going)
- [Before the cloud](#before-the-cloud)
- [Clusters](#clusters)
- [palpantlab-gke-west1-b-01](#palpantlab-gke-west1-b-01)
- [Kubernetes Nodes](#kubernetes-nodes)
@@ -15,6 +17,8 @@
- [Boxomon](#boxomon)
- [CloudSQL databases proxies](#cloudsql-databases-proxies)
- [Kubernetes Dashboards](#kubernetes-dashboards)
- [GitLab-based monitoring (Prometheus/Alertmanager/Grafana/Thanos)](#gitlab-based-monitoring-prometheusalertmanagergrafanathanos)
- [Backups](#backups)
- [palpantlab-sfo](#palpantlab-sfo)
- [Kubernetes Nodes](#kubernetes-nodes-1)
- [193.168.0.31/ubuntu-node-01.sfo.palpant.us](#193168031ubuntu-node-01sfopalpantus)
@@ -37,6 +41,7 @@
- [CloudSQL databases proxies](#cloudsql-databases-proxies-1)
- [GitLab CI Runner](#gitlab-ci-runner-1)
- [transmission-web](#transmission-web)
- [Prometheus/Thanos/Alertmanager](#prometheusthanosalertmanager)
- [Copyright and License](#copyright-and-license)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -46,10 +51,21 @@ This project will hold the files I use to describe the infrastructure for hostin
With version 1.0.0 of this repo and the changes of the last year, I have migrated to a hybrid-cloud approach. The on-prem deployment has been greatly simplified, with all but one node of the Kubernetes cluster removed. A Google Cloud Kubernetes cluster has been created (using GKE) which uses a flexible number of preemptible nodes to run most of the services that were running on-prem. Many services that were needed to make the on-prem cluster more useful (like dynamic storage) have been shut down, as GCP provides more stable versions of these. The two clusters are tied together, and all services on them are managed with configuration-as-code via GitLab, which now runs as a cloud-native service in GKE. This change was motivated by a desire for increased stability and less hands-on life support, which I found myself doing a lot of as my cheap hardware kept hitting hardware and storage failures.
Previously, I had set up a three-node Kubernetes cluster on an odd collection of hardware I had lying around. Persistence was handled through NFS, iSCSI, and a local Ceph installation. Ingress was handled with nginx-ingress, and kube-lego was set up to generate certificates. Monitoring was done with Prometheus Operator, which bundles scalable Prometheus deployments, highly available alertmanager, and stateless Grafana while adding some Custom Resource Definitions that abstract away scraping configs. Gitlab was running in the cluster, along with a number of websites.
See the [CHANGELOG](CHANGELOG.md) for what has been done so far, and when.
## How it's going
Moving to a multi-node cloud environment, combined with alerting, immediately increased the availability of all the simple sites. With preemptible nodes, restarts are frequent, so sites with slower startup times or expensive/inconsistent initialization still suffered lower availability and longer outages. GitLab in particular went down frequently, but deploying it via its Helm chart with most components distributed improved its reliability further. At this stage major outages were generally avoidable and availability hovered around 99.8% - daily, weekly, or monthly - largely due to small outages caused by preemptions. Adding more nodes and replicas for each service would shrink those outages (reducing the proportion of each service affected by a preemption), but that comes at a direct cost, and converting entirely to non-preemptible nodes would increase costs by almost 300%. Instead, changes to node types (putting some services on a small but more expensive pool of non-preemptible nodes), replica counts, and pod anti-affinity, using burstable CronJobs instead of always-on services, and tools that lessen the impact of preemption like the [k8s-node-termination-handler](https://github.com/GoogleCloudPlatform/k8s-node-termination-handler) and the [estafette-gke-preemptible-killer](https://github.com/estafette/estafette-gke-preemptible-killer/) have further increased availability.
[![Availability averages](/images/availability.png "Availability")*Availability has increased to generally >=99.9% for most services, and continues to increase.*](https://gitlab.palpant.us/grafana/d/i8erISaZk/availability?orgId=1)
These changes actually improved node utilization - CPU and memory requests generally exceed 85% of allocatable resources. Increased monitoring with Prometheus and Grafana has also helped: it's far easier to see the memory a service is really using and tune its allocation, and to see whether it is being CPU-limited and, if so, with what pattern (on startup, consistently, only under load, etc.).
[![Chart showing resource utilization in GKE cluster](/images/resource-utilization.png "Resource Utilization")*Resource requests generally exceed 80% of available resources in GKE*](https://gitlab.palpant.us/grafana/d/4XuMd2Iiz/cluster-status?orgId=1&refresh=30s)
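The scheduling and sizing changes described above are ordinary Kubernetes spec fields. A minimal sketch of what they look like together - the Deployment name, image, replica count, and all resource numbers below are placeholders, not values copied from this repo:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: personal-website        # hypothetical name, not a manifest from this repo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: personal-website
  template:
    metadata:
      labels:
        app: personal-website
    spec:
      affinity:
        # Prefer to spread replicas across nodes so a single preemption takes
        # out at most one replica.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: personal-website
              topologyKey: kubernetes.io/hostname
        # For services that can't tolerate preemption at all, require a
        # non-preemptible node (GKE labels preemptible nodes with
        # cloud.google.com/gke-preemptible=true).
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: DoesNotExist
      containers:
      - name: web
        image: registry.example.com/personal-website:latest   # placeholder image
        ports:
        - containerPort: 8080
        # Requests sized from observed usage in Grafana; limits allow startup
        # bursts without letting one pod destabilize a node. Values are
        # illustrative only.
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
          limits:
            cpu: 250m
            memory: 256Mi
```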
## Before the cloud
Previously, I had set up a three-node Kubernetes cluster on an odd collection of hardware I had lying around. Persistence was handled through NFS, iSCSI, and a local Ceph installation. Ingress was handled with nginx-ingress, and kube-lego was set up to generate certificates. Monitoring was done with Prometheus Operator, which bundles scalable Prometheus deployments, highly available alertmanager, and stateless Grafana while adding some Custom Resource Definitions that abstract away scraping configs. Gitlab was running in the cluster, along with a number of websites.
# Clusters
## palpantlab-gke-west1-b-01
@@ -126,6 +142,15 @@ GitLab deploys a nicely-configured monitoring stack as part of its recent Helm c
I also already had uptime monitoring checks configured via Stackdriver, a separate Google product. Stackdriver is great for this: the uptime checks are frequent and geographically distributed, they track latencies, and the alerting system is very powerful. Unfortunately, metric retention in Stackdriver is limited to only 6 weeks (42 days!) - too short to see changes over time - and the dashboarding and query language are also limited. To get around that, I deployed [stable/stackdriver-exporter](https://github.com/helm/charts/tree/master/stable/stackdriver-exporter) to export these uptime check metrics from Stackdriver to Prometheus, where I have more control over retention and aggregation. stackdriver-exporter also supports exporting arbitrary Stackdriver GCP metrics (like log-derived metrics, compute metrics, SQL server metrics, etc.) - however, API usage grows rapidly if you export these metrics regularly, and Google Cloud charges a high price for metric-read API calls beyond a threshold, so for now I export only the uptime check metrics to limit API usage.
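For illustration, restricting the exporter to the uptime-check metric family is just a matter of passing the right type prefix. The Helm value keys below are written from memory and should be checked against the chart's values.yaml; the project ID is a placeholder:

```yaml
# Hedged sketch of stable/stackdriver-exporter Helm values: export only the
# uptime-check metric family to keep Stackdriver read-API usage (and cost) low.
# Key names are from memory - verify against the chart's values.yaml.
stackdriver:
  projectId: my-gcp-project                                 # placeholder
  metrics:
    typePrefixes: "monitoring.googleapis.com/uptime_check"  # only uptime checks
    interval: "5m"                                          # lookback window per scrape
```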
#### Backups
I run two types of CronJob in GKE to make sure I don't lose data easily:
- Backup of CloudSQL database to GCS
- Backup of GCS to Dropbox
For CloudSQL backups, a [set of CronJobs](https://gitlab.palpant.us/justin/palpantlab-sql-dbs/blob/master/deploy/kubectl-apply/gke/backup-cronjob.yaml) runs hourly against each database, using the [export](https://cloud.google.com/sql/docs/postgres/import-export/exporting#cloud-sql) API to dump and gzip the database into a GCS bucket. These run hourly because the databases are frequently updated - Grafana and GitLab both write changes to them - and because each backup is extremely small and cheap to store.
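The shape of such a CronJob is roughly the following - a minimal sketch with placeholder instance, database, and bucket names, assuming the pod's credentials come from Workload Identity or a mounted service-account key rather than anything shown here:

```yaml
apiVersion: batch/v1beta1   # batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: cloudsql-backup-grafana   # placeholder name
spec:
  schedule: "0 * * * *"           # hourly
  concurrencyPolicy: Forbid       # never run two exports at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: export
            image: google/cloud-sdk:slim
            # Exporting to a .gz destination makes Cloud SQL gzip the dump.
            command:
            - /bin/sh
            - -c
            - >
              gcloud sql export sql my-sql-instance
              "gs://my-backup-bucket/grafana/$(date +%Y%m%dT%H%M%S).sql.gz"
              --database=grafana
```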
Because I have a lot of Dropbox storage, I also added a [set of Cronjobs](https://gitlab.palpant.us/justin/rclone-to-dropbox) that run rclone to copy data out of GCS into Dropbox. This is actually fairly expensive due to data egress from GCP - the most expensive job by far is copying all of my accumulated Prometheus data to Dropbox. This cost was reduced somewhat by running the Thanos Compactor every day, while running the rclone jobs only weekly, so that less raw data is copied.
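A hedged sketch of what one of these weekly rclone jobs looks like - the remote names, bucket, and the Secret holding rclone.conf are placeholders, and the real definitions live in the linked repo:

```yaml
apiVersion: batch/v1beta1   # batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: rclone-gcs-to-dropbox   # placeholder name
spec:
  schedule: "0 3 * * 0"         # weekly, Sunday 03:00
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: rclone
            image: rclone/rclone:latest
            # The image's entrypoint is rclone; copy the GCS bucket into Dropbox.
            args: ["copy", "gcs:my-backup-bucket", "dropbox:palpantlab-backups", "--transfers=4"]
            volumeMounts:
            - name: rclone-config
              mountPath: /config/rclone   # where the image expects rclone.conf (check image docs)
          volumes:
          - name: rclone-config
            secret:
              secretName: rclone-conf     # Secret containing rclone.conf with gcs + dropbox remotes
```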
## palpantlab-sfo
A single-node Kubernetes master running "on-prem". This single-node cluster has low reliability (I frequently reboot it and use it as a personal computer), but it is very capable for simple jobs and those where occasional failures are not significant, providing 16 modern CPU cores, 32 GiB of RAM, access to a NAS with 3 TiB of RAID1 storage, and access to a powerful GPU - an NVIDIA RTX 2080 Super.
@@ -285,6 +310,11 @@ I deployed the open-source project [haugene/docker-transmission-openvpn](https:/
After finding a bug in `pusher/oauth2_proxy`'s handling of group membership in Google Groups, I forked the project [here](https://gitlab.palpant.us/justin/oauth2_proxy), set up CI and builds, and implemented a change to support better group detection. That change has been submitted upstream and merged in [PR#224](https://github.com/pusher/oauth2_proxy/pull/224).
#### Prometheus/Thanos/Alertmanager
I deployed [stable/prometheus](https://gitlab.palpant.us/justin/palpantlab-gitlab/blob/master/deploy/helm-upgrade/sfo/prom-values.yaml) into the SFO cluster to collect metrics from that server and any processes running there. The GKE Prometheus can't easily scrape the SFO cluster, and even if it could, the latency would be atrocious. Because Thanos can serve metrics from multiple clusters as long as each Prometheus sets a unique set of external labels, I can query SFO metrics through Thanos once they are old enough to have been uploaded to GCS by the Sidecar component. Upload to GCS is free, so there's no additional cost. I can also run an Alertmanager in SFO to fire alerts as needed, although they're harder to triage and less valuable (outages in SFO usually happen because I reboot the server, in which case Prometheus and Alertmanager aren't running either).
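The two pieces that make this work are small: unique external labels on the SFO Prometheus, and a GCS object-store config for the Thanos sidecar. Illustrative fragments with placeholder label values and bucket name (not copied from prom-values.yaml):

```yaml
# prometheus.yml excerpt: external labels that make this Prometheus's blocks
# identifiable (and deduplicatable) in Thanos. Label values are illustrative.
global:
  external_labels:
    cluster: palpantlab-sfo
    replica: prometheus-0
---
# objstore.yml for the Thanos sidecar: where uploaded blocks land. The bucket
# name is a placeholder.
type: GCS
config:
  bucket: my-thanos-bucket
```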
The Prometheus process stores its data on a slice of disk exposed as a local PersistentVolume - essentially a HostPath directory - which gives it some resistance to losing data when I reboot.
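A minimal sketch of that kind of local PersistentVolume, with placeholder path and capacity - local volumes have to be pinned to the node that owns the disk, hence the nodeAffinity:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-local-pv       # placeholder name
spec:
  capacity:
    storage: 100Gi                # placeholder size
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/prometheus   # survives pod restarts and most reboots
  nodeAffinity:                   # local volumes must be pinned to a node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ubuntu-node-01
```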
# Copyright and License
Copyright 2020 Justin Palpant
Subproject commit 3283daf2745d5d6e8b22781c907038ff63a78522
Subproject commit f4d5a23f266b581254e84a86ea2ea05c0ecd5fd9
Subproject commit 36fb7bd089276022e5922a9331590d73beada128
Subproject commit a5a56c3a871dea22293ee4c60a3c4fdc3f777ca8
Subproject commit bba67fa500d259b6946364856c1b976f59535a49
Subproject commit 8e70b9cb1e6239b80601a5c4b8e26b6ff1fe7d34