My 3 Node Proxmox Cluster Lasted a Month

Deciding to Cluster

I have been slowly joining the movement of “de-googling” my life. While this is slow going, since convenience is still a requirement for me, I have made further progress keeping more of my data local. With that shift, though, my homelab was starting to feel more like my own “datacenter”.

I use “datacenter” loosely, but now that I was running apps I truly needed to keep up, it was time to shift how I treated my homelab. Wanting to follow better practices and focus on uptime, reliability, and expansion, I spun up my datacenter VLAN and went to planning.

Problem 1: Don’t Route Your Storage

Tom Lawrence of Lawrence Systems has said and written “don’t route your storage” more than a few times. Because of that, I wanted to make sure I wasn’t committing that blunder myself. Since my current NAS sits in the homelab VLAN and the cluster would live in the datacenter VLAN, this needed to be researched and solved first, as my VM disks are hosted on the NAS.

Turns out, this really isn’t that difficult. In my case, I created a dedicated “storage” VLAN to carry most of the VM disk traffic, such as boot drives and backups. With that VLAN created, I added an interface on the Proxmox nodes and on the NAS, each with an IP in the storage subnet.
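On the Proxmox side, this ends up being just another VLAN interface on the existing bridge. A minimal sketch, assuming a VLAN-aware bridge called vmbr0 and VLAN 30 as the storage VLAN (the bridge name, VLAN ID, and addresses here are examples, not my actual config):

```
# /etc/network/interfaces (excerpt) on a Proxmox node
# VLAN 30 tagged on the existing bridge; the IDs and addressing
# are example values
auto vmbr0.30
iface vmbr0.30 inet static
        address 10.0.30.11/24
        # deliberately no gateway: storage traffic stays layer 2
        # inside this subnet
```

The NAS just gets its own address in the same subnet, and the storage entry in Proxmox points at that IP.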

Now I could expand to multiple nodes all connecting back to one NAS, with the upgrade path being just some sort of 10Gb aggregation switch; since everything is on the same subnet, the switch would only need to support layer 2.

Post Cluster

Between the Proxmox documentation and simply restoring from my backups, migrating from my single node to three Dell R420s and an R620 was super easy and painless. The only issue was that the nodes only had 1Gb NICs. This made transfers slow, and once a node had a VM running, the transfers would sometimes lag the VM or make it unresponsive. I didn’t look much further into it, since once all the transfers were complete the problem never came back.
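For context, a restore onto a new node is basically a one-liner from the CLI. A sketch, with a hypothetical archive path, VM ID, and storage name:

```
# Restore a vzdump backup from the NAS onto this node; the
# archive path, VM ID, and target storage are example values
qmrestore /mnt/pve/nas-backups/dump/vzdump-qemu-101-2024_05_01-02_00_00.vma.zst 101 --storage nas-vmdisks
```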

So now I had a working cluster with shared network storage. I marked a few of the VMs as highly available (HA) and then shut down one of the nodes. After a little while, sure enough, the VMs appeared on the other nodes! The time from a VM going down until it was back up wasn’t all that long, but it would certainly be a noticeable outage.
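Marking a VM as HA can be done in the GUI, or with ha-manager on the CLI. A quick sketch, with the VM ID as a placeholder:

```
# Put VM 101 under HA management and request it be kept running
ha-manager add vm:101 --state started

# Check what the HA stack currently sees across the cluster
ha-manager status
```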

I then tested live migration, and it was awesome to see only one dropped ping to the VM during the transfer. This means that if I want higher uptime during planned work, I should migrate first rather than rely on the HA health checks and failover process.
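The CLI equivalent is just as simple, if you prefer it over the GUI; the VM ID and target node name here are examples:

```
# Live-migrate VM 101 to node pve2 while it keeps running
qm migrate 101 pve2 --online
```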

Problem 2: Power Draw and Removing a Node

Kind of a “duh” moment, but the power draw of four 1U servers for roughly 12-18 VMs, depending on what’s running, is overkill. I decided to remove the R620; while its 192GB of RAM and 24 cores made it a beast, each R420 still had 64GB of RAM and 12 cores of its own.

After powering down the R620, I saw roughly a 110-watt decrease on the UPS. Still not great, but certainly better. That meant I now needed to remove the node from the cluster. This is also fairly simple: you adjust the expected quorum on a cluster member that’s staying, then run a few commands on the node being removed to clear out all the clustering configuration, and reboot!
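Roughly, it looks like the sketch below. Treat it as an outline of the documented procedure rather than a copy-paste recipe; the node name is an example:

```
# On a node that is staying: lower the expected quorum votes
# (if needed), then drop the departing node from the cluster
pvecm expected 2
pvecm delnode pve-r620

# On the node being removed: stop the cluster services and wipe
# the corosync config so the node can run standalone again
systemctl stop pve-cluster corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster
```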

Just make sure you move the VMs off the node being removed first, or back them up. RIP to VM-AD02.

This is Overkill

After about a month of enjoying the general feeling of “I have a cluster!” and listening to the fans gently wrrrr, I decided the power draw, plus the added complexity of updating (and potentially repairing) three nodes just to maintain quorum, was simply not worth it to me. So it was time to go back to one node.

During the clustering, I had turned the original Proxmox node, my HP Z4 G4, into a Linux workstation/gaming machine in an attempt to spend more time in Linux. Because of that, I didn’t really want to use it as my Proxmox server again. A reasonable person might pick an R420, or sell all the 1U servers and buy something more sensible. I decided to be a bit silly instead.

One Node to Rule Them All

I decided to just use the R620 as my main node. My reasoning: it has more than enough RAM and cores to handle any hosting or homelabbing I would need, without any changes to the node. This also freed up the three R420s for whatever I want to install. Currently, one is running TrueNAS with two SSDs and six SAS drives to explore running PBS as a VM on TrueNAS with the SAS pool passed through directly.

Being able to test more bare-metal configurations, and to spin up test VMs right next to “production” VMs, has also made experimenting much easier. After that testing, I might look into other compute clustering solutions, along with some other hypervisors.

Going in a Circle Doesn’t Mean You Didn’t Learn Something

Despite ending up in the same configuration I started with, I learned a lot. First and foremost: power usage should be much higher on my list when considering new solutions and projects. Next would be taking more time to document and plan out next steps; I did some of this, but I still wish I had done more. I also got some hands-on time with clustering and saw what realistic failovers look like.

Most importantly, I got to test my backups. Through this process I nuked and restored VMs over and over for various reasons, and every time the backups worked great. Knowing my backup solution works has definitely enabled my curiosity: if I break something and don’t have the time or interest to learn the fix, I can just restore the last backup.

I think everyone with some interest in clustering and uptime should build and live with a cluster at least once. Some will find it genuinely useful, such as those self-hosting public-facing services for a small business or hobby project. I don’t host anything publicly, and I’m the only user of the VMs’ services (outside of a Factorio server), so uptime only affects me.

Given how much time I spend at a computer, I’ll welcome that down time.

Thanks for stopping by, stay caffeinated. ☕️
