Meet Number 6

@themarkymark · 2025-11-05 23:21 · ai

image.png (Battlestar Galactica 2003 remake)

Ok, not that number six, this number 6.

image.png

This is my new AI server, named after Number Six from Battlestar Galactica (the 2003 reboot), one of the best sci-fi shows ever made. I tried to come up with a cool AI name based on a movie or show, and that one was the obvious pick.

What is this thing?

Technically, it's my old computer from before my recent PC upgrade. I've been looking for a decent local AI server that is fast enough that I don't want to throw it out the window.

I had a lot of fun tinkering with the Strix Halo, but it is just a toy and extremely overrated for any actual workload. It is great for experimenting and playing with much larger models than you usually can.

Specs

- AMD 5950X
- ASUS Dark Hero AM4 motherboard
- 32GB DDR4 RAM
- Dual NVIDIA RTX 6000 Pro 600W Workstation Edition

It doesn't look like much until you see the GPUs. These are essentially faster 5090s with a full 96GB of VRAM each, giving me a total of 192GB of VRAM to work with. This system was just sitting in the closet and is a perfect home for two RTX 6000 Pros, with only a very minor performance loss compared to a top-of-the-line AM5 system.

I did run into a few problems. The major one was that the PSU did not fit in this case; it is considerably larger than a typical PSU, so for now it is sitting on the desk behind the case. I needed a larger power supply, and one that supports two 12V-2x6 cables.

I am planning on getting another H9 Flow case like my main system, which will fit the larger PSU and give it even more airflow.

image.png

While 32GB of DDR4 RAM is not impressive, I do not plan to do any CPU offloading and will only use the GPUs. I used to have more RAM in this system, but it isn't needed.

What am I running on it?

I am still testing models and tweaking performance. Right now I am running GLM 4.5 Air FP8 on SGLang while I wait for GLM 4.6 Air to be released.

GLM 4.5 Air is a 106B parameter model from Z.ai and is considered the best model at this size. The model weights alone come in at around 118GB, and once you add the KV cache for long contexts you can easily fill 192GB.
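
Since 118GB of weights won't fit on a single 96GB card, the model gets split across both GPUs with tensor parallelism. For reference, launching it looks roughly like this (a minimal sketch; the Hugging Face repo name, port, and exact flags are assumptions you'd want to check against the SGLang docs):

```python
# Minimal sketch: serving GLM 4.5 Air FP8 across two GPUs with SGLang.
# Assumes sglang is installed and both RTX 6000 Pros are visible to CUDA.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "zai-org/GLM-4.5-Air-FP8",   # assumed repo name, ~118GB of FP8 weights
    "--tp-size", "2",                            # tensor parallel: split weights across both GPUs
    "--context-length", "131072",                # the model's maximum context window
    "--port", "30000",
])
```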

image.png

I have a few other models I want to test. I have been using GPT-OSS-120B locally for a while and was able to get 50 tokens/sec on the Strix Halo, which sounds great, but it does poorly under real usage because the slow RAM makes prompt processing painfully slow.

While GLM 4.5 Air is a slightly smaller model at 106B parameters compared to GPT-OSS-120B's 120B, it uses far more active parameters per token. Both are MoE (Mixture of Experts) models, which are popular because only a fraction of the parameters are active for any given token, and that is where the amazing speeds come from. GPT-OSS-120B activates only around 5B parameters per token, while GLM 4.5 Air activates 12B. I am also running GLM 4.5 Air at FP8 quantization, which is twice as many bits per weight as GPT-OSS-120B's MXFP4. In other words, GLM 4.5 Air is considerably more demanding to run.
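
To put rough numbers on that, here is a back-of-the-envelope comparison of how many bytes of weights each model has to read per generated token (the active-parameter counts and per-weight sizes are approximations, not measured values):

```python
# Back-of-the-envelope: weight bytes touched per generated token (approximate figures).
glm_active_params = 12e9        # GLM 4.5 Air: ~12B active parameters per token
glm_bytes_per_weight = 1.0      # FP8 = 8 bits = 1 byte per weight

gpt_oss_active_params = 5.1e9   # GPT-OSS-120B: ~5B active parameters per token
gpt_oss_bytes_per_weight = 0.5  # MXFP4 = ~4 bits = 0.5 bytes per weight

glm_gb = glm_active_params * glm_bytes_per_weight / 1e9
gpt_gb = gpt_oss_active_params * gpt_oss_bytes_per_weight / 1e9

print(f"GLM 4.5 Air FP8:    ~{glm_gb:.1f} GB per token")
print(f"GPT-OSS-120B MXFP4: ~{gpt_gb:.1f} GB per token")
print(f"GLM 4.5 Air reads roughly {glm_gb / gpt_gb:.1f}x more weight data per token")
```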

I have been seeing as much as 138 tokens/second peak from this rig on GLM 4.5 Air, with most requests giving me 100-120 tokens/second. Even at 122K context, I am still seeing around 75 tokens/second. Prompt processing is also very fast, so it is really quick to spit out the first token (low TTFT).

image.png

Here you can see it summarizing a book that comes to around 127,000 tokens, very close to the maximum 131,072 tokens this model is capable of. As you fill up the context window with data, models get a lot slower, yet I am still able to reach an impressive 77 tokens per second at max context.
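
Here is roughly how you can spot-check the generation speed and TTFT yourself, using SGLang's OpenAI-compatible endpoint (a sketch; the port and model name follow the launch example above, and counting stream chunks is only an approximation of token count):

```python
# Rough throughput / TTFT check against the local SGLang server.
# Assumes the server from the launch sketch is running on localhost:30000.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

start = time.time()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air-FP8",
    messages=[{"role": "user", "content": "Summarize the following book: ..."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()   # time to first token
        chunks += 1                        # one chunk is roughly one token

print(f"TTFT: {first_token_at - start:.2f}s")
print(f"~{chunks / (time.time() - first_token_at):.1f} tokens/sec generation")
```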

This thing is a beast. Ideally I want to be running the full 357B parameter GLM 4.6, but until DDR6 is released I will stick with this two-GPU setup.

Power usage

Here is where things get interesting. Summarizing that 127K-token book trips my UPS at almost 1400W of draw.

image.png

The PSU can handle 1500W, but the UPS has other things plugged into it.

If I power limit the two GPUs to 300W each, I can drastically reduce the total power draw down to 784W.

image.png

Surely this means I am getting around half the speed?

image.png

I lost a whole 3 tokens/second! That is about a 3.9% performance hit for a 43.43% reduction in power usage.

Quite a fair trade I must say. Nvidia does make a Max-Q version of the card that is capped at 300W and uses a different cooler style, but I didn't want to pay the same amount for less potential power in case I decide to use the cards differently.
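
For anyone wanting to try the same cap, it is just an nvidia-smi power limit (a sketch; I'm assuming the two cards show up as GPU indices 0 and 1, and the command typically needs root):

```python
# Apply a 300W power limit to both GPUs via nvidia-smi.
# Note: the limit resets on reboot unless you reapply it or enable persistence mode.
import subprocess

for gpu_index in (0, 1):  # assumed indices of the two RTX 6000 Pros
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", "300"], check=True)
```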

Why?

7HlFnim.gif

Good question. My primary reason is analyzing stock data. I do not believe AI can predict price action, but it can churn through a massive amount of data, and if driven properly it can increase your edge (alpha). I already heavily use AI for trading, much of it through cloud providers, but I don't want my data going to third parties.

#ai #technology #leofinance #hive-engine #vyb #pob #cent #neoxian #palnet