If you were expecting four nodes to work four times faster than one, you’ll be disappointed: some models were even slower. Qwen3 235B running on llama.cpp, for instance, fell from 20.4 tokens/second on one node to 15.2 on four, presumably because llama.cpp has to shuttle data between nodes with old-fashioned RPC calls, and they ain’t efficient. The latest version of EXO supports RDMA, which helped a lot: Qwen3 jumped from 19.55 to 31.2 t/s.
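To put those figures in perspective, here’s a quick back-of-the-envelope calculation (using the throughput numbers above) of how much of the ideal 4x linear speedup each setup actually delivers:

```python
# Scaling efficiency: measured cluster throughput vs. the ideal of
# single-node throughput x node count. Numbers from the tests above.

def scaling_efficiency(single_tps: float, cluster_tps: float, nodes: int) -> float:
    """Fraction of the ideal linear speedup actually achieved."""
    return cluster_tps / (single_tps * nodes)

# llama.cpp over RPC: four nodes end up slower than one.
rpc = scaling_efficiency(20.4, 15.2, 4)     # ~0.19, i.e. 19% of ideal

# EXO over RDMA: a genuine speedup, though still well short of 4x.
rdma = scaling_efficiency(19.55, 31.2, 4)   # ~0.40

print(f"llama.cpp RPC: {rpc:.0%} of ideal; EXO RDMA: {rdma:.0%} of ideal")
```

So even with RDMA, the four-node cluster reaches only about 40% of perfect linear scaling, which is why nobody should buy four Macs expecting four times the tokens.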

He also tested the monstrous Kimi K2 Thinking, a 1 trillion (that’s with a T) parameter model that turns most AI servers into blubbering wrecks. The $40K Apple cluster managed it, though (or at least a quantized version; as a mixture-of-experts model, it only has 32 billion parameters active per token), turning in a decidedly non-blubbering 28.3 t/s.

RDMA over TB5 on Macs ain’t all that stable yet, though: Geerling ran into plenty of issues in his tests on pre-release versions of EXO, commenting:

“When it works, it works great. When it doesn’t… well, let’s just say I was glad I had Ansible set up so I could shut down and reboot the whole cluster quickly.”

As William Gibson once said, the future has arrived – it’s just not evenly distributed yet.


