If you were expecting four nodes to work four times faster than one, you’ll be disappointed: some models were even slower. Qwen3 235B running on llama.cpp, for instance, fell from 20.4 tokens/second on one node to 15.2 on four, presumably because llama.cpp has to use old-fashioned RPC calls over the network to share the work, and they ain’t exactly efficient. The latest version of EXO supports RDMA, which helped a lot: Qwen3 jumped from 19.55 to 31.2 t/s.
He also tested the monstrous Kimi K2 Thinking, a 1 trillion (that’s with a T) parameter model that turns most AI servers into blubbering wrecks. The $40K Apple system managed it, though (or at least a quantization of it, with 32 billion parameters active per token), turning in a decidedly non-blubbering 28.3 t/s.
RDMA over TB5 on Macs ain’t all that stable yet, though: Geerling found a lot of issues in his tests on pre-release versions of EXO, commenting that:
“When it works, it works great. When it doesn’t… well, let’s just say I was glad I had Ansible set up so I could shut down and reboot the whole cluster quickly.”
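For readers wondering what that reset button looks like in practice: a minimal sketch of such a playbook, assuming an inventory group hypothetically named `macs` (the actual names and setup in Geerling’s cluster are not published here), might be:

```yaml
# reboot-cluster.yml -- a minimal sketch, not Geerling's actual playbook.
# The "macs" group and its hostnames are hypothetical placeholders.
- hosts: macs
  become: true
  tasks:
    - name: Reboot every node and wait for it to come back
      ansible.builtin.reboot:
        reboot_timeout: 300
```

Run against an inventory with `ansible-playbook -i inventory reboot-cluster.yml`; the `reboot` module blocks until each node is reachable again, which is what makes a one-command cluster reset possible.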
As William Gibson once said, the future has arrived – it’s just not evenly distributed yet.