• brucethemoose@lemmy.world

    Oh no, you got it backwards. The software is everything, and ollama is awful. It’s enshittifying: don’t touch it with a 10-foot pole.

    Speeds are basically limited by CPU RAM bandwidth. Hence you want to be careful doubling up RAM: populating two DIMMs per channel can drop the maximum supported memory speed (and hence cut your inference speed).
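
    As a rough back-of-the-envelope, decode speed tops out around memory bandwidth divided by the bytes of weights read per token. A minimal sketch, with illustrative numbers (plug in your own bandwidth and quant size):

    ```python
    # Rough ceiling on CPU-side decode speed; all numbers are illustrative guesses.
    bandwidth_gb_s = 51.2  # e.g. dual-channel DDR4-3200 (2 x 25.6 GB/s)
    active_gb = 6.0        # expert weights actually read per token at your quant

    print(f"~{bandwidth_gb_s / active_gb:.1f} tok/s upper bound from RAM bandwidth")
    ```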

    Anyway, start with this. Pick your quant size based on how much free CPU RAM you can spare:

    https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
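
    If it helps, here’s a quick fit check with placeholder sizes (read the real GGUF size off the model card):

    ```python
    # Does a given quant fit in VRAM plus the system RAM you're willing to spare?
    # All sizes are placeholder assumptions; check the actual file on Hugging Face.
    gguf_size_gb = 60.0   # the quant you're eyeing
    vram_gb = 10.0        # RTX 3080
    spare_ram_gb = 56.0   # system RAM budget for the expert tensors
    overhead_gb = 4.0     # rough guess: KV cache, compute buffers, OS headroom

    print("fits" if gguf_size_gb + overhead_gb <= vram_gb + spare_ram_gb
          else "pick a smaller quant")
    ```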

    The “dense” parts will live on your 3080 while the “sparse” parts will run on your CPU. The backend you want is this, specifically the built-in llama-server:

    https://github.com/ikawrakow/ik_llama.cpp/
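
    As a sketch of what launching it looks like (model path is a placeholder, and flags vary by build, so double-check llama-server --help): -ngl 99 pushes everything onto the 3080, and -ot exps=CPU then pins the sparse expert tensors back to system RAM.

    ```python
    # Minimal llama-server launch sketch via Python; a plain shell command works
    # the same way. Model path and sizes are placeholders -- adjust to your setup.
    import subprocess

    subprocess.run([
        "./llama-server",
        "-m", "GLM-4.5-Air-IQ4_KSS.gguf",  # whichever quant you downloaded
        "-ngl", "99",                      # offload all layers to the GPU...
        "-ot", "exps=CPU",                 # ...then keep expert tensors in RAM
        "-c", "16384",                     # context length, to taste
        "--host", "127.0.0.1",
        "--port", "8080",
    ])
    ```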

    Regular llama.cpp is fine too, but its quants just aren’t quite as optimal or fast.

    It has two really good built-in web UIs: the “new” llama.cpp chat UI, and mikupad, a “raw” notebook mode aimed more at creative writing. But you can use LM Studio if you want, or anything else; there are like a bazillion frontends out there.
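
    And since llama-server exposes an OpenAI-style API, scripting it yourself is a few lines. A sketch, assuming the default localhost:8080 from the launch above:

    ```python
    # Query llama-server's OpenAI-compatible chat endpoint directly.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
    ```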