Elon Musk's AI assistant Grok boasted that the billionaire had the "potential to drink piss better than any human in history," among other absurd claims.
Ahh, thank you, I had misunderstood that, since DeepSeek is (more or less) an open-source LLM from China that can also be run and fine-tuned on your own hardware.
Do you have a cluster with 10 A100s lying around? Because that’s what it takes to run DeepSeek. It is open source, but it is far from accessible to run on your own hardware.
I run quantized versions of DeepSeek that are usable enough for chat, and it’s on a home setup so old and slow by today’s standards that I won’t even mention the specs lol. Let’s just say the rig is from 2018 and it wasn’t near the best even back then.
That’s not strictly true.
I have a Ryzen 7800 gaming desktop, an RTX 3090, and 128GB DDR5. Nothing that unreasonable. And I can run the full GLM 4.6 with quite acceptable token divergence compared to the unquantized model, see: https://huggingface.co/Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF
If I had an EPYC/Threadripper homelab, I could run DeepSeek the same way.
Yes, that’s true. It is resource-intensive, but unlike with other capable LLMs, running it yourself is at least possible: not for most private individuals, given the requirements, but for companies with the necessary budget.
They’re overestimating the costs. 4x H100 and 512GB DDR4 will run the full DeepSeek-R1 model; that’s about $100k of GPU and $7k of RAM. It’s not something you’re going to have in your homelab (for a few years at least), but it’s well within the budget of a hobbyist group or a moderately sized local business.
Since it’s an open-weights model, people have created quantized versions of it. Quantization stores each weight in fewer bits (the parameter count stays the same), which makes the RAM requirements a lot lower.
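Rough back-of-envelope numbers for why that matters, assuming DeepSeek-R1’s 671B total parameters (this counts only the weights; KV cache and runtime overhead come on top):

```python
# Approximate weight storage: parameter count × bits per parameter / 8 bytes.
# Quantization lowers the bits per parameter, not the number of parameters.

def weights_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GiB at a given precision."""
    return n_params * bits_per_param / 8 / 2**30

N = 671e9  # DeepSeek-R1 total parameter count

for label, bits in [("FP16", 16), ("FP8", 8), ("~4.5-bpw quant", 4.5)]:
    print(f"{label:>15}: {weights_gib(N, bits):5.0f} GiB")
```

So a ~4.5-bits-per-weight quant needs roughly 350 GiB instead of ~1.25 TiB at FP16, which is how the 4x H100 + 512GB RAM figure above becomes workable.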
You can run quantized versions of DeepSeek-R1 locally. I’m running deepseek-r1-0528-qwen3-8b (strictly speaking a distill of R1 onto Qwen3-8B rather than a quant of the full model) on a machine with an NVIDIA 3080 12GB and 64GB RAM. Unless you pay for an AI service and are using their flagship models, it’s pretty indistinguishable from the full model.
If you’re coding or doing other tasks that push the model, it’ll stumble more often, but for a ‘ChatGPT’-style interaction you couldn’t tell the difference between it and ChatGPT.
You should be running hybrid inference of GLM Air with a setup like that. Qwen 8B is kinda obsolete.
I dunno what kind of speeds you absolutely need, but I bet you could get at least 12 tokens/s.
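That 12 tokens/s guess is plausible from first principles: decode speed when weights live in system RAM is roughly memory bandwidth divided by the bytes of active weights read per token. A sketch with assumed numbers (GLM Air activates roughly 12B parameters per token; the quant level and bandwidth figures are illustrative, not benchmarks):

```python
# Rough decode-speed estimate for a MoE model served (partly) from system RAM.
# tokens/s ≈ usable memory bandwidth / bytes of active weights per token.
# All inputs are assumptions for illustration, not measurements.

def est_tokens_per_s(active_params: float, bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~12B active params, ~4.5-bpw quant, ~80 GB/s usable dual-channel DDR5.
print(f"{est_tokens_per_s(12e9, 4.5, 80):.1f} tokens/s")
```

Offloading the hot layers to the GPU pushes the effective number higher, which is the point of hybrid inference.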
Thanks for the recommendation, I’ll look into GLM Air; I haven’t looked at the current state of the art for self-hosting in a while.
I just use this model to translate natural language into JSON commands for my home automation system. I probably don’t need a reasoning model, but it doesn’t need to be super quick. A typical query uses very few tokens (like 3-4 keys in JSON).
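That pattern is easy to sketch: the model is prompted to reply with bare JSON, and the controller parses and sanity-checks the reply before dispatching it. All the names and the schema below are hypothetical, not the poster’s actual setup:

```python
import json

# Hypothetical whitelist of what the automation system will accept.
ALLOWED_DEVICES = {"living_room_light", "thermostat", "tv"}
ALLOWED_ACTIONS = {"on", "off", "set"}

def parse_command(model_reply: str) -> dict:
    """Parse the LLM's JSON reply, rejecting anything outside the whitelist."""
    cmd = json.loads(model_reply)  # raises on malformed JSON
    if cmd.get("device") not in ALLOWED_DEVICES:
        raise ValueError(f"unknown device: {cmd.get('device')!r}")
    if cmd.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {cmd.get('action')!r}")
    return cmd

# e.g. the model was prompted: reply with JSON only, keys device/action/value.
reply = '{"device": "thermostat", "action": "set", "value": 21}'
print(parse_command(reply))
```

Validating against a whitelist matters more than the model choice here: even a small model that occasionally hallucinates a device name just gets its command rejected instead of doing something odd to the house.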
The next project will be some kind of agent. A ‘go and Google this and summarize the results’ agent at first. I haven’t messed around much with MCP Servers or Agents (other than for coding). The image models I’m using are probably pretty dated too, they’re all variants of SDXL and I stopped messing with ComfyUI before video generation was possible locally, so I gotta grab another few hundred GB of models.
It’s a lot to keep up with.😮💨
Massive understatement!
Yeah, you do want more contextual intelligence than an 8B for this.
Actually SDXL is still used a lot! Especially for the anime stuff. It just got so much finetuning and tooling piled on.
Oh yeah, I’m sure. I may peek at it this weekend. I’m trying to decide if Santa is going to bring me a new graphics card, so I need to see what the price:performance curve looks like.
I think I stopped actively using image generation a little after LoRAs and IP-Adapters were introduced. I was trying to edit a video (a random meme gif) to swap the people’s faces for my family’s, but it was very hard to keep consistency between frames. Now that locally generated video exists, it seems like someone has solved that problem.