Although I like the article, the author doesn't acknowledge the current way that (I think) most people are using llama.cpp at this point. ollama.com has simplified his work into two lines:
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2

Wait... there is one binary that executes just fine on half a dozen platforms? What wizardry is this?

edit: Their default LLM worked great on Windows, with fast inference on my 2080 Ti. You have to pass "-ngl 9999" to offload onto the GPU, or else it runs on the CPU. It is multi-modal, too.
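
(To make the reply's last point concrete: a minimal sketch of where that "-ngl 9999" flag goes on a llama.cpp-style command line. The binary and model names below are placeholders, not anything from the article; the flag sets how many layers are offloaded to the GPU, and an oversized value like 9999 effectively means "all of them".)

# placeholder names; substitute your own binary and GGUF model file
./main -m model.gguf -p "Why is the sky blue?" -ngl 9999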
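
(And on the ollama side from the top of the thread: once "ollama run" has pulled a model, the background server it talks to can also be queried over HTTP, which is how editors and other tools usually hook into it. A small illustration, assuming ollama is listening on its default address of localhost:11434:)

# ask the locally running ollama server for a completion (curl -d makes this a POST)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'

Without "stream": false the response comes back as a stream of JSON objects, one chunk per line.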