LLM Tinkering Notes
Ollama Deployment on Nvidia and AMD Platforms in a Windows Environment
Links
Ollama download: https://ollama.com/download
Ollama official site: https://ollama.com
Ollama official GitHub: https://github.com/ollama/ollama/
Nvidia
The latest version of CUDA and the CUDA Toolkit should be installed. As long as the GPU is supported by Ollama, it will be utilized automatically.
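A quick way to confirm the GPU is actually being used (a minimal check, assuming a model such as llama3.2 is already available; the exact `ollama ps` output format varies between versions):

```shell
# Load a small model non-interactively, then check where Ollama placed it
ollama run llama3.2 "hello"

# While the model is still loaded, inspect the processor split
ollama ps
# Example output (values are illustrative):
# NAME              ID              SIZE      PROCESSOR    UNTIL
# llama3.2:latest   abc123def456    4.0 GB    100% GPU     4 minutes from now

# Cross-check VRAM usage from the Nvidia side
nvidia-smi
```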
AMD
https://github.com/likelovewant/ollama-for-amd
There is no CUDA for AMD, so Ollama needs a replacement file for ROCm instead. The replacement file must match the chip spec (shader ISA) of the system (such as gfx900). It is best to check the spec on TechPowerUp first to make sure the GPU is supported. The GPU I am using (RX 6800S) has a shader ISA of gfx1032, which is supported by official Ollama. By downloading the corresponding version of the replacement file, Ollama is able to recognize the system's AMD GPU.
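A rough sketch of the replacement step (the install path and file locations here are assumptions based on a default Ollama install and the ollama-for-amd release layout; they change between versions, so check the repository's own instructions):

```shell
# 1. Install the ollama-for-amd build from its GitHub releases page.
# 2. Download the ROCm replacement archive matching your shader ISA
#    (e.g. a gfx1032 build for an RX 6800S).
# 3. Overwrite the rocblas files shipped with Ollama, typically found under:
#    %LOCALAPPDATA%\Programs\Ollama\lib\ollama\rocm\rocblas.dll
#    %LOCALAPPDATA%\Programs\Ollama\lib\ollama\rocm\rocblas\library\
# 4. Restart Ollama and verify the GPU is detected:
ollama run llama3.2 "hello"
ollama ps   # the PROCESSOR column should report "100% GPU" if the GPU was picked up
```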
Memory usage
When the running model exceeds the available VRAM capacity, Ollama will automatically spill part of it into system RAM. However, the model's speed (tokens per second) will drop significantly due to the extra latency of RAM-VRAM communication.
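The split is visible directly in `ollama ps`: a model that does not fully fit in VRAM is reported with a mixed CPU/GPU figure instead of 100% GPU (the output below is illustrative; the exact ratio depends on the model and on free VRAM at load time):

```shell
ollama ps
# NAME          ID              SIZE     PROCESSOR          UNTIL
# qwq:latest    abc123def456    22 GB    28%/72% CPU/GPU    4 minutes from now
```

If needed, the number of layers kept on the GPU can also be tuned with the num_gpu parameter (for example, /set parameter num_gpu 20 inside an ollama run session), though the automatic split is usually fine.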
Models
The Ollama library features a collection of models.
Official model list:
Model | Parameters | Size | Download |
---|---|---|---|
Gemma 3 | 1B | 815MB | ollama run gemma3:1b |
Gemma 3 | 4B | 3.3GB | ollama run gemma3 |
Gemma 3 | 12B | 8.1GB | ollama run gemma3:12b |
Gemma 3 | 27B | 17GB | ollama run gemma3:27b |
QwQ | 32B | 20GB | ollama run qwq |
DeepSeek-R1 | 7B | 4.7GB | ollama run deepseek-r1 |
DeepSeek-R1 | 671B | 404GB | ollama run deepseek-r1:671b |
Llama 4 | 109B | 67GB | ollama run llama4:scout |
Llama 4 | 400B | 245GB | ollama run llama4:maverick |
Llama 3.3 | 70B | 43GB | ollama run llama3.3 |
Llama 3.2 | 3B | 2.0GB | ollama run llama3.2 |
Llama 3.2 | 1B | 1.3GB | ollama run llama3.2:1b |
Llama 3.2 Vision | 11B | 7.9GB | ollama run llama3.2-vision |
Llama 3.2 Vision | 90B | 55GB | ollama run llama3.2-vision:90b |
Llama 3.1 | 8B | 4.7GB | ollama run llama3.1 |
Llama 3.1 | 405B | 231GB | ollama run llama3.1:405b |
Phi 4 | 14B | 9.1GB | ollama run phi4 |
Phi 4 Mini | 3.8B | 2.5GB | ollama run phi4-mini |
Mistral | 7B | 4.1GB | ollama run mistral |
Moondream 2 | 1.4B | 829MB | ollama run moondream |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
LLaVA | 7B | 4.5GB | ollama run llava |
Granite-3.3 | 8B | 4.9GB | ollama run granite3.3 |
Llama 3.1 8B and DeepSeek-R1 7B are two of the models I consider best for most setups. They run on 8 GB of VRAM, are only around 5 GB in size, and give decent response quality. Obviously you should not expect output of the same quality as a 400B model, but in my view this is the bare minimum parameter count for relatively natural output, rather than the distinctly robotic feel of 1B or 3B models.
Interestingly, DeepSeek-R1 7B (distilled from Qwen 7B) actually performs better than DeepSeek-R1 8B (distilled from Llama 3 8B).
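As a minimal example for an 8 GB VRAM card (model tags as listed in the table above; the sizes reported by `ollama list` may differ slightly from the table):

```shell
ollama pull llama3.1        # 8B, roughly 4.7 GB download
ollama pull deepseek-r1     # 7B distill, roughly 4.7 GB download
ollama list                 # verify the downloaded models and their sizes
ollama run deepseek-r1 "Explain the difference between VRAM and system RAM in two sentences."
```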