LLM Tinkering Notes

Ollama Deployment on Nvidia and AMD Platforms in a Windows Environment

Ollama download: https://ollama.com/download
Ollama official site: https://ollama.com
Ollama official GitHub: https://github.com/ollama/ollama/
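
After installation, a quick sanity check can be done from a Windows terminal (the model tag below is just an example from the list further down):

ollama --version
REM The first run downloads the model, then starts an interactive chat
ollama run llama3.2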

Nvidia

The latest Nvidia driver and CUDA Toolkit should be installed. As long as the GPU is supported by Ollama, it will be utilized automatically.
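
A minimal way to verify the card is actually being used (assuming the standard Nvidia driver tools are on PATH):

REM Confirm the driver and GPU are visible
nvidia-smi
REM Load a model, then check how it was scheduled while it is still resident
ollama run llama3.1
ollama ps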

AMD

https://github.com/likelovewant/ollama-for-amd
There is no CUDA for AMD, so Ollama needs a replacement ROCm library file instead. The replacement must match the chip's shader ISA (such as gfx900), so it is best to check the spec on TechPowerUp first to make sure the GPU is supported. The card I am using (RX 6800S) has a shader ISA of gfx1032, which is supported by official Ollama. After downloading the corresponding replacement, Ollama is able to recognize the system's AMD GPU.
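
As a rough sketch of the replacement step (paths and file names are illustrative only; they differ between Ollama versions and the fork's releases, so follow the instructions shipped with the ROCm library package for your gfx version):

REM Illustrative only: overwrite the bundled ROCm BLAS files with the build matching the GPU's shader ISA
copy /Y rocblas.dll "%LOCALAPPDATA%\Programs\Ollama\lib\ollama\rocm\rocblas.dll"
xcopy /E /Y library "%LOCALAPPDATA%\Programs\Ollama\lib\ollama\rocm\rocblas\library\"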

Memory usage

When the running model exceeds the available VRAM, Ollama automatically offloads part of it to system RAM. However, the speed of the model (tokens per second) drops significantly because of the extra latency of RAM-VRAM communication.
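
You can see how a loaded model was split with ollama ps: the PROCESSOR column reads something like "100% GPU" when the model fits entirely in VRAM, or a CPU/GPU percentage split when it spills into system RAM.

REM Run while a model is loaded (it stays resident for a few minutes after use)
ollama ps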

Models

The Ollama library features a collection of models.
Official model list (https://ollama.com/library):

Model                Parameters   Size     Download
Gemma 3              1B           815MB    ollama run gemma3:1b
Gemma 3              4B           3.3GB    ollama run gemma3
Gemma 3              12B          8.1GB    ollama run gemma3:12b
Gemma 3              27B          17GB     ollama run gemma3:27b
QwQ                  32B          20GB     ollama run qwq
DeepSeek-R1          7B           4.7GB    ollama run deepseek-r1
DeepSeek-R1          671B         404GB    ollama run deepseek-r1:671b
Llama 4              109B         67GB     ollama run llama4:scout
Llama 4              400B         245GB    ollama run llama4:maverick
Llama 3.3            70B          43GB     ollama run llama3.3
Llama 3.2            3B           2.0GB    ollama run llama3.2
Llama 3.2            1B           1.3GB    ollama run llama3.2:1b
Llama 3.2 Vision     11B          7.9GB    ollama run llama3.2-vision
Llama 3.2 Vision     90B          55GB     ollama run llama3.2-vision:90b
Llama 3.1            8B           4.7GB    ollama run llama3.1
Llama 3.1            405B         231GB    ollama run llama3.1:405b
Phi 4                14B          9.1GB    ollama run phi4
Phi 4 Mini           3.8B         2.5GB    ollama run phi4-mini
Mistral              7B           4.1GB    ollama run mistral
Moondream 2          1.4B         829MB    ollama run moondream
Neural Chat          7B           4.1GB    ollama run neural-chat
Starling             7B           4.1GB    ollama run starling-lm
Code Llama           7B           3.8GB    ollama run codellama
Llama 2 Uncensored   7B           3.8GB    ollama run llama2-uncensored
LLaVA                7B           4.5GB    ollama run llava
Granite-3.3          8B           4.9GB    ollama run granite3.3

Llama 3.1 8B and DeepSeek-R1 7B are the two models I consider best for most setups. They run within 8 GB of VRAM, are only around 5 GB in size, and give decent response quality. Obviously you should not expect output of the same quality as a 400B model, but I think this is roughly the minimum parameter count for relatively natural output, as opposed to the really robotic-feeling responses of 1B or 3B models.
Interestingly, DeepSeek-R1 7B (distilled from Qwen 7B) actually performs better than DeepSeek-R1 8B (distilled from Llama 3 8B).
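
For reference, the recommended models can be pulled ahead of time and their on-disk sizes checked (the explicit size tags are assumed here; the bare names from the table above work as well):

ollama pull llama3.1:8b
ollama pull deepseek-r1:7b
REM List downloaded models and their sizes
ollama list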
