LLM Tinkering Notes
Ollama Deployment on Nvidia and AMD Platforms in a Windows Environment
Links
Ollama download: https://ollama.com/download
Ollama official site: https://ollama.com
Ollama official GitHub: https://github.com/ollama/ollama/
Nvidia
The latest version of CUDA and the CUDA Toolkit should be installed. As long as the GPU is supported by Ollama, it will be utilized automatically.
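A quick way to confirm the GPU is actually being used (a minimal check, assuming a model such as llama3.2 is already available; the exact `ollama ps` output format varies between versions):

```shell
# Load a small model non-interactively, then check where Ollama placed it
ollama run llama3.2 "hello"

# While the model is still loaded, inspect the processor split
ollama ps
# Example output (values are illustrative):
# NAME              ID              SIZE      PROCESSOR    UNTIL
# llama3.2:latest   abc123def456    4.0 GB    100% GPU     4 minutes from now

# Cross-check VRAM usage from the Nvidia side
nvidia-smi
```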
AMD
https://github.com/likelovewant/ollama-for-amd
There is no CUDA for AMD, so Ollama needs a replacement file for ROCm instead. The replacement file must match the chip spec (shader ISA) of the system (such as gfx900). It is best to check the spec on TechPowerUp first to make sure the GPU is supported. The GPU I am using (RX 6800S) has a shader ISA of gfx1032, which is supported by official Ollama. By downloading the corresponding version of the replacement file, Ollama is able to recognize the system's AMD GPU.
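A rough sketch of the replacement step (the install path and file locations here are assumptions based on a default Ollama install and the ollama-for-amd release layout; they change between versions, so check the repository's own instructions):

```shell
# 1. Install the ollama-for-amd build from its GitHub releases page.
# 2. Download the ROCm replacement archive matching your shader ISA
#    (e.g. a gfx1032 build for an RX 6800S).
# 3. Overwrite the rocblas files shipped with Ollama, typically found under:
#    %LOCALAPPDATA%\Programs\Ollama\lib\ollama\rocm\rocblas.dll
#    %LOCALAPPDATA%\Programs\Ollama\lib\ollama\rocm\rocblas\library\
# 4. Restart Ollama and verify the GPU is detected:
ollama run llama3.2 "hello"
ollama ps   # the PROCESSOR column should report "100% GPU" if the GPU was picked up
```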
Memory usage
When the running model exceeds the available VRAM capacity, Ollama will automatically spill part of it into system RAM. However, the model's speed (tokens per second) will drop significantly due to the extra latency of RAM-VRAM communication.
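The split is visible directly in `ollama ps`: a model that does not fully fit in VRAM is reported with a mixed CPU/GPU figure instead of 100% GPU (the output below is illustrative; the exact ratio depends on the model and on free VRAM at load time):

```shell
ollama ps
# NAME          ID              SIZE     PROCESSOR          UNTIL
# qwq:latest    abc123def456    22 GB    28%/72% CPU/GPU    4 minutes from now
```

If needed, the number of layers kept on the GPU can also be tuned with the num_gpu parameter (for example, /set parameter num_gpu 20 inside an ollama run session), though the automatic split is usually fine.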
Models
The Ollama library features a collection of models.
Official model list:
Model | Parameters | Size | Download |
---|---|---|---|
Gemma 3 | 1B | 815MB | ollama run gemma3:1b |
Gemma 3 | 4B | 3.3GB | ollama run gemma3 |
Gemma 3 | 12B | 8.1GB | ollama run gemma3:12b |
Gemma 3 | 27B | 17GB | ollama run gemma3:27b |
QwQ | 32B | 20GB | ollama run qwq |
DeepSeek-R1 | 7B | 4.7GB | ollama run deepseek-r1 |
DeepSeek-R1 | 671B | 404GB | ollama run deepseek-r1:671b |
Llama 4 | 109B | 67GB | ollama run llama4:scout |
Llama 4 | 400B | 245GB | ollama run llama4:maverick |
Llama 3.3 | 70B | 43GB | ollama run llama3.3 |
Llama 3.2 | 3B | 2.0GB | ollama run llama3.2 |
Llama 3.2 | 1B | 1.3GB | ollama run llama3.2:1b |
Llama 3.2 Vision | 11B | 7.9GB | ollama run llama3.2-vision |
Llama 3.2 Vision | 90B | 55GB | ollama run llama3.2-vision:90b |
Llama 3.1 | 8B | 4.7GB | ollama run llama3.1 |
Llama 3.1 | 405B | 231GB | ollama run llama3.1:405b |
Phi 4 | 14B | 9.1GB | ollama run phi4 |
Phi 4 Mini | 3.8B | 2.5GB | ollama run phi4-mini |
Mistral | 7B | 4.1GB | ollama run mistral |
Moondream 2 | 1.4B | 829MB | ollama run moondream |
Neural Chat | 7B | 4.1GB | ollama run neural-chat |
Starling | 7B | 4.1GB | ollama run starling-lm |
Code Llama | 7B | 3.8GB | ollama run codellama |
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
LLaVA | 7B | 4.5GB | ollama run llava |
Granite-3.3 | 8B | 4.9GB | ollama run granite3.3 |
Llama 3.1 8B and DeepSeek-R1 7B are two of the models I consider best for most setups. They run on 8 GB of VRAM, are only around 5 GB in size, and give decent response quality. Obviously you should not expect output of the same quality as a 400B model, but in my view this is the bare minimum parameter count for relatively natural output, rather than the distinctly robotic feel of 1B or 3B models.
Interestingly, DeepSeek-R1 7B (distilled from Qwen 7B) actually performs better than DeepSeek-R1 8B (distilled from Llama 3 8B).
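As a minimal example for an 8 GB VRAM card (model tags as listed in the table above; the sizes reported by `ollama list` may differ slightly from the table):

```shell
ollama pull llama3.1        # 8B, roughly 4.7 GB download
ollama pull deepseek-r1     # 7B distill, roughly 4.7 GB download
ollama list                 # verify the downloaded models and their sizes
ollama run deepseek-r1 "Explain the difference between VRAM and system RAM in two sentences."
```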