Running Large Language Models on a VPS
This post documents how to run LLMs on a 1C1G (one CPU core, 1 GB of RAM) VPS using Ollama.
I purchased a 1C1G AMD Ryzen 9 7950X VPS during Black Friday, and it can just barely run an LLM. Here's how to quickly install and run a model on such a VPS.
Configuration
Hardware Configuration
---------------------Basic Information Query---------------------
CPU Model : AMD Ryzen 9 7950X 16-Core Processor
CPU Cores : 1
CPU Frequency : 4491.540 MHz
CPU Cache : L1: 64.00 KB / L2: 512.00 KB / L3: 16.00 MB
AES-NI Instruction Set: ✔ Enabled
VM-x/AMD-V Support: ✔ Enabled
Memory : 90.74 MiB / 960.70 MiB
Swap : 0 KiB / 2.00 MiB
Disk Space : 1.12 GiB / 14.66 GiB
---------------------CPU Test-------------------------
-> CPU Test (Fast Mode, 1-Pass @ 5sec)
1 Thread Test (Single-Core): 6402 Scores
---------------------Memory Test-----------------------
-> Memory Test (Fast Mode, 1-Pass @ 5sec)
Single Thread Read Test: 75694.60 MB/s
Single Thread Write Test: 42458.49 MB/s
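The figures above come from a third-party benchmark script. If you only want to confirm your own VPS's specs before installing anything, the standard utilities below (assuming a typical Linux image) report the same basics:

# CPU model, core count, frequency and cache sizes
lscpu

# Memory and swap usage, human-readable
free -h

# Free disk space on the root filesystem
df -h /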
Software Configuration
- Inference Engine: Ollama, running pure CPU inference (no GPU on this VPS).
- Model Selection: Qwen2.5-0.5B in a Q4-quantized build; at under 400 MB on disk, it leaves enough headroom for the KV cache and runtime overhead within 1 GB of memory.
Ollama
Installing and Running the Model
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:0.5b
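On most distros the install script also registers Ollama as a systemd service (which is why the uninstall section below uses systemctl). Before opening a chat, you can confirm the daemon is up and pre-pull the model; a quick sketch:

# Confirm the Ollama daemon is running
systemctl status ollama

# Download the model without starting an interactive session
ollama pull qwen2.5:0.5b

# List local models and their on-disk sizes
ollama list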
Conducting a Conversation
>>> hello, who are you?
I am Qwen, an AI language model developed by Alibaba Cloud. I was trained using millions of natural language processing (NLP) examples from the internet and my responses are generated through advanced neural network algorithms. My primary goal is to assist with tasks such as text generation, summarization, answering questions, and more. If you have any questions or need further clarification on a topic, feel free to ask!
To exit the conversation, type /bye.
>>> /bye
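Besides the interactive CLI, the same model is reachable over Ollama's OpenAI-compatible HTTP API on port 11434, which is the endpoint the benchmark below points at. A minimal curl example:

curl http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:0.5b",
    "messages": [{"role": "user", "content": "hello, who are you?"}]
  }'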
Performance Testing
- Download the Test Script:
wget https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.1/llmapibenchmark_linux_amd64
- Set Script Permissions:
chmod +x ./llmapibenchmark_linux_amd64
- Run the Performance Test:
./llmapibenchmark_linux_amd64 -base_url="http://127.0.0.1:11434/v1" -concurrency=1,2,4 # the -concurrency flag is optional
Example Output
################################################################################################################
LLM API Throughput Benchmark
https://github.com/Yoosu-L/llmapibenchmark
Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
| 1 | 31.88 | 976.60 | 0.05 | 0.05 |
| 2 | 30.57 | 565.40 | 0.07 | 0.16 |
| 4 | 31.00 | 717.96 | 0.11 | 0.25 |
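Note that with a single CPU core, generation throughput stays essentially flat at around 31 tokens/s from concurrency 1 to 4, since decoding is already compute-bound. As a rough cross-check of that figure, Ollama's CLI can print per-response timing statistics; assuming your Ollama build supports the --verbose flag, it reports prompt eval and generation rates directly:

# Prints load time, prompt eval rate and eval (generation) rate after each response
ollama run qwen2.5:0.5b --verbose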
Uninstall Ollama (if no longer needed)
# Stop Ollama service:
sudo systemctl stop ollama
# Disable Ollama service:
sudo systemctl disable ollama
# Remove Ollama service file:
sudo rm /etc/systemd/system/ollama.service
# Remove Ollama binary files:
sudo rm /usr/local/bin/ollama
# If the binary was installed to a different location, remove it from there instead:
# sudo rm /usr/bin/ollama
# sudo rm /bin/ollama
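The steps above remove the service and the binary but leave the downloaded models and the service account behind. For a complete cleanup, the official uninstall instructions also remove the model store and the ollama user/group; the paths below assume the default script install and may differ on your system:

# Remove downloaded models and the ollama service account
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama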
Disclaimer
This tutorial is for entertainment purposes only. A 0.5B model is not suitable for production use, and inference will consume significant CPU and memory bandwidth, potentially affecting the performance of neighboring tenants on the same host.