Running Large Language Models on a VPS

This post documents how to run LLMs on a 1C1G (1 core, 1 GB RAM) VPS using Ollama.

I purchased a 1C1G AMD Ryzen 9 7950X VPS during Black Friday; it can just barely run LLMs. Here's how to quickly install and run a small model on such a VPS.

Configuration

Hardware Configuration

---------------------Basic Information Query---------------------
 CPU Model         : AMD Ryzen 9 7950X 16-Core Processor
 CPU Cores         : 1
 CPU Frequency     : 4491.540 MHz
 CPU Cache         : L1: 64.00 KB / L2: 512.00 KB / L3: 16.00 MB
 AES-NI Instruction Set: ✔ Enabled
 VM-x/AMD-V Support: ✔ Enabled
 Memory            : 90.74 MiB / 960.70 MiB
 Swap              : 0 KiB / 2.00 MiB
 Disk Space        : 1.12 GiB / 14.66 GiB
---------------------CPU Test-------------------------
 -> CPU Test (Fast Mode, 1-Pass @ 5sec)
 1 Thread Test (Single-Core): 6402 Scores
---------------------Memory Test-----------------------
 -> Memory Test (Fast Mode, 1-Pass @ 5sec)
 Single Thread Read Test: 75694.60 MB/s
 Single Thread Write Test: 42458.49 MB/s

Software Configuration

  1. Inference engine: Ollama, running pure CPU inference (the VPS has no GPU).
  2. Model: Qwen2.5-0.5B in its Q4 quantized version, under 400 MB on disk, so it fits within 1 GB of RAM (a quick pre-flight check is sketched below).
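
Before pulling the model, it is worth confirming the box actually has the headroom. A minimal pre-flight check using standard utilities (nothing Ollama-specific; the thresholds are rough estimates):

# Free memory: the Q4 0.5B model needs roughly 400 MB plus runtime overhead
free -h

# Free disk: the Ollama install plus the model weights take a bit over 1 GB
df -h /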

Ollama

Installing and Running the Model

curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen2.5:0.5b
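
Besides the interactive CLI, the Ollama service listens on 127.0.0.1:11434 and exposes an OpenAI-compatible API (the same endpoint the benchmark below targets). A minimal sketch of querying it with curl, assuming the default listen address and the model tag pulled above:

curl http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:0.5b",
        "messages": [{"role": "user", "content": "hello, who are you?"}]
      }'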

Conducting a Conversation

>>> hello, who are you?
I am Qwen, an AI language model developed by Alibaba Cloud. I was trained using millions of natural language processing (NLP) examples from the internet and my responses are generated through advanced neural network algorithms. My primary goal is to assist with tasks such as text generation, summarization, answering questions, and more. If you have any questions or need further clarification on a topic, feel free to ask!

To exit the conversation, type /bye.

>>> /bye
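
For scripting, ollama run also accepts a prompt as a command-line argument and exits after printing the answer, so the interactive session isn't required at all. For example:

ollama run qwen2.5:0.5b "Explain what quantization does to a language model in one sentence."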

Performance Testing

  1. Download the Test Script

    wget https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.1/llmapibenchmark_linux_amd64
    
  2. Set Script Permissions

    chmod +x ./llmapibenchmark_linux_amd64
    
  3. Run the Performance Test

    ./llmapibenchmark_linux_amd64 -base_url="http://127.0.0.1:11434/v1" -concurrency=1,2,4  # -concurrency is optional
    

Example Output

################################################################################################################
                                          LLM API Throughput Benchmark
                                    https://github.com/Yoosu-L/llmapibenchmark
                                         Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms

| Concurrency | Generation Throughput (tokens/s) |  Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
|           1 |                            31.88 |                        976.60 |         0.05 |         0.05 |
|           2 |                            30.57 |                        565.40 |         0.07 |         0.16 |
|           4 |                            31.00 |                        717.96 |         0.11 |         0.25 |
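
While the benchmark is running, it is easy to see how hard the single core is being pushed by watching a process monitor in a second SSH session; any standard tool works, for example:

# Watch CPU and memory usage of the ollama process during the benchmark
top -d 1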

Uninstall Ollama (if no longer needed)

# Stop Ollama service:
sudo systemctl stop ollama

# Disable Ollama service:
sudo systemctl disable ollama

# Remove Ollama service file:
sudo rm /etc/systemd/system/ollama.service

# Remove Ollama binary files:
sudo rm /usr/local/bin/ollama
# sudo rm /usr/bin/ollama
# sudo rm /bin/ollama
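
The install script also creates a dedicated ollama user and group and stores downloaded model weights under /usr/share/ollama; per the official uninstall instructions, removing those reclaims the remaining disk space:

# Remove downloaded models and the service user/group created by the installer
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama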

Disclaimer

This tutorial is for entertainment purposes only. A 0.5B model is not suitable for production use, and inference will saturate the single CPU core and consume significant memory bandwidth, potentially degrading performance for neighboring tenants on the same host.