Run ChatGPT… on your Gaming PC? Part 2

Rahul Powar
5 min read · Feb 14, 2023

In Part 1, I provided the backstory to my desire to run large language models on relatively inexpensive local hardware. In this part, let’s drop the chat and get to code. I outline some basic instructions and snippets of code that let us install and spin up a model that can produce near-real-time completions on a typical 2022-spec gaming PC.

Sample Q/A output.

Prerequisites

To make this work, you need a reasonably current NVIDIA graphics card, a functional Python installation that can use it, and some comfort with the shell terminal of your choice. Specifically:

a) An NVIDIA graphics card (GPU) with at least 8GB of VRAM so we can load a decent model. Unfortunately, most computational use of GPUs relies on CUDA, which is only supported on NVIDIA cards. 8GB should be easy to meet as it has been standard for some time; if your PC is current and has a dedicated GPU, you probably have this.

b) Python 3.9. While newer versions are available, they are not yet well supported by some of the libraries we need to install to get this going. If you don’t have a Python install, on Windows you can install 3.9 from the Microsoft Store or any other way you prefer. On Linux, getting a specific version may need an environment manager like pyenv. If you are running Linux with a serious GPU, you likely know what to do for your distribution.
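
If you are unsure which interpreter a plain python3 resolves to, a quick check from inside the interpreter settles it (nothing here beyond the standard library):

import sys

# the libraries below are happiest on 3.9 at the time of writing
print(sys.version_info)
assert sys.version_info[:2] == (3, 9), "expected Python 3.9"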

c) CUDA allows applications such as deep learning frameworks to use your GPU. You need version 11.7 as that is the version the next package we need currently supports. Install it from here.

d) PyTorch is a machine learning framework. In principle we could run RWKV-LM on other frameworks such as TensorFlow, but PyTorch is quite flexible and relatively easy to set up. In your terminal of choice, run

pip3 install torch torchvision torchaudio \
--extra-index-url https://download.pytorch.org/whl/cu117
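
Once that finishes, it is worth confirming PyTorch can actually see your GPU and the CUDA runtime before going further. A minimal sanity check using only standard torch calls:

import torch

# expect True, "11.7", your card's name and its total VRAM in GB
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0).total_memory / 1024**3)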

e) To make this easy, we are going to use a wrapper called RWKVSTIC. To get it working, we need to install a few more Python packages:

pip3 install inquirer tqdm transformers scipy

then the package itself:

pip3 install rwkvstic

f) Next we need to pick a pre-trained model to use. This is where we make some choices based on the hardware we have available and the performance we want. https://huggingface.co/BlinkDL has a list of models with different numbers of parameters, e.g. rwkv-4-pile-14b is an RWKV-4 model (the architecture we want) trained on the Pile with 14 billion parameters. As this is in active development and training, you should check each model for the current best checkpoint. As of writing, let’s try the 20221110 checkpoint of the 3B model, RWKV-4-Pile-3B-20221110-ctx4096.pth, which fits comfortably on an 8GB card (see the table below). Download it and let’s write a simple app to load it and make our first text generation.
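
If you prefer to script the download rather than click through the site, the huggingface_hub package (installed alongside transformers, or via pip3 install huggingface_hub) can fetch the checkpoint. A sketch; the repo id below is my guess at the 3B repository name, so check the BlinkDL page for the exact one:

from huggingface_hub import hf_hub_download

# repo_id is an assumption, verify it on https://huggingface.co/BlinkDL
path = hf_hub_download(
    repo_id="BlinkDL/rwkv-4-pile-3b",
    filename="RWKV-4-Pile-3B-20221110-ctx4096.pth",
)
print(path)  # local path to the ~6GB checkpoint file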

RWKVSTIC

While you wait, you will note this is a 6GB download. These are the weights we need to load into VRAM, and their size is directly related to the hardware we need to run this live. More parameters in principle means a more capable model, and the rwkvstic package gives us some flexibility when it comes to VRAM usage, but the alternatives usually come at a cost. You can:

a) Split the load across multiple GPUs (if you have them)

b) Stream the model from system RAM (slow, in my testing 30x slower)

c) Run at half resolution (odd results at the time of writing)

These are the current VRAM requirements when running at 16-bit precision, which is a reasonable balance:

Model | VRAM (bf16/fp16)
------|-----------------
14B   | 28GB
7B    | 14GB
3B    | 6GB
1.5B  | 3GB

As you can see, 3 billion parameters is safe for most current GPUs, though if you have a high-end GPU you can try the 7 billion model. Unfortunately, all consumer cards as of Feb 2023 cap out at 24GB of VRAM, which means for the 14 billion model you need to use one of the alternative approaches above, run two cards, or wait for the rumoured NVIDIA TITAN Ada release in a few months with 48GB of VRAM.
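
The table is simply parameter count multiplied by bytes per weight. A quick back-of-the-envelope check (real usage adds a little on top for activations and the tokenizer state):

# approximate VRAM needed just to hold the weights
bytes_per_param = {"fp32": 4, "bf16/fp16": 2}

for params in (14e9, 7e9, 3e9, 1.5e9):
    for name, nbytes in bytes_per_param.items():
        print(f"{params / 1e9:>4.1f}B @ {name}: {params * nbytes / 1e9:.0f} GB")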

qa.py

import torch
from rwkvstic.load import RWKV
from rwkvstic.agnostic.backends import TORCH

# this is the dtype used for trivial operations,
# such as vector->vector operations and is the dtype that will
# determine the accuracy of the model
runtimedtype = torch.float32 # torch.float64, torch.bfloat16

# this is the dtype used for matrix-vector operations,
# and is the dtype that will determine the performance
# and memory/VRAM usage of the model. The alternatives
# will use more memory but may provide better results
dtype = torch.bfloat16 # alternatives are torch.float32, torch.float64

# use your GPU, CPU forward passes are about 30x - 100x slower
useGPU = True

# load the checkpoint we downloaded
model = RWKV(
    "./RWKV-4-Pile-3B-20221110-ctx4096.pth",
    mode=TORCH,
    useGPU=useGPU,
    runtimedtype=runtimedtype,
    dtype=dtype,
)

# configure this as a Q/A bot
question = "Who is Tim Cook?"

# tune the parameters
number = 200 # the number of tokens to generate

stopStrings = ["\n\n"] # when any of these strings appears in the output, the model stops generating
stopTokens = [0] # when the model has generated any of these tokens, it will stop generating output

# we want a factual bot so let's tone down the temperature to get a bland but hopefully correct answer
temp = 0.4 # the temperature of the model, higher values will result in more random output, lower values will result in more predictable output
top_p = 0.8 # the top_p of the model

# context is the magic that sets up the forward pass
model.loadContext(newctx=f"Q: {question}\n\nA:")
answer = model.forward(
    number=number,
    stopStrings=stopStrings,
    stopTokens=stopTokens,
    temp=temp,
    top_p_usual=top_p,
)

print("Q:", question)
print("A:", answer["output"]) # the generated output

$ python3 qa.py
init RWKVOPS, from super
100%|████████████████████| 582/582 [00:00<00:00, 2578.22it/s]
100%|████████████████████| 11/11 [01:50<00:00, 10.04s/it]
Q: Who is Tim Cook?
A: Apple's CEO.

Do I really need a GPU?

Note, you can run this on a CPU backend by changing the useGPU boolean to False, but prepare to wait: a forward pass on an i7-6850K CPU @ 3.60GHz takes around 5 minutes, and expect similar times when running on an Apple M1 Max.
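
If you want to check that number on your own hardware, wrapping the calls from qa.py in a timer is enough. A sketch that assumes the model, question and sampling parameters from the script above are already set up:

import time

start = time.perf_counter()
model.loadContext(newctx=f"Q: {question}\n\nA:")
answer = model.forward(number=number, stopStrings=stopStrings, stopTokens=stopTokens, temp=temp, top_p_usual=top_p)
elapsed = time.perf_counter() - start

# rough tokens/s; generation may stop early on a stop string
print(f"{elapsed:.1f}s total, roughly {number / elapsed:.1f} tokens/s")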

Measuring performance

While it is running, you can view the resource usage on your system to get a feel for the requirements of the model. There are a number of ways to do this, but keeping it in Python land, we can use the glances module with the optional py3nvml package, which gives us GPU usage (windows-curses is only needed on Windows, so drop it elsewhere).

$ pip3 install glances windows-curses py3nvml
$ python3 -m glances
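
If you would rather read the numbers from inside the script itself, PyTorch can report how much VRAM it has allocated. A small sketch using standard torch.cuda calls, run after the model has loaded:

import torch

allocated = torch.cuda.memory_allocated(0) / 1024**3
total = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{allocated:.1f} GB allocated of {total:.1f} GB VRAM")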

Next

While this is 17 lines of real code, most of it is boilerplate and configuration. By changing the loaded model, the temp, and the context passed to model.loadContext, you can get these 17 lines to emit many reasonable and useful things beyond a Q/A bot, e.g. a classifier or a code generator.
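
Turning the script into a crude sentiment classifier, for instance, is mostly a matter of swapping the context and keeping the temperature low. A sketch that replaces the question section of qa.py; the few-shot prompt format is my own, not something prescribed by rwkvstic:

# few-shot prompt: the model is nudged to continue the pattern
review = "The battery died after two days and support never replied."
context = (
    "Review: I love this phone, the camera is amazing.\nSentiment: Positive\n\n"
    "Review: Terrible build quality, it broke within a week.\nSentiment: Negative\n\n"
    f"Review: {review}\nSentiment:"
)

model.loadContext(newctx=context)
result = model.forward(number=5, stopStrings=["\n"], stopTokens=[0], temp=0.2, top_p_usual=0.8)
print(result["output"].strip())  # hopefully: Negative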

This entire project is currently in active development and APIs and checkpoints are updated on a daily basis. It does, however, give a first hint of a viable open alternative to the state-of-the-art models that are deployed behind closed doors.

What a time to be alive.


Rahul Powar

Technologist, Entrepreneur, TCK. Founder & CEO of @redsift. Previously creator of @shazam, VP @thomsonreuters, founder & CEO of @apsmart (acquired 2012).