Run ChatGPT… on your Gaming PC? Part 1

Rahul Powar
6 min read · Feb 11, 2023

It’s clickbait, I apologize. OpenAI’s models are not in the public domain, so while they are now easier to use and more available than ever before, this is not something you can ‘download’ and ‘run’. There are reasons why this would be desirable that I will get to later. And while you can’t do this with ChatGPT, there is something on the horizon that may open up options for state-of-the-art generative language models.

First, a bit of background. I was lucky enough to get access to the closed beta of GPT-3 on the 25th of July 2020 and it blew my mind. I spent many nights discovering the subtleties of what is now called prompt engineering and as API access was unmetered, I experimented with every use case I could think of. I created a bot that used tricks with dynamic temperature estimation powered by GPT-3 (I called it “GPT-3 Inception” in my user group feedback session) and put a Slack interface on it. It produced some of the most entertaining exchanges I have ever had with a machine, dynamically switching between fact and humor.

Tell me about particle physics, but make me laugh because it’s 2am. A 2020 vintage.

As expected, GPT-3 was made commercially available, and my toys would now bust my monthly token credits in a few interactions. I was busy with my day job running my SaaS business, so I reluctantly turned my bot off and started using GPT-3 for some more specific NLP tasks.

GitHub Copilot launched later, packaging a fine-tuned model for developers to help them write code faster. It was incredible, but only for those who write code, which remains a vanishingly small part of the population. Then ChatGPT arrived and blew up. It’s GPT-3.5 with some tuning that makes it less general purpose but a great out-of-the-box Q&A solution. Many of our human problems can be solved through chat, so that limitation is not a huge issue for its general usefulness, and it made a large number of people realize just how useful AI can be. For the more thoughtful among us, it raises the question of what this next industrial revolution means for knowledge work and the way western society and its economic activity are structured today. But back to technology: while its utility is shocking to people who have never touched modern AI directly, it’s not really conceptually new and, more practically, it does not let me resurrect my personal AI.

There is a profound benefit to open innovation and the ability for the curious and the capable to lift the hood and tinker. After GPT-3, a number of open-source massive language models arrived on the scene, including BLOOM from the BigScience project and OPT from Meta.

An excellent bake-off of the four big transformer models

These new models showed promise, but therein lies a new problem. Even if you know enough to spin up these models and run them, you need serious compute resources at your disposal. Your current-gen laptop is not going to cut it. Many people know that these models are very large; that refers to the number of ‘parameters’ these models use to encode all of human knowledge. All four of them are on the order of 175 billion parameters. There are smaller models available, but they do not offer the incredible performance of the full-size models. There are two implications that stem from this very large size:

a) They are incredibly expensive to train. When GPT-3 came out, it was trained on a dataset that didn’t know about COVID because it had not happened yet. It took a while for the model to be retrained, likely because it cost around $4.5M to train at the time (I sketch the arithmetic after this list). More recently, the team behind BLOOM estimated it cost €3M. It’s great that the BLOOM team and Meta footed the bill, but this huge size leads to a second problem.

b) Inference, that is, actually using the model, also requires a lot of capacity. The way these models work, they need to be loaded into memory, and GPUs, the specialized graphics processors normally used for games, need to perform billions of matrix operations to generate the text responses we finally see. GPUs operate on their own dedicated memory in most PCs, and a consumer GPU is far too small to hold these 175 billion parameters (the memory math is sketched below). If I want to run my own BLOOM or OPT model, I need to get hold of a datacenter GPU that comes with tons of memory for just this task. State of the art would be an NVIDIA A100 with 80GB of memory. Unfortunately, that is $50,000 for a GPU. In addition, 80GB is not enough: I will need around 8 of them, plus a server that can host that many high-bandwidth devices. Let’s say a personal capex of half a million dollars if I really want it in my server rack and am willing to cut a few corners. I could of course spin them up in the cloud, assuming I can get these A100 80GBs (I hear they are only available on Azure), but at that point I am going to be paying more than a metered GPT-3.
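To get a feel for where the training numbers in (a) come from, here is a back-of-envelope estimate, a sketch of my own rather than anything published by OpenAI or BigScience. It uses the common approximation that training compute is about 6 FLOPs per parameter per token; the parameter and token counts are GPT-3’s published figures, while the throughput, utilization, and hourly price are my assumptions.

```python
# Back-of-envelope training cost for a GPT-3-class model.
# Approximation: total training compute ≈ 6 × parameters × tokens.
params = 175e9               # GPT-3 parameter count
tokens = 300e9               # GPT-3 training tokens
flops = 6 * params * tokens  # ≈ 3.15e23 FLOPs

peak = 312e12                # assumed A100 fp16 tensor-core peak, FLOPS
utilization = 0.30           # assumed realistic fraction of peak
gpu_hours = flops / (peak * utilization) / 3600

price = 2.0                  # assumed cloud price in $/GPU-hour
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price:,.0f}")
# → roughly 0.9M GPU-hours and ~$1.9M: the same order of magnitude
# as the published estimates above.
```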
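And for (b), the sizing rule of thumb is two bytes per parameter in fp16, plus some working overhead; the 20% overhead factor below is my own rough assumption.

```python
# GPU memory needed just to hold the weights at inference time.
def weights_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """fp16 weights (2 bytes/param) plus ~20% assumed overhead."""
    return params_billions * bytes_per_param * overhead

for name, size in [("GPT-3 / OPT-175B", 175), ("BLOOM", 176)]:
    gb = weights_gb(size)
    print(f"{name}: ~{gb:.0f} GB = {gb / 80:.1f} x A100-80GB")
# → ~420 GB, i.e. five to six 80GB cards before leaving any room
# for activations, which is why a full 8-GPU server is the sane build.
```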

One might reasonably ask why I would want to do this. There are three themes: curiosity, security, and performance.

Like many people, I get great satisfaction from being able to pop the hood on things that look like magic and understand them. At a more specific level, if I want to build a real AI for use by myself and my family, I don’t want our data to leave my network; we don’t have a retail smart speaker in the house for this reason. Running locally also holds the potential for the next step up in human/machine interaction: local models with dedicated computation can turn results around faster, and if you want a natural conversation, you need speed of response. These models can also be chained together to do more sophisticated things. If I want to live in the future, I’ve got to get me one of those.

At some point, new GPUs or dedicated novel architectures will emerge to solve this. But for the impatient, there is a new kid on the block that sidesteps the issue. Enter RWKV-LM.

The models discussed are massive because of their design. They are all transformers, the current hotness in deep learning: great performance, but prohibitive cost in this natural language use case, since a transformer attends over the entire context window at every step, so inference cost grows with context length as well as parameter count.

There is, however, an older style of generative AI: the recurrent neural network. RWKV-LM brings this older architecture up to date and carries along the benefit of size: it’s a LOT smaller. The largest model trained today is 14 billion parameters, and it’s pretty impressive.
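To make the architectural difference concrete, here is a toy sketch, purely illustrative and nothing like RWKV’s real implementation, of why an RNN is cheap at inference time: it carries a fixed-size state from token to token instead of re-attending over the whole history.

```python
import numpy as np

# Toy RNN-style generation loop. The point: each step touches only a
# fixed-size hidden state, so cost per token is O(1) in context length,
# whereas a transformer re-attends over all previous tokens.
# All sizes and weights here are made up for illustration.
rng = np.random.default_rng(0)
vocab, hidden = 100, 64
Wx = rng.normal(0, 0.1, (hidden, vocab))   # input -> hidden
Wh = rng.normal(0, 0.1, (hidden, hidden))  # hidden -> hidden (the recurrence)
Wo = rng.normal(0, 0.1, (vocab, hidden))   # hidden -> logits

state = np.zeros(hidden)
token = 0
for _ in range(10):
    x = np.zeros(vocab)
    x[token] = 1.0                          # one-hot current token
    state = np.tanh(Wx @ x + Wh @ state)    # update the fixed-size state
    token = int(np.argmax(Wo @ state))      # greedy next-token pick
    print(token, end=" ")
```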

Now, the comparison above is against lower-end transformers, so there isn’t an expectation that it is going to beat state-of-the-art. However, at 14 billion parameters, this can start to fit on a consumer-grade high-end PC, a.k.a. my gaming PC. Did it work? 100%.
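The same two-bytes-per-parameter arithmetic from earlier shows why 14 billion is the magic number; the card sizes below are just examples of current consumer hardware.

```python
# Why 14B parameters fits on a gaming PC while 175B does not.
params_billions = 14
fp16_gb = params_billions * 2   # ~28 GB in fp16
int8_gb = params_billions * 1   # ~14 GB quantized to 8-bit

print(f"fp16: ~{fp16_gb} GB, int8: ~{int8_gb} GB")
# A 24GB consumer card (e.g. an RTX 3090/4090) holds the 8-bit version
# outright, or fp16 with some layers offloaded to system RAM.
```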

Short-form Q&A.

Q: What is the meaning of life?

A: That’s a hard one to answer.

Q: What’s the secret to a good marriage?

A: To do your own thing.

Q: How do you get rid of a headache?

A: By getting up and moving around.

Q: What is the most important thing in life?

A: Relationships.

Ideation.

Prompt: Here’s a short cyberpunk sci-fi adventure story.

The Story:

RWKV-LM: Our World Is Not the Whole Story

Artist Brian decided to escape from his artificial world.

This short story explores the idea that our world is only one of many realities in which we live. This world is not the whole story.

I decided to explore the idea that our reality is not the whole story. I am not suggesting that this story is a serious reflection of our world. It is, instead, a lighthearted fantasy in which Brian tries to escape the artificial world he lives in.

I have always been fascinated by the idea that our world is only one of many realities that make up the universe. I think that this is a concept that is universal to our species.

The smaller 7-billion-parameter model is available online on Hugging Face.

Want to run it on your own PC? Python instructions in Part 2.
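As a preview of what Part 2 covers, the shape of it looks roughly like the sketch below. This is written from memory of the community rwkv pip package’s interface, not a verified recipe; the checkpoint path, tokenizer file, and strategy string are all assumptions, and Part 2 walks through the real setup.

```python
# A rough sketch (unverified; see Part 2 for the real walkthrough) of
# loading an RWKV checkpoint with the community `rwkv` pip package.
# The file paths and strategy string below are assumptions.
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

model = RWKV(model="RWKV-4-Pile-7B.pth",   # assumed local checkpoint path
             strategy="cuda fp16")          # run on the GPU in fp16
pipeline = PIPELINE(model, "20B_tokenizer.json")  # assumed tokenizer file

out = pipeline.generate(
    "Q: What is the meaning of life?\nA:",
    token_count=64,
    args=PIPELINE_ARGS(temperature=1.0, top_p=0.85),
)
print(out)
```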
