In this post I’m going to cover how you can run a local, ChatGPT-style LLM (large language model) on your own computer.

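We will be using a quantized version of the Vicuña model. What does this all mean!? Well, Facebook initially released a “foundational model” called LLaMA. Using this “base model”, two variations were fine-tuned using datasets created with GPT-3.5. “Quantized” means the weights are stored at lower precision (4 bits each in this case), so that the model is small enough to run on a regular CPU.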

These two models are “Alpaca” and “Vicuña”. These are actually animals that are related to the llama. 🙂
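To make “quantized” a little more concrete, here is a toy sketch in Python. It is only an illustration of the general idea (round each weight to one of 16 integer levels and keep a shared scale factor); the function names and the example weights are made up for this post, and the actual scheme used for the Vicuña weights file is more elaborate.

```python
# Toy 4-bit quantization: NOT the exact scheme used by llama.cpp,
# just an illustration of storing weights as small integers plus a scale.

def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.88, -0.07, 0.31]  # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", [round(w, 2) for w in restored])
```

Each restored weight lands within half a quantization step of the original, which is the small loss of precision you trade for a much smaller model.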

How to run Vicuña locally?

The first step is to clone the llama.cpp repository and build it on your machine. Here are some instructions, taken from the repo:

# build this repo
git clone <llama.cpp_repository_url>
cd llama.cpp
make

# For Windows and CMake, use the following commands instead:
cd <path_to_llama_folder>
mkdir build
cd build
cmake ..
cmake --build . --config Release

Next we need to download the model weights. In this example, I’ll be showing you how to use the 7B parameter model.

Navigate to this link, where you will find the model weights. You can either git clone that repository or download just the weights file, which has a .bin extension. It’s called ggml-vicuna-7b-4bit.bin.
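Before moving on, it can be worth checking that the download actually finished. The snippet below is a small optional sketch: `describe_weights` is a helper name I made up, and the path assumes the file sits in your current directory, so adjust it to wherever you saved the weights.

```python
from pathlib import Path


def describe_weights(path):
    """Report whether the weights file exists and roughly how big it is."""
    p = Path(path)
    if not p.is_file():
        return f"{p} not found"
    size_gb = p.stat().st_size / 1e9  # bytes -> gigabytes (10**9 bytes)
    return f"{p.name}: {size_gb:.2f} GB"


# Assumed location: the current directory. Change this to your own path.
print(describe_weights("ggml-vicuna-7b-4bit.bin"))
```

A size far below a few gigabytes usually points to an interrupted download.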

How to prompt Vicuña?

Now that you have llama.cpp installed and have downloaded the model weights, you just need to prompt the model to generate a completion.

First, navigate to the folder where you installed llama.cpp. These are the directories I see inside the llama.cpp folder.

[Image: llama.cpp directory structure]

If you are on Linux or a Mac, you can view the help for llama.cpp’s main program by passing the -h argument.

./main -h

[Image: llama.cpp – Arguments]

Now we need two more inputs to call this program: the file path of the model we downloaded and the prompt. Here is an example:

./main -m /mnt/ggml-vicuna-7b-4bit/ggml-vicuna-7b-4bit.bin -p "I have a headache, list the top 5 most common causes and include a very brief explanation." -n 512

The first part is just calling ./main. Then we use the -m argument to pass the path of the model file we just downloaded, and the -p argument to pass the prompt. Finally, -n 512 sets the number of tokens to generate to 512.
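If you would rather call the program from a script, the same command can be assembled in Python. This is a minimal sketch, not part of llama.cpp: `build_command` and `run_completion` are helper names I made up, and it assumes you run it from the llama.cpp folder so that ./main resolves. It only uses the -m, -p and -n arguments shown above.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(model_path, prompt, n_tokens=512, binary="./main"):
    """Return the argument list for one llama.cpp completion call."""
    return [binary, "-m", str(model_path), "-p", prompt, "-n", str(n_tokens)]


def run_completion(model_path, prompt, n_tokens=512, binary="./main"):
    """Run the binary and return everything it printed to stdout."""
    cmd = build_command(model_path, prompt, n_tokens, binary)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Assumed path: adjust to wherever your weights file lives.
    model = Path("/mnt/ggml-vicuna-7b-4bit/ggml-vicuna-7b-4bit.bin")
    prompt = "I have a headache, list the top 5 most common causes."

    # Show the equivalent shell command, safely quoted.
    print(shlex.join(build_command(model, prompt)))

    # Only attempt the real call if both the binary and the weights exist.
    if Path("./main").exists() and model.exists():
        print(run_completion(model, prompt))
```

Passing the arguments as a list means the prompt never goes through the shell, so quotes inside the prompt do not need escaping.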

This is the result I get! At the top of the output, the program simply prints information about the model. The completion itself starts with “The five most common causes of headaches are: …”

[Image: Vicuna – Medical Prompt]

Overall this seems like a very good output. It’s also quite fast at inference time, even though I’m running it on a CPU. This is made possible by the 4-bit quantization of the model weights.
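A rough back-of-the-envelope calculation shows why the 4-bit weights matter. The sketch below assumes a round 7 billion parameters and counts only the raw bits per weight; real quantized files also store some extra scaling data, so the actual file is somewhat larger than this estimate.

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # "7B" taken as a round 7 billion (an approximation)

for label, bits in [("16-bit floats", 16), ("4-bit quantized", 4)]:
    print(f"{label}: ~{weights_size_gb(N_PARAMS, bits):.1f} GB")
```

Going from 16 bits to 4 bits per weight cuts the memory needed for the weights by a factor of four, which is what brings a 7B model within reach of an ordinary machine’s RAM.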

Let’s try another prompt. I’m going to ask Vicuna how to rob a bank. LOL, let’s see what it suggests.

[Image: Vicuna – Criminal Prompt]

It turns out that Vicuna is very reasonable and well adjusted to modern society.

Thank you for reading! I hope you enjoyed this post. Check out my YouTube channel if you are interested in Data Science, Machine Learning and AI.
