How to Set Up a Local LLM – Part #1

By: Husam Yaghi

As an AI enthusiast and developer, I recently embarked on an exciting journey to install and fine-tune a local Large Language Model (LLM) on my Windows PC. This project, which took several weeks to perfect, was both challenging and incredibly rewarding. In this blog post, I’ll share my experience and the code that finally worked for me, hoping to help others who wish to set up their own local LLM.

Let’s walk through all the steps:

1.  Download & Install the following applications:
 
2.  Open a Windows command prompt

Set up a virtual environment for the project:

python -m venv myproject_env     

Activate this virtual environment: 

myproject_env\Scripts\activate

Install prerequisites:

cd C:\Local_LLM\GPT2-Medium

pip install --upgrade transformers

pip install sentencepiece protobuf

pip install SpeechRecognition

pip install transformers torch numpy scikit-learn

pip install torch transformers datasets PyPDF2 python-docx pandas python-pptx pyttsx3

pip install PyAudio

pip install huggingface_hub

 
Configuration

Create a config.ini file

Click HERE for the complete code
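
The linked code has the full details; as a rough idea of what goes into it, here is a minimal sketch of a config.ini and how to read it with Python's built-in configparser. The section and key names below are illustrative assumptions, not the exact keys from my script.

import configparser

# Hypothetical config.ini layout (section and key names are illustrative):
#
#   [paths]
#   model_dir   = C:\Local_LLM\GPT2-Medium
#   dataset_dir = C:\Local_LLM\yaghiDataSet
#
#   [generation]
#   max_new_tokens = 200
#   temperature    = 0.7

config = configparser.ConfigParser()
config.read("config.ini")

model_dir = config.get("paths", "model_dir")
dataset_dir = config.get("paths", "dataset_dir")
max_new_tokens = config.getint("generation", "max_new_tokens")
temperature = config.getfloat("generation", "temperature")

print(f"Model directory:   {model_dir}")
print(f"Dataset directory: {dataset_dir}")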

 
Data Preparation
    • Prepare your local dataset directory: “C:\Local_LLM\yaghiDataSet”
    • Copy your files (PDF, DOCX, PPTX, TXT) into this dataset directory
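
To show what this step feeds into, here is a minimal text-extraction sketch using PyPDF2, python-docx, and python-pptx (all installed earlier). The helper name and the simple handling of each file type are my own illustration, not the full data-preparation code.

import os
from PyPDF2 import PdfReader
from docx import Document
from pptx import Presentation

DATASET_DIR = r"C:\Local_LLM\yaghiDataSet"

def extract_text(path: str) -> str:
    """Return plain text from a PDF, DOCX, PPTX, or TXT file."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if ext == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if ext == ".pptx":
        texts = []
        for slide in Presentation(path).slides:
            for shape in slide.shapes:
                if shape.has_text_frame:
                    texts.append(shape.text_frame.text)
        return "\n".join(texts)
    if ext == ".txt":
        with open(path, encoding="utf-8", errors="ignore") as f:
            return f.read()
    return ""  # unsupported extension

# Collect one text blob per file in the dataset directory
documents = {
    name: extract_text(os.path.join(DATASET_DIR, name))
    for name in os.listdir(DATASET_DIR)
}
print(f"Loaded {len(documents)} documents")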
 
Fine-tuning the Model

Before we can use our local LLM, we need to fine-tune it on our dataset and use an optimization tool. Let’s look at some key parts of the finetune.py script:

Click HERE for the complete code
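
The complete finetune.py is behind the link above; as a rough sketch of the kind of fine-tuning loop it performs, the snippet below trains gpt2-medium on a plain-text corpus with the Hugging Face Trainer. The file paths, hyperparameters, and the line-by-line dataset format are assumptions for illustration only.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2-medium"
TRAIN_FILE = r"C:\Local_LLM\yaghiDataSet\corpus.txt"   # hypothetical combined text file
OUTPUT_DIR = r"C:\Local_LLM\GPT2-Medium\fine_tuned"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token              # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Treat each line of the corpus as one training example
dataset = load_dataset("text", data_files={"train": TRAIN_FILE})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),   # mixed precision only when a GPU is present
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)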

Loading and Using the Fine-tuned Model
    • Create a loading script load.py
    • Implement the various modules you may require: question, answer, summary, follow-up, voice, history, etc.
    • Implement a function for vectorization and cosine similarity for finding relevant context
    • Create the GUI for interacting with the model
    • In the GUI, add buttons to reflect the functions you implemented

Click HERE for the complete code
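
Again, the full load.py is linked above. As a simplified sketch of the retrieval-plus-generation idea, the snippet below builds TF-IDF vectors over the prepared documents, picks the most similar one as context, and asks the fine-tuned model; the sample documents, prompt format, and variable names are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = r"C:\Local_LLM\GPT2-Medium\fine_tuned"   # hypothetical fine-tuning output

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)

# Replace these samples with the {filename: text} dict built during data preparation
documents = {
    "notes.txt": "Sample text about my project and its goals.",
    "report.txt": "Another document describing the dataset in more detail.",
}
doc_texts = list(documents.values())

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(doc_texts)

def answer(question: str, max_new_tokens: int = 150) -> str:
    # Pick the document most similar to the question as context
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    context = doc_texts[scores.argmax()][:1500]        # keep the prompt short

    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            pad_token_id=tokenizer.eos_token_id)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(answer("What topics does my dataset cover?"))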

 
A GUI window should open
    • Type in your prompt
    • Click on “Ask”
    • Check out the response
    • You can click on “Follow up” for more information
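
To illustrate the GUI side, here is a minimal Tkinter sketch that wires an “Ask” and a “Follow up” button to an answer() helper (a stub stands in for the real one); the layout and follow-up behaviour are simplified, not the exact code behind the link above.

import tkinter as tk
from tkinter import scrolledtext

# Stub standing in for the retrieval+generation helper sketched earlier;
# swap in the real answer() from load.py.
def answer(question: str) -> str:
    return f"(model response to: {question})"

def on_ask():
    question = prompt_entry.get()
    response_box.insert(tk.END, f"Q: {question}\nA: {answer(question)}\n\n")

def on_follow_up():
    # A naive follow-up: ask the model to elaborate on the current prompt
    question = "Tell me more about: " + prompt_entry.get()
    response_box.insert(tk.END, f"Follow up:\n{answer(question)}\n\n")

root = tk.Tk()
root.title("Local LLM Chat")

prompt_entry = tk.Entry(root, width=80)
prompt_entry.pack(padx=10, pady=5)

button_row = tk.Frame(root)
button_row.pack(pady=5)
tk.Button(button_row, text="Ask", command=on_ask).pack(side=tk.LEFT, padx=5)
tk.Button(button_row, text="Follow up", command=on_follow_up).pack(side=tk.LEFT, padx=5)

response_box = scrolledtext.ScrolledText(root, width=90, height=25)
response_box.pack(padx=10, pady=10)

root.mainloop()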

 
Conclusion

Setting up a local LLM and fine-tuning it for specific tasks is a complex but rewarding process. Through this project, I’ve gained valuable insights into the workings of language models, data processing, and creating interactive AI applications. I have experimented with several models: Mistral-7B-Instruct-v0.3-GGUF, glm-4vq, gpt2-medium, Llama-2-7b-chat-hf, XLM-RoBERTa, and others. These models differ significantly in their output, but the specifications of your local hardware (desktop, laptop, or cloud) also play a major role in determining which model to use.

I encourage readers to experiment with these concepts and adapt them to their own projects. Remember to respect licensing terms for any pre-trained models or libraries you use, and always be mindful of the ethical implications of AI applications.

 

General Concepts Clarification:
  • Large Language Models (LLMs):
    Large Language Models, or LLMs, are advanced AI systems trained on vast amounts of text data. They can understand and generate human-like text. Think of them as very sophisticated autocomplete systems that can not only predict the next word but understand context, answer questions, and even perform complex language tasks. Popular examples include GPT-3, BERT, and GPT-2, which we’re using in this project.
  • Fine-tuning:
    Fine-tuning is the process of taking a pre-trained model (like GPT-2) and further training it on a specific dataset for a particular task. It’s like teaching a general knowledge expert to become a specialist in your field. This process allows the model to learn the nuances and specifics of your data, improving its performance on your particular use case.
  • CUDA and GPU Acceleration:
    CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by NVIDIA. When we talk about GPU acceleration, we’re referring to the use of a graphics processing unit (GPU) to speed up computations. GPUs are particularly good at handling the types of calculations needed for training and running neural networks, making the process much faster than using a CPU alone.
  • TF-IDF Vectorization:
    TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic used to reflect how important a word is to a document in a collection. The TF-IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the entire corpus. This helps to adjust for the fact that some words appear more frequently in general.
  • Cosine Similarity:
    Cosine similarity is a measure of similarity between two non-zero vectors. In the context of our project, we use it to compare the similarity between the question vector and our document vectors. It measures the cosine of the angle between two vectors, providing a similarity score between -1 and 1. A score of 1 means the vectors are identical, 0 means they’re perpendicular (completely dissimilar), and -1 means they’re diametrically opposed. A short worked example using scikit-learn appears after this list.
  • Named Entity Recognition (NER):
    Named Entity Recognition is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories such as person names, organizations, locations, etc. In our project, we use NER to identify specific entities in questions, which helps in finding more relevant information in our dataset.
  • Tokenization:
    Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, characters, or subwords. For example, the sentence ‘I love AI!’ might be tokenized into [‘I’, ‘love’, ‘AI’, ‘!’]. Tokenization is a crucial step in natural language processing as it converts text into a format that machine learning models can understand and process.
  • Transformers:
    Transformers are a type of deep learning model architecture that has revolutionized natural language processing. Unlike traditional sequential models, transformers can process all parts of the input data simultaneously, making them particularly effective for understanding context in language. The GPT (Generative Pre-trained Transformer) models we’re using are based on this architecture.
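
To make the TF-IDF and cosine-similarity ideas above concrete, here is a small self-contained example with scikit-learn; the sentences are invented purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The GPU accelerates training of the language model.",
    "Fine-tuning adapts a pre-trained model to your own data.",
    "Cats sleep for most of the day.",
]
question = "How do I adapt a pre-trained model to my dataset?"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)       # one TF-IDF vector per document
question_vector = vectorizer.transform([question])

# Cosine similarity between the question and each document (values stay in [0, 1] here,
# since TF-IDF vectors have no negative components)
scores = cosine_similarity(question_vector, doc_vectors)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")

print("Most relevant:", documents[scores.argmax()])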

 

It is worth mentioning that I experimented with various options (features):

  1. In one script, I added support for voice interactions (voice input prompt & voice output response); a minimal sketch of this option follows this list.
  2. I also added an option to instruct the script to obtain the response from a public LLM instead of the local one.
  3. There are plenty of ideas to keep you busy for a long time; you will not want to leave your chair while perfecting the script.
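
For the voice option in point 1, here is a minimal sketch using SpeechRecognition and pyttsx3 (both installed earlier); it only shows the listen-and-speak round trip, with a placeholder where the local model's answer would go.

import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
tts_engine = pyttsx3.init()

def listen() -> str:
    """Capture one spoken prompt from the default microphone."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    # recognize_google uses Google's free web API; offline backends also exist
    return recognizer.recognize_google(audio)

def speak(text: str) -> None:
    """Read the response aloud."""
    tts_engine.say(text)
    tts_engine.runAndWait()

question = listen()
print("You asked:", question)
# Pass 'question' to the local model here (e.g. the answer() helper sketched
# earlier); a fixed string keeps this snippet runnable on its own.
speak("This is where the local model's answer would be spoken.")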