Channel: Recent Questions - Stack Overflow

Loading "llama-2" 8-bit quantized version onto the GPU

Based on this repo (GitHub Link) I am trying to build a system that answers user queries.

I was able to run the model on a CPU with a response time of ~60 s. Now I want to improve the response time, so I am trying to load the model onto a GPU.

System specs

  • Processor - Intel(R) Xeon(R) Gold 6238R CPU @ 2.20 GHz, 2195 MHz, 2 cores, 2 logical processors, with 24 GB RAM
  • GPU - Nvidia A40-12Q with 12 GB VRAM

So here are my questions:

  1. How do I load Llama 2 (or any model) onto the GPU?
  2. Will loading the model onto a GPU improve the response time?
  3. How can I improve the answer quality?
  4. How can I make the model answer only questions related to the documents?
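On question 4: with a retrieval setup like the one in the linked repo, the usual way to keep answers document-only is a strict prompt template that instructs the model to refuse when the retrieved context does not contain the answer. A minimal sketch (the template wording below is illustrative, not taken from the repo):

```python
# Sketch of a retrieval-style prompt that constrains the model to the
# retrieved document context. The exact wording is an assumption, not
# the template used by the linked repo.
QA_TEMPLATE = (
    "Use ONLY the following context to answer the question. "
    "If the answer is not contained in the context, reply: "
    "\"I don't know based on the provided documents.\"\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved chunks and the user query."""
    return QA_TEMPLATE.format(context=context, question=question)
```

Lower temperatures (e.g. 0.1) also help keep the model from inventing answers outside the supplied context.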

The CODE

```python
from langchain.llms import CTransformers
from dotenv import find_dotenv, load_dotenv
import box
import yaml
from accelerate import Accelerator
import torch
from torch import cuda
from ctransformers import AutoModelForCausalLM

# Check if GPU is available and set device accordingly
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using Device: {device} in llm.py file")

# Load environment variables from .env file
load_dotenv(find_dotenv())

# Import config vars
with open('config/config.yml', 'r', encoding='utf8') as ymlfile:
    cfg = box.Box(yaml.safe_load(ymlfile))

accelerator = Accelerator()

def build_llm():
    config = {
        'max_new_tokens': cfg.MAX_NEW_TOKENS,
        'temperature': cfg.TEMPERATURE,
        'gpu_layers': 150,
    }
    llm = CTransformers(model=cfg.MODEL_BIN_PATH,
                        model_type=cfg.MODEL_TYPE,
                        config=config)
    llm, config = accelerator.prepare(llm, config)
    return llm
```

This is the part that loads the model, but while querying, the CPU utilization shoots up to 100% while the GPU utilization remains at 2%.
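A likely explanation for the 100% CPU / 2% GPU pattern: the default `ctransformers` wheel is CPU-only, so `gpu_layers` is silently ignored unless the CUDA build is installed (`pip install ctransformers[cuda]`). Also, `Accelerator.prepare` is designed for PyTorch modules and does nothing for a GGML-backed model, so that line can be dropped. A sketch of loading with GPU offload via `ctransformers` directly (the model path and layer count are illustrative assumptions):

```python
# Assumes the CUDA-enabled build:  pip install ctransformers[cuda]
# The config helper is kept separate so the gpu_layers setting is
# explicit and verifiable before the model is loaded.

def gpu_config(max_new_tokens: int, temperature: float, gpu_layers: int = 50) -> dict:
    """Config passed to ctransformers; gpu_layers is the number of
    transformer layers offloaded to the GPU (tune to fit 12 GB VRAM)."""
    return {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "gpu_layers": gpu_layers,
    }

def build_llm(model_path: str, model_type: str = "llama"):
    # Imported lazily so the config helper above stays importable
    # even where the ctransformers wheel is not installed.
    from ctransformers import AutoModelForCausalLM
    cfg = gpu_config(max_new_tokens=256, temperature=0.1)
    # "model_path" would be the local GGML file, e.g. a hypothetical
    # "models/llama-2-7b-chat.ggmlv3.q8_0.bin".
    return AutoModelForCausalLM.from_pretrained(model_path,
                                                model_type=model_type,
                                                **cfg)
```

If the offload is actually working, `nvidia-smi` should show memory allocated by the Python process and nonzero GPU utilization during generation.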

