Based on this repo (GitHub Link), I am trying to build a system that answers user queries.
I was able to run the model on a CPU with a response time of ~60 s. Now I want to improve the response time, so I am trying to load the model onto a GPU.
System specs:
- Processor: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20 GHz (2195 MHz), 2 cores, 2 logical processors, 24 GB RAM
- GPU: Nvidia A40-12Q with 12 GB VRAM
Here are my questions:
- How do I load Llama 2 (or any model) onto the GPU?
- Will the response time improve if the model is loaded onto the GPU?
- How can I improve the answer quality?
- How can I make the model answer only questions related to the documents? (See the prompt sketch after this list.)
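
For the last question, this is roughly what I have in mind: a sketch using LangChain's `PromptTemplate`, where the wording of the instruction is my own, not from the repo.

```python
# Sketch of a prompt that tells the model to answer only from the
# retrieved documents; the exact wording here is my own idea.
from langchain.prompts import PromptTemplate

qa_template = """Use only the context below to answer the question.
If the answer is not contained in the context, say "I don't know."

Context: {context}
Question: {question}

Answer:"""

QA_PROMPT = PromptTemplate(
    template=qa_template,
    input_variables=["context", "question"],
)
```

Is this kind of restrictive prompt the right approach, or is there a better way?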
The code (llm.py):
```python
from langchain.llms import CTransformers
from dotenv import find_dotenv, load_dotenv
import box
import yaml
from accelerate import Accelerator
import torch
from torch import cuda
from ctransformers import AutoModelForCausalLM

# Check if GPU is available and set device accordingly
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using Device: {device} in llm.py file")

# Load environment variables from .env file
load_dotenv(find_dotenv())

# Import config vars
with open('config/config.yml', 'r', encoding='utf8') as ymlfile:
    cfg = box.Box(yaml.safe_load(ymlfile))

accelerator = Accelerator()


def build_llm():
    config = {
        'max_new_tokens': cfg.MAX_NEW_TOKENS,
        'temperature': cfg.TEMPERATURE,
        'gpu_layers': 150
    }
    llm = CTransformers(model=cfg.MODEL_BIN_PATH,
                        model_type=cfg.MODEL_TYPE,
                        config=config)
    llm, config = accelerator.prepare(llm, config)
    return llm
```
This is the part that loads the model, but while querying, the CPU utilization shoots up to 100% and the GPU utilization stays at around 2%.
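
To check whether GPU offload works at all, I am also considering the minimal test below, which loads a model with ctransformers directly, bypassing LangChain and Accelerate. The model repo and file names are placeholders, and it assumes ctransformers was installed with CUDA support (`pip install ctransformers[cuda]`).

```python
# Minimal sketch to test GPU offload with ctransformers directly.
# Assumes ctransformers was installed with CUDA support,
# e.g. `pip install ctransformers[cuda]`.
# The repo and file names below are placeholders for a quantized Llama 2 model.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",            # placeholder model repo
    model_file="llama-2-7b-chat.Q4_K_M.gguf",   # placeholder quantized file
    model_type="llama",
    gpu_layers=50,  # number of layers to offload to the GPU
)

print(llm("What is the capital of France?"))
```

Would a direct test like this be the right way to confirm whether the GPU is actually being used, before going back through the LangChain wrapper?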