I'm trying to add text-to-speech (TTS) to an open-source LLM served by a local API that streams the responses to my questions, but it's proving very hard and I can't find anything on the subject.
Here's the code:
import threading
import time
from queue import Queue

import pyttsx3
from openai import OpenAI

tts_engine = pyttsx3.init()
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

history = [
    {"role": "system", "content": "Vous êtes un assistant intelligent appelé Bob. Vous fournissez toujours des réponses rapides et précises, à la fois justes et utiles et toujours en langue française."},
    {"role": "user", "content": "Bonjour, présentez-vous à quelqu'un qui ouvre ce programme pour la première fois. Soyez concis."},
]

while True:
    user_input = input("> ")
    history.append({"role": "user", "content": user_input})
    start_time = time.time()  # start of the API request

    completion = client.chat.completions.create(
        model="local-model",
        messages=history,
        temperature=0.8,
        stream=True,
    )

    new_message = {"role": "assistant", "content": ""}
    for chunk in completion:
        if chunk.choices[0].delta.content:
            generated_text = chunk.choices[0].delta.content
            print(generated_text, end="", flush=True)
            new_message["content"] += generated_text
    history.append(new_message)

    end_time = time.time()
    response_time = end_time - start_time
    print("\nAPI response time:", response_time, "seconds")
So I tried multiple things, such as a loop that watches for each new word as it's generated, but I don't think that's a good approach and it doesn't work.
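For reference, here's a minimal sketch of the producer/consumer pattern I was attempting with threading and Queue (the names like speak_worker and the fake chunk list are just my own stand-ins, not from any library): buffer the streamed deltas, cut off a chunk whenever a sentence terminator appears, push each sentence onto a queue, and have a worker thread "speak" it. To keep it runnable without audio, I've replaced the pyttsx3 call with appending to a list:

```python
import threading
from queue import Queue

# Hypothetical stand-in for the streamed deltas from the local API.
fake_chunks = ["Bonjour", ", je suis Bob", ". Comment puis-je", " vous aider ?", ""]

speech_queue = Queue()
spoken = []  # stands in for tts_engine.say(); real code would speak instead

def speak_worker():
    # Worker thread: pull complete sentences off the queue and "speak" them.
    while True:
        sentence = speech_queue.get()
        if sentence is None:  # sentinel: the stream is finished
            break
        # Real version: tts_engine.say(sentence); tts_engine.runAndWait()
        spoken.append(sentence)

worker = threading.Thread(target=speak_worker)
worker.start()

buffer = ""
for delta in fake_chunks:
    buffer += delta
    # Flush the buffer whenever a sentence terminator appears.
    while any(p in buffer for p in ".!?"):
        cut = min(buffer.index(p) for p in ".!?" if p in buffer) + 1
        speech_queue.put(buffer[:cut].strip())
        buffer = buffer[cut:]
if buffer.strip():
    speech_queue.put(buffer.strip())  # flush any trailing partial sentence

speech_queue.put(None)  # tell the worker to stop
worker.join()

print(spoken)  # the sentences that would have been spoken, in order
```

The idea is that the worker speaks sentence by sentence while the main loop keeps reading from the stream, instead of waiting for the full response.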
Any suggestions?