TL;DR: I can run some basic inference and TTS, but there is no proper pipeline or Home Assistant integration available, so I’ll go back to Rhasspy next.
In my last post I set up my Roborock S7 (aka Rocki) with Home Assistant and built a voice assistant with the Voice Preview device and Google Gemini models.
In this post, I want to document my exploration into running a proper speech-to-speech omni model and controlling Rocki with it. First step: get the model to run and somehow be able to feed it speech.
Following: https://github.com/ictnlp/LLaMA-Omni2
git clone https://github.com/ictnlp/LLaMA-Omni2
cd LLaMA-Omni2
# sidetrack to install Anaconda: go to https://repo.anaconda.com/archive/
# I selected https://repo.anaconda.com/archive/Anaconda3-2025.06-1-Linux-x86_64.sh
# I run Linux in WSL on Windows
conda create -n llama-omni2 python=3.10
conda activate llama-omni2
pip install -e .
# now run a Python shell to pre-download the Whisper large-v3 speech encoder
python
>>> import whisper
>>> model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
>>> exit()
huggingface-cli download --resume-download ICTNLP/cosy2_decoder --local-dir models/cosy2_decoder
model_name=LLaMA-Omni2-7B
huggingface-cli download --resume-download ICTNLP/$model_name --local-dir models/$model_name
# it's downloading a lot of large files...
# maybe 7B is a little big for my RTX 3060 with 12 GB VRAM, and it might also be slow, so for testing let's grab the smallest 0.5B model
# and who knows how this works: does Whisper run in parallel and keep blocking VRAM? (see the quick check sketched below)
model_name=LLaMA-Omni2-0.5B
huggingface-cli download --resume-download ICTNLP/$model_name --local-dir models/$model_name
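To keep an eye on that, here's a minimal VRAM sanity check of my own (not part of the repo): it loads just the Whisper encoder and prints what PyTorch has allocated, so I can see how much of the 12 GB is left for the rest.
# vram_check.py - rough sketch: how much VRAM does the Whisper encoder alone take?
import torch
import whisper

def report(label):
    used = torch.cuda.memory_allocated() / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{label}: {used:.2f} GiB allocated of {total:.1f} GiB")

report("before loading Whisper")
model = whisper.load_model("large-v3", download_root="models/speech_encoder/", device="cuda")
report("after loading Whisper large-v3")
# whatever is left is what the LLaMA-Omni2 checkpoint and the cosy2_decoder have to fit into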
# FIX 1: we somehow need matcha-tts; I ran into errors, and installing it lets the demo run
pip install matcha-tts
# FIX 2: install ffmpeg (source: https://gist.github.com/ScottJWalter/eab4f534fa2fc9eb51278768fd229d70)
sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install ffmpeg
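A quick way to confirm both fixes took, just a sketch of my own (the importable module name `matcha` for the matcha-tts package is an assumption):
# check_fixes.py - verify ffmpeg is on PATH and matcha-tts is importable
import shutil
import importlib.util

print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND")  # whisper shells out to ffmpeg to load audio
print("matcha:", "found" if importlib.util.find_spec("matcha") else "NOT FOUND")  # module name is an assumption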
# open 3 terminals, make sure to activate the conda environment in each
# 1)
python -m omni_speech.serve.controller --host 0.0.0.0 --port 10000
# 2)
python -m llama_omni2.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --vocoder-dir models/cosy2_decoder
# this has problems: a jsonable_encoder error; Gemini recommended:
pip install --upgrade pydantic
pip install --upgrade fastapi
# problem persists...
# 3)
model_name=LLaMA-Omni2-0.5B
python -m llama_omni2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path models/$model_name --model-name $model_name
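To check whether the model worker at least registered with the controller, I'd poke the controller's HTTP API. This is a sketch assuming it follows the FastChat/LLaVA-style endpoints (/refresh_all_workers, /list_models); I haven't verified LLaMA-Omni2's exact routes:
# list_workers.py - ask the controller which models are registered (endpoint names are an assumption)
import requests

controller = "http://localhost:10000"
requests.post(f"{controller}/refresh_all_workers", timeout=10)
models = requests.post(f"{controller}/list_models", timeout=10).json()
print("registered models:", models)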
Ok, it didn’t work out of the box. 😐
Trying the local inference Python script instead. This works!
I adapted the questions file, recorded my own audio, and got a response to:
“Why is the sky blue?”
### examples/questions.json
[
    {
        "id": "helpful_base_0",
        "conversation": [
            {
                "from": "human",
                "speech": "examples/wav/whyskyblue.wav"
            }
        ]
    }
]
###
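For the recording itself, all that's needed is a WAV at examples/wav/whyskyblue.wav. A minimal sketch using the sounddevice and soundfile packages (neither is part of the repo; 16 kHz mono is my assumption, since Whisper resamples to 16 kHz anyway, and microphone access under WSL may need extra setup):
# record_question.py - record a few seconds from the default microphone to a mono 16 kHz WAV
import sounddevice as sd
import soundfile as sf

SECONDS = 5
RATE = 16000  # Whisper works on 16 kHz mono audio

print("Recording... ask your question now.")
audio = sd.rec(int(SECONDS * RATE), samplerate=RATE, channels=1)
sd.wait()  # block until the recording is finished
sf.write("examples/wav/whyskyblue.wav", audio, RATE)
print("saved examples/wav/whyskyblue.wav")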
output_dir=examples/$model_name
mkdir -p $output_dir
python llama_omni2/inference/run_llama_omni2.py \
--model_path models/$model_name \
--question_file examples/questions.json \
--answer_file $output_dir/answers.jsonl \
--temperature 0 \
--s2s
python llama_omni2/inference/run_cosy2_decoder.py \
--input-path $output_dir/answers.jsonl \
--output-dir $output_dir/wav \
--lang en
Timing for my single question:
run_llama_omni2.py takes ~18.6 s
run_cosy2_decoder.py takes ~14.1 s
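To see what actually came back before the decoder turns it into audio, I peek into answers.jsonl. I haven't checked the exact schema run_llama_omni2.py writes, so this sketch just dumps whatever fields each line contains:
# peek_answers.py - print the fields of each JSON line in answers.jsonl (schema unknown, so dump everything)
import json

with open("examples/LLaMA-Omni2-0.5B/answers.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for key, value in record.items():
            text = str(value)
            print(f"{key}: {text[:120]}{'...' if len(text) > 120 else ''}")
        print("-" * 40)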
Result: the spoken answer ends up as a WAV file in $output_dir/wav.
So: I think I need a more out-of-the-box approach here. Maybe go back to Rhasspy?!