Wow, at first glance this looks pretty amazing…
(and multimodal models are apparently now called “omni-modal”)
… Qwen3-Omni
“capable of understanding text, audio, images, and video, as well as generating speech in real time.”
Will be adding this to the self-hosted stack I run at home to put it through its paces!
https://github.com/QwenLM/Qwen3-Omni