According to their tech/marketing papers, it's supposedly multimodal and encodes audio into tokens.