A real wow moment for me.

I learnt last week from the team at Moonshine.ai about their newly released "v2" streaming-encoder implementation of the Moonshine automatic speech recognition model.

Streaming enables efficient compute and low latency. By bounding the input context of the transformer encoder, it avoids redundant recomputation over earlier speech. This is an important requirement for efficient on-device deployment, such as on a smartphone in a real product.
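The cost saving from a bounded context can be sketched with a toy model of per-step encoder work (this is an illustrative sketch, not Moonshine's actual implementation; the chunk sizes and window length are made-up numbers):

```python
# Illustrative sketch only: compare the total frames an encoder must
# process when it re-encodes the full history at each step versus when
# its input context is bounded to a fixed window (as in streaming).

def frames_encoded_unbounded(chunk_sizes):
    """Re-encode the entire history at every step: cost grows with length."""
    total, history = 0, 0
    for n in chunk_sizes:
        history += n          # encoder sees everything heard so far
        total += history
    return total

def frames_encoded_bounded(chunk_sizes, max_context):
    """Bound the encoder's input context: per-step cost is capped."""
    total, history = 0, 0
    for n in chunk_sizes:
        history = min(history + n, max_context)  # at most max_context frames
        total += history
    return total

chunks = [80] * 50  # hypothetical: fifty 80-frame chunks of incoming speech
print(frames_encoded_unbounded(chunks))      # grows quadratically with length
print(frames_encoded_bounded(chunks, 400))   # stays linear once the cap is hit
```

With these toy numbers, bounding the context cuts the total encoded frames by roughly 5x, which is the intuition behind the efficiency gain.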

This new model matches the performance of models 4x its size while running 3x-8x faster. As ever, standardized comparisons with other models are revealing, and the different-sized versions of this new model rank highly on the Hugging Face OpenASR leaderboard.

Bravo!