I've noticed that Google Translate and some other apps have this feature for text-to-speech audio output. The first press of the audio button plays it at the normal rate, and the second press plays it back at a slower rate, i.e. odd-numbered presses are normal speed and even-numbered presses are slower. This helps slower listeners like me pick apart the words from the normal-speed playback.
The ideal way is of course to have a second recording, read deliberately slowly with each syllable clearly pronounced. It has the added benefit of helping us tell deliberate pronunciation apart from the "slurs" of normal speech. However, it is probably costly to do it this way.
In the absence of a second recording, the cheaper way is probably to adjust the playback rate of the HTML audio element. We would still have to figure out the "slurs" in the normal speech ourselves, though.
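The alternating-press behaviour described above could be sketched roughly like this. It is only an illustration: `makeSpeedToggler` and the `slowRate` default of 0.75 are made-up names and values, not anything the site currently has. The only real API used is the standard `playbackRate`, `currentTime`, and `play()` of an audio element.

```javascript
// Hypothetical helper: alternates between normal and slow playback on each press.
// `audio` is any object with `playbackRate`, `currentTime`, and `play()`,
// such as an HTMLAudioElement. `slowRate` is an assumed default, not a site setting.
function makeSpeedToggler(audio, slowRate = 0.75) {
  let presses = 0;
  return function onPress() {
    presses += 1;
    // Odd-numbered presses play at normal speed, even-numbered presses slower.
    audio.playbackRate = presses % 2 === 1 ? 1.0 : slowRate;
    audio.currentTime = 0; // restart from the beginning on every press
    audio.play();
    return audio.playbackRate;
  };
}

// Possible browser usage (the selector is an assumption):
// const audio = new Audio('https://iserve.renshuu.org/audio_reibun_norm/35831.mp3');
// document.querySelector('#play-button').addEventListener('click', makeSpeedToggler(audio));
```

Keeping the press counter inside the closure means each audio button tracks its own odd/even state.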
I have tried doing that (via JavaScript), and the audio sounded *really* unnatural for many of the files I tested it on. I may look into it again in the future, but I was not impressed the first time I tried it.
You're right that by reducing the playback rate we're letting the browser's algorithm generate audio frames where there are none, which sometimes creates sound artifacts and worsens the output quality.
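One thing that might be worth trying here is the standard `preservesPitch` property on media elements. Browsers that support it default it to `true`, which keeps the pitch constant by time-stretching the audio. Setting it to `false` skips that step, so the voice sounds lower but may have fewer stretching artifacts. This is a minimal sketch: `playSlowed` is a hypothetical helper name, and whether it actually sounds better on these files is untested.

```javascript
// Hypothetical helper: play at a reduced rate, with or without pitch correction.
// `audio` is an HTMLAudioElement (or any object with the same properties).
function playSlowed(audio, rate, keepPitch) {
  // Only touch preservesPitch where the browser exposes it.
  if ('preservesPitch' in audio) {
    audio.preservesPitch = keepPitch;
  }
  audio.playbackRate = rate;
  audio.play();
}

// Possible browser usage:
// const audio = new Audio('https://iserve.renshuu.org/audio_reibun_norm/35831.mp3');
// playSlowed(audio, 0.75, false); // slower and lower-pitched, no time-stretching
```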
I tried it with the audio for one sentence:
let audio = new Audio('https://iserve.renshuu.org/audio_reibun_norm/35831.mp3');
audio.play(); // test play at normal rate
audio.playbackRate = 0.75; // set to 0.75 playback rate
audio.play(); // test play at 0.75 playback rate
audio.playbackRate = 0.5; // set to 0.5 playback rate
audio.play(); // test play at 0.5 playback rate
I'm running Firefox v112. The audio is a little weird at a rate of 0.5, but less so at 0.75.
I understand that there's a large library of audio, and the effect may differ from file to file because of differences in how each one was produced.
Perhaps it could be considered as an "experimental" limited-release option behind a flag, off by default?
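As a rough idea of what an off-by-default flag could look like, here is a small sketch. The settings shape and the names `experimentalSlowReplay`, `slowRate`, and `rateForPress` are all made up for illustration and are not existing site settings.

```javascript
// Hypothetical settings; the flag is off unless the user opts in.
const DEFAULT_SETTINGS = { experimentalSlowReplay: false, slowRate: 0.75 };

// Returns the playback rate to use for the Nth press of the audio button.
function rateForPress(pressNumber, userSettings = {}) {
  const settings = { ...DEFAULT_SETTINGS, ...userSettings };
  const isEvenPress = pressNumber % 2 === 0;
  // Slow down only on even-numbered presses, and only when the flag is on.
  return settings.experimentalSlowReplay && isEvenPress ? settings.slowRate : 1.0;
}
```

With the flag left off, every press returns 1.0, so nothing changes for users who never opt in.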