A team of Microsoft researchers has published information on DeepSinger, a technology currently in development, described as possibly “the first SVS [Singing Voice Synthesis] system built from data directly mined from the web”.
According to their research paper, DeepSinger was built from scratch, drawing from songs (or “singing training data” as they call it) mined from various music websites.
Songs were separated into vocal and instrumental tracks, and the vocals were then further segmented into individual sentences and phonemes. The resulting data was filtered and used to train a “singing model” based on another Microsoft text-to-speech technology in development, FastSpeech. The result is a synthetic voice trained to ‘sing’ in Chinese, Cantonese and English.
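The pipeline described above can be sketched in rough pseudocode-style Python. Every function name, data format and threshold below is hypothetical, chosen only to illustrate the mine-separate-segment-align-filter flow, and none of it comes from the DeepSinger paper itself:

```python
# Hypothetical sketch of a DeepSinger-style data pipeline:
# mine songs -> separate vocals -> segment into sentences ->
# align phonemes -> filter low-confidence clips -> training data.

def separate_vocals(song):
    # Placeholder for a source-separation step that would split a
    # mixed recording into vocal and instrumental stems.
    return song["mix"]  # pretend the field already holds the vocal stem

def segment_sentences(vocals):
    # Placeholder: split the vocal track into sentence-length clips.
    # Here "|" stands in for detected silence between sentences.
    return [clip for clip in vocals.split("|") if clip]

def align_phonemes(clip):
    # Placeholder: return (phoneme, confidence) pairs for one clip,
    # as a lyrics-to-audio alignment model might.
    return [(ph, 0.9) for ph in clip.split()]

def filter_low_quality(aligned_clips, threshold=0.5):
    # Keep only clips whose average alignment confidence clears the bar,
    # mirroring the paper's filtering of noisy web-mined data.
    def avg(pairs):
        return sum(conf for _, conf in pairs) / len(pairs)
    return [pairs for pairs in aligned_clips if avg(pairs) >= threshold]

def build_training_data(songs):
    data = []
    for song in songs:
        vocals = separate_vocals(song)
        for clip in segment_sentences(vocals):
            data.append(align_phonemes(clip))
    return filter_low_quality(data)

# Toy example: one "song" whose vocal stem contains two sentences
# of space-separated phoneme labels.
songs = [{"mix": "n i h ao|h en h ao"}]
training_data = build_training_data(songs)
```

The filtered output would then feed the FastSpeech-based singing model; in a real system each stage would be a trained neural component rather than the string operations used here.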
Listen to audio samples from DeepSinger here.
While the technology is still too young for commercial implementation, it may have implications for the future of the music industry. It could be used to improve existing pitch-correction tools such as Auto-Tune, or implemented in commercial music production in a similar way to Japan’s already well-established Vocaloid market.
However, as with other burgeoning artificial intelligence (AI)-driven speech technologies, the concern remains that this may pave the way for malicious deepfakes and copyright infringement in the future.
Earlier this year, Jay-Z’s record label, Roc Nation, reportedly took legal action against a YouTube channel over deepfake videos that used AI to make Jay-Z appear to perform Billy Joel’s We Didn’t Start The Fire and the ‘to be or not to be’ soliloquy from Shakespeare’s Hamlet.
Elsewhere in the research paper, the researchers noted that in the future they would “leverage more advanced neural-based vocoders such as WaveNet, and jointly train the singing model and vocoder for better voice quality.”
Read the entire research paper here.
For more music technology news, click here.