by Daniel Seah


Microsoft’s DeepSinger mines “singing training data” collected from the web

Could burgeoning speech technologies greatly impact the music industry of tomorrow?



A team of Microsoft researchers has published details of DeepSinger, a technology currently in development, described as possibly “the first SVS [Singing Voice Synthesis] system built from data directly mined from the web”.

According to their research paper, DeepSinger was built from scratch, drawing from songs (or “singing training data” as they call it) mined from various music websites.

Songs were separated into vocal and instrument tracks, and then further segmented into individual sentences and phonemes. Data drawn from this was then filtered and used to create a “singing model” based on another Microsoft text-to-speech technology in development, FastSpeech. The result is a synthetic voice trained to ‘sing’ in Chinese, Cantonese and English.
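To make the filtering step a little more concrete, here is a minimal Python sketch of how sentence-level segments might be screened before training. It is not taken from the research paper: the `Segment` type, the `align_score` field, the `filter_segments` helper and the 0.8 threshold are all hypothetical names and values chosen for illustration, on the assumption that each segment carries some quality score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One sentence-level slice of a separated vocal track (hypothetical type)."""
    lyrics: str            # lyric text for the sentence
    phonemes: List[str]    # phoneme sequence for the sentence
    align_score: float     # assumed quality score in [0, 1]; not from the paper


def filter_segments(segments: List[Segment], min_score: float = 0.8) -> List[Segment]:
    """Keep only segments whose assumed quality score meets the threshold."""
    return [s for s in segments if s.align_score >= min_score]


if __name__ == "__main__":
    mined = [
        Segment("la la", ["l", "a", "l", "a"], 0.9),
        Segment("noisy take", ["n", "oi", "z", "i"], 0.3),
    ]
    kept = filter_segments(mined)
    print([s.lyrics for s in kept])  # prints ['la la']
```

In a real system the score would come from whatever alignment or quality measure the pipeline uses; the point here is only that low-quality web-mined segments are discarded before the singing model sees them.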

Listen to audio samples from DeepSinger, here.

While the technology is still too young to see commercial implementation, it may have implications for the future of the music industry. Its data could be employed to improve existing Auto-Tune technologies, or it could be implemented in commercial music production in a similar way to Japan’s already well-established Vocaloid market.

However, as with other burgeoning speech technologies driven by artificial intelligence (AI), the concern remains that this may pave the way for malicious deepfakes and copyright infringement in the future.

Earlier this year, Jay-Z’s record label, Roc Nation, reportedly took legal action against a YouTube channel over deepfake videos that used AI to make Jay-Z appear to perform Billy Joel’s We Didn’t Start The Fire and the ‘to be or not to be’ soliloquy from Shakespeare’s Hamlet.

Elsewhere in their research paper, Microsoft researchers noted that in future, they would “leverage more advanced neural-based vocoders such as WaveNet, and jointly train the singing model and vocoder for better voice quality.”

Read the entire research paper, here.
