SUTD Researchers Develop First-of-its-kind Artificial Intelligence Model that Controls Music with Prompts

04 Dec 2023

Dreaming of making your own music like Taylor Swift, but you have no musical inclination at all?

Fret not, Mustango[1] is here.

Researchers from the Singapore University of Technology and Design (SUTD) have developed a first-of-its-kind Artificial Intelligence (AI) model, inspired by music-domain knowledge, that generates music from text prompts.

Known as Mustango, this AI model generates music that can be controlled not only with general text captions, but also with richer, more detailed captions containing specific instructions on chords, beats, tempo, and key – a level of control that other AI models do not offer. For example, where a general caption might simply ask for a relaxing jazz piece, a rich caption could additionally specify a key of A minor, a tempo of around 90 beats per minute, and the chord progression Am, F, C, G.

Mustango was developed over about six months by SUTD Assistant Professor Soujanya Poria and Assistant Professor Dorien Herremans, together with their PhD students, Mr Jan Melechovsky, Mr Zixun Guo and Mr Deepanway Ghosal, and post-doctoral researcher Dr Navonil Majumder. It evolved from TANGO, a text-to-audio AI model that produces sound effects, including human speech and music, from text. For example, users could type "make the sound of a barking dog with a huge waterfall behind it" and TANGO would generate the sound within a few seconds. TANGO's novelty even drew interest from a Hollywood sound designer who wanted to explore licensing the model.

Said Dr Poria: "Mustango democratises music generation research by open sourcing its model architecture and training dataset, addressing the gap in progress compared to Large Language Models (LLMs) and text-to-image generation. With a unique focus on controllability, Mustango enables users to input a chord sequence and specify tempo preferences, providing unprecedented flexibility for music composers, sound designers, and podcasters."

For 30 days, Dr Poria and Dr Herremans' team "trained" Mustango on the novel MusicBench dataset. Because open datasets of music with text captions are scarce, MusicBench was created in-house using an innovative data augmentation method that alters the harmonic, rhythmic, and dynamic aspects of music audio. The team then used state-of-the-art Music Information Retrieval (MIR) methods to extract musical features, which were appended to the existing captions as text. As a result, Mustango can generate music that is controllable in terms of chord sequences, beat, key, and tempo. In extensive experiments, music generated by Mustango outperformed that of other AI models in adhering to the desired chords, beat, key, and tempo.
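To make the idea concrete, the sketch below shows the general flavour of such a pipeline using the open-source librosa library: an audio clip is time-stretched, pitch-shifted, and rescaled, a tempo estimate is extracted, and the estimate is appended to the caption. This is a minimal illustration of the approach described above, not the team's actual MusicBench code; the file names and caption template are invented.

```python
# A minimal sketch of augmentation plus MIR-based captioning, in the spirit of
# the MusicBench pipeline described above. Illustrative only: file names and
# the caption template are invented, and this is not the team's actual code.
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=None, mono=True)

# Rhythmic augmentation: time-stretch the clip to change its tempo.
y_faster = librosa.effects.time_stretch(y, rate=1.1)

# Harmonic augmentation: shift the pitch up by two semitones.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Dynamic augmentation: rescale the amplitude to change the volume.
y_quieter = 0.5 * y

# Music Information Retrieval: estimate the tempo of the stretched clip...
tempo, _ = librosa.beat.beat_track(y=y_faster, sr=sr)

# ...and append the extracted feature to the existing text caption.
caption = "A calm piano piece accompanied by soft strings."
caption += f" The tempo of this song is around {float(tempo):.0f} beats per minute."

sf.write("clip_faster.wav", y_faster, sr)
print(caption)
```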

Said Dr Herremans: "Imagine DALL-E[2] but for music. When designing our model, we envisioned how a generative music AI that understands music theory text prompts could ease the lives of digital creators and composers alike. In 1843, the world's first programmer, Ada Lovelace, hinted that computers could be used to 'compose elaborate and scientific pieces of music'. It was not until this year that we really saw a breakthrough in directly generating music audio with AI models. It's exciting to see this evolution in music technology, and I am excited to see that Mustango is already being used by many digital creators."

Mustango is currently available for testing on Hugging Face, an open-source data science and machine learning platform, and is gaining traction within the community[3].
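For developers, a Hugging Face Space like this can also be driven programmatically. The snippet below is a hypothetical sketch using the gradio_client library; the prompt argument and the api_name for the Mustango Space are assumptions here and should be checked against the Space's "Use via API" page.

```python
# A hypothetical sketch of calling the Mustango Space from Python with the
# gradio_client library. The argument list and api_name are assumptions;
# check the Space's "Use via API" page for the real endpoint signature.
from gradio_client import Client

client = Client("declare-lab/mustango")
result = client.predict(
    "This is a country song with an acoustic guitar melody. "
    "The tempo is around 120 beats per minute and the key is D major.",
    api_name="/predict",  # assumed endpoint name
)
print(result)  # typically a path to the generated audio file
```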

The team hopes to continue its research and expand Mustango’s use by enabling it to generate longer sequences of music and a singing voice.

Dr Chen Zhangyi, Assistant Professor (Music Theory and Composition) of the Yong Siew Toh Conservatory of Music, National University of Singapore, tested Mustango and found that most of the generated clips sounded quite accurate, with timbres, styles, and moods indicative of the text prompts.

He said: "The specific musical elements such as tempo, key, and chord progressions seem to work quite well. My immediate response is that this may be applicable to creating short clips for social media audio, jingles, and film and game music. From a composer's point of view, I think it can serve as a quick resource to generate a rough idea of what a piece could be, with the wildest mix of styles, genres and sounds one could think of. By restricting oneself to writing a very specific text prompt, it could be a very helpful tool for focusing in on what the music should sound like, and for speeding up the preparatory work a composer might do before writing a work. Incorporating this innovative text-to-music AI in the creative process could also be a great way of creating and learning musical material, side by side with conventional music research and compositional craft."

Sample music clips from Mustango.

[1] Melechovsky, J., Guo, Z., Ghosal, D., Majumder, N., Herremans, D., & Poria, S. (2023). Mustango: Toward Controllable Text-to-Music Generation. arXiv preprint arXiv:2311.08355. https://arxiv.org/abs/2311.08355

[2] DALL-E is a neural network by OpenAI that creates images from text captions.

[3] Mustango can be found here: https://huggingface.co/spaces/declare-lab/mustango. Mustango has seen high GPU utilisation, and tweets about the AI model have received more than 51K views in less than a week.