Exploiting domain structure for music ML tasks

Dr Cătălina Cangea

Cătălina is a Research Scientist at DeepMind, having previously obtained her PhD at the University of Cambridge, supervised by Pietro Liò. Her main interests are learning in challenging data scenarios and using these methods in real-world applications; besides that, she deeply cares about music and is excited about the prospect of developing ML tools that support people who wish to be (more) creative! In the past, she has done research on cross-modal architectures, V&LN and VQA, scene graphs, hierarchical & topological relational representations and Neural Process-based models.

Project

Learning to represent and generate music is a highly relevant task for the machine learning field. This data domain is ideal for density estimation tasks and exhibits many interesting properties, such as long-term dependencies and patterns residing at various scales. More crucially, music generation is a main pillar of creative AI, which complements the ML community efforts in pushing scientific progress and gives us the amazing opportunity to assist artists in their creative process [0]!

Existing state-of-the-art approaches for symbolic music generation [1, 2] operate on and produce sequences of tokens. However, the music domain contains structure at several scales: local (e.g. chords, arpeggios, multiple voices being played together in a fugue) and long-term/global (e.g. ABA form [3], the key that a piece is written in, repeating patterns/motifs). In that sense, there have been relatively few studies (e.g. [4, 5, 6, 7, 8]) that investigate explicit representations of structure for modelling and generating music. Each of these works has certain limitations—for instance, each piece modelled in [4] (see Fig. 2) is turned into a single graph by connecting all notes with temporal links, but does not capture the fundamental music-theoretical dependencies and insights that composers most likely used when writing the piece. Alternatively, the work in [7] and a few others look at rhythm-specific encodings only. Perhaps the closest modelling strategy to the one intended for this project is [8], where the authors choose to encode various musical relationships between bars of the score; however, it would be easier to work with more general graph representations of music, building up from first principles such as the circle of fifths / the tonality of a piece (e.g. use a sliding window and compute relationships / a graph within each local window and between windows at discrete points in the score - there are endless ways to think about this!)

This project aims to study the effects of leveraging music-theoretical graph representations in ML models. The goal is to encode symbolic music sequences in a principled manner that reflects the composition process and underlying structure up to a greater extent that previously. This encoding would then be passed as (additional) input to a model [1, 2, 9]. Finally, the effects on model performance would be studied in classification and/or generation scenarios (TBD based on time constraints).

Steps: 1. Download dataset(s), choose task(s) 2. Become familiar with basic music theory concepts and design 1 or more ways of encoding the structure in symbolic music (see Appendix A.2 [1] for a description of event-based MIDI representations; see Chordify in Resources; see Wiki) 3. Set up model codebase and obtain baseline performance on chosen tasks 4. Add relational structure to the model and find suitable ways to encode it 5. Interpret the new results and investigate changes in model processing (e.g. visualising attention in layers, emerging patterns) 6. Open-source code, to allow researchers to preprocess their own music data and build more graph-based ML models for music tasks!

Resources - Music Transformer GitHub codebase - MusicVAE GitHub codebase - Perceiver AR codebase (to be released soon) - MAESTRO dataset - MetaMIDIDataset - Lakh MIDI dataset - MusPy - A toolkit for symbolic music generation - Chordify - music21 Documentation - Symbolic Representations - Fundamentals of music - A structural way to encode music (https://en.wikipedia.org/wiki/Tonnetz) studied by https://ojs.aaai.org/index.php/AAAI/article/view/11880 (here, the resulting encoding was an image)

References [0] Music 2020 Archives - AI Art Gallery (http://www.aiartonline.com/category/music-2020/) [1] Music Transformer [2] General-purpose, long-context autoregressive modeling with Perceiver AR [3] Ternary form [4] Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance [5] StructureNet: Inducing Structure in Generated Melodies - IR Anthology [6] Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints [7] Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions [8] MELONS: generating melody with long-term structure using transformers and structure graph [9] A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music