This text was originally performed as a Zoom essay at CCA NTU Singapore. An archival version with embedded audio is available here, and in German translation here. But we have also extracted the text for the purposes of legibility.
A young white man is sitting at a white desk. In front of him is a light grey DECstation 5000/200 computer and a black microphone. A simple melody plays. He begins tapping out a rhythm on the desk with his hands. As he modifies his tempo, the system responds. He is improvising. The machine is listening.
Is the man playing the computer, like an instrument? Or is he playing with the computer, as in a duet? Is he even playing at all?
The man appears slightly bored, pretending not to be aware of his own performance, exploring the limited freedom offered to him by the machine, which tirelessly repeats the melody again and again, infinitely. We are watching a breakthrough moment in human-computer interaction. The computer is doing what the man wants. But still, the man can only want what the machine can do.
The fantasy of an easy, natural interface between man and computer is captured in a 1964 diagram by Andrey Yershov titled the ‘director agent model of interaction’.[2] The man is meant to be in charge. But look closely. We can start anywhere we like. Information cycles around and around, in a constant state of transformation from sound to voltage to coloured light to wet synapses. All of these possibilities are contained in the schematic figure of the arrow. Which is the director here? And which the agent? The diagram itself cycled between the pages of different publications, including The Architecture Machine, a 1970 book by Nicholas Negroponte.[3] Negroponte had set up the Architecture Machine Group at MIT in 1967, which eventually led to his creation of the MIT Media Lab some eighteen years later.
Figure 2: Andrey Yershov, ‘Director Agent Model of Interaction’[4]
Look at the film again. A state-of-the-art, light grey machine is sitting on a white desk. A camera is pointed at it, focused on it. The camera zooms out to reveal a young, white man. Why are we seeing this moment? Why is the camera there to witness it? Judging by the DECstation, the year is probably 1992 or 1993, the location is definitely the MIT Media Lab, and we are looking through a small window into the ‘demo or die’ culture that Negroponte famously instigated there. Demos could excite the general public and impress important visitors. They could attract corporate and government money. The colossal ‘machine listening’ apparatus that we know today has its roots in thousands of demos like this one.
In the 80s and 90s, the demo was a prefigurative device also familiar to the music world. This man, like many of those he worked with, moved between these worlds. He was an engineer, but also a musician. He had come to MIT, in fact, for a PhD in the Music and Cognition group at the Experimental Music Studio, which had been founded by the composer Barry Vercoe in 1973 and was absorbed into the Media Lab at its founding in 1985.[5]
One of the first students to join this group was another musician-engineer named Robert Rowe. Rowe’s doctoral thesis ‘Machine Listening and Composing: Making Sense of Music with Cooperating Real-Time Agents’ seems to be one of the earliest uses of the phrase ‘machine listening’ in print. ‘A primary goal of this thesis,’ Rowe writes, ‘has been to fashion a computer program able to listen to music’.[6] The term ‘machine listening’ would go on to be taken up widely in computer music circles following the publication of a book based on Rowe’s thesis, in 1993.[7]
The following year the Music and Cognition group rebranded. ‘I have been unhappy with “music (and) cognition” for some time,’ one doctoral student emailed a colleague.
‘It’s not even supposed to describe our group; it was the name of a larger entity including Barry, Tod, Marvin, Ken and Pattie that was dissolved almost two years ago. But I’ve shied away from the issue for fear of something worse. I like Machine Listening a lot. I’ve also thought about Auditory Processing, and I try to get the second floor to describe my demos as Machine Audition. I’m not sure of the precise shades of connotation of the different words, except I’m pretty confident that having “music” in the title has a big impact on people’s preconceptions, one I’d rather overcome.’[8]
‘Machine Listening’ it was. So what began, for Rowe, as a term to describe the so-called ‘analytic layer’ of an ‘interactive music system’[9] became the name of a new research group at MIT[10] and something of a catchall to describe diverse forms of emerging computational auditory analysis, increasingly involving big data and machine learning techniques. As the term wound its way through the computer music literature, it also followed researchers as they left MIT, finding its way into funding applications and the vocabularies of new centers at new institutions.
One such application was authored by a professor at Columbia named Dan Ellis.[11] This is the man sitting at the desk and the author of the email above. Today he works at Google on its ‘Sound Understanding Team’.[12] As Stewart Brand once put it, ‘the world of the Media Lab and the media lab of the world are busily shaping each other.’[13]
Google’s ‘Sound Understanding Team’ is responsible, among other things, for AudioSet, a collection of over 2 million ten-second YouTube excerpts totalling some six thousand hours of audio, all labelled with a ‘vocabulary of 527 sound event categories’.[14] AudioSet’s purpose is to train Google’s ‘Deep Learning systems’ on the vast and expanding YouTube archive, so that, eventually, they will be able to ‘label hundreds or thousands of different sound events in real-world recordings with a time resolution better than one second – just as human listeners can recognize and relate the sounds they hear’.[15]
AudioSet includes 7,000 examples tagged as ‘Classical music’, nearly 5,000 of ‘jazz’, some 3,000 examples of ‘accordion music’ and another 3,000 files tagged ‘music of Africa’. There are 6,000 videos of ‘exciting music’, and 1,737 that are labelled ‘scary’.
In AudioSet’s ‘ontology’, ‘human sounds’, for instance, is broken down into categories like ‘respiratory sounds’, ‘human group action’, ‘heartbeats’, and, of course, ‘speech’, which can be ‘shouted’, ‘screamed’ or ‘whispered’. AudioSet includes 1,500 examples of ‘crumpling and crinkling’, 127 examples of toothbrushing, 4,000 examples of ‘gunshots’ and 8,500 ‘sirens’.[16]
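For readers who want a feel for how this hierarchy is put together, the short sketch below walks the category tree as it appears in the publicly released ontology.json file that accompanies AudioSet. It is only a sketch: it assumes a local copy of that file, in which each category entry carries at least ‘id’, ‘name’ and ‘child_ids’ fields, and the print_subtree helper is illustrative rather than part of any Google tooling.

```python
import json

# A minimal sketch of exploring AudioSet's category hierarchy.
# Assumes a local copy of ontology.json from the public AudioSet release,
# where each entry is a dict with (at least) 'id', 'name' and 'child_ids'.

with open("ontology.json") as f:
    nodes = {node["id"]: node for node in json.load(f)}

def print_subtree(name):
    """Print every category the ontology files beneath the category `name`."""
    root = next(n for n in nodes.values() if n["name"] == name)

    def walk(node, depth):
        print("  " * depth + node["name"])
        for child_id in node.get("child_ids", []):
            walk(nodes[child_id], depth + 1)

    walk(root, 0)

# For example, everything grouped under 'Human sounds':
print_subtree("Human sounds")
```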
This is the world of machine listening we inhabit today: a world audited by Borgesian ontologies like this one, distributed across proliferating smart speakers, voice assistants, and other interactive listening systems; a world that attempts to understand and analyse not just what we say, but how and where we say it, along with the sonic and musical environments we move through and are moved by. Machine listening is not only becoming ubiquitous, but increasingly omnivorous too.