So far, we’ve only scratched the surface of the question: “Yes, but what am I actually listening to?” So now it’s all about content. It’s hard to say in general terms where 3D audio actually makes for good content, because depending on the context, completely different technology can sit behind it in the form of formats or game engines.
That’s why I’ve come up with a structure that I’ll simply call the “3D audio matrix” – or is it a pyramid? Either way, it’s intended to provide a reference for applications and their prime examples, the advantages and disadvantages of 3D sound, and of course an overview of formats and tools with their respective quirks. To understand it, you have to take a few steps back: 3D audio is not just 3D audio. So let’s first go back several dimensions to the origin, i.e. from 3D audio to 2D, 1D, 0D.
The 3D audio matrix overview
What, 0D? Admittedly, that’s a bit abstract. How can you visualise it? I’m taking a mathematical approach here – don’t worry, it won’t be any more difficult than your first geometry lesson. Imagine a coordinate system with our head at the origin.
Now let’s combine this with the question: what kind of geometric object do I have?
- 0D: A point in the coordinate system without spatial information, the origin (0|0|0)
- 1D: A line; here I can move from left to right (x|0|0)
- 2D: A plane; now I can also move forwards and backwards (x|y|0)
- 3D: A cube/sphere that adds height information (x|y|z)
Audio formats that you already know
So far so good. Now imagine you want to place an audio object in a room. There are already audio formats that we know from our everyday lives.
- 0D: Mono. No spatial information can be added.
- 1D: Stereo. You can at least move your sound to the left and right.
- 2D: Surround. With 5.1, for example, you can also place sound behind you.
- 3D: 3D audio. Here you can also move sound up or down.
But wait, there’s more: degrees of freedom
All overviews of this kind that I know of stop at Dolby Atmos, but beyond that things are only just getting started with 3 or 6 degrees of freedom. OK – what are degrees of freedom? Abbreviated as DoF, they describe the following.
- 0DoF: It is not defined where you are actually looking while consuming the content, except that you look forwards – as with films, you are not supposed to turn around.
- 3DoF: Here you can also rotate your gaze, as we know it from 360° videos. Also known as head tracking (rotation).
- 6DoF: This concept is already familiar from 3D games, where the player also moves through a 3D space (translation).
Headphones simplify understanding
It’s as simple as that – or is it? Because it’s always a question of which direction I’m looking at my concept from. Here I am referring exclusively to headphone playback. Do some of you remember in-head localisation and externalisation from the last article? It’s a good example of how mono and stereo are always perceived inside the head. Even if I add reverb to create a sense of depth, I can still only move an object to the left and right, using only differences in level and time (ILD, ITD). The third factor (HRTF) is missing, which is what really lets me differentiate between front and back. For me, stereo is therefore one-dimensional (left/right), which is often confused with its two channels.
With surround sound via headphones, a binaural attempt is made to generate an impression of front and rear – with 3D audio, height information is added on top. This is where the aforementioned HRTF comes into play, so that during rendering the sound really “travels out of our head”: we go from in-head localisation to externalisation. At the end of the day, a two-channel stereo signal is still what gets played back. But watch out: it’s not “normal” stereo, it’s binaural – stereo with an HRTF-filtered extension.
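If you prefer code to words, here is a minimal sketch of that binaural idea in Python (using NumPy and SciPy; the HRIR arrays are placeholders you would load from a measured HRTF set, both the same length). The point is simply that one mono source plus a direction-specific filter pair yields the two-channel binaural signal described above.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono source with an HRIR pair -> 2-channel binaural signal.

    The HRIRs already encode ILD (level difference), ITD (time difference)
    and the spectral filtering of the outer ear (the actual HRTF part),
    which is what lets the brain place the source outside the head.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=0)
```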


With loudspeakers, the walls of the concept start to shake
It’s not so easy for loudspeakers, because even if I only have one loudspeaker in the room, it’s still “somehow three-dimensional”. Besides, the term stereophony actually just means “more than mono”, so strictly speaking it would even include 3D audio – but I think in everyday production everyone understands stereo as a “two-channel audio file”. For stereo playback, you place two speakers so that they form an equilateral triangle with yourself as the third point. And as we know, triangles are two-dimensional. Nevertheless, I can only move my sound between the two loudspeakers, not beyond them. Even if I add reverb, you get a feeling of depth, but you don’t know whether that room is actually in front of or behind you. Still, two-channel stereo remains a solid reproduction method and, in my opinion, far from broken, regardless of what the various hectic marketing newsletters claim.
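To make the “one-dimensional” claim concrete: a standard equal-power pan law, sketched in Python, shows that stereo placement boils down to a single position value between the two speakers – nothing more.

```python
import numpy as np

def pan_stereo(mono: np.ndarray, position: float) -> np.ndarray:
    """Equal-power pan law: position -1.0 (hard left) ... +1.0 (hard right).

    This is all stereo can do: trade level between two channels along a
    single left/right axis -- one dimension, however many channels it has.
    """
    angle = (position + 1.0) * np.pi / 4.0   # map [-1, 1] -> [0, pi/2]
    return np.stack([np.cos(angle) * mono, np.sin(angle) * mono], axis=0)
```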
0-2D aka “normal media”
OK, of course I could go on and on about where to find mono, stereo and surround content. But that’s everyday life. We use mono every day for voice messages, we stream music in stereo, and with the right TV you can watch films in surround. The more I think about it, the more I realise that surround is not that far removed from 3D audio in this respect: all that’s missing is the height information – applause. So the leap from stereo to surround actually seems greater than the one from surround to 3D. I’d say the same for film, because when streaming, the great 3D sound from the cinema has to be compressed and usually ends up as a 5.1 mix that tries to retain a few of the height elements.

Audio should be immersive
But we remember that immersive audio should ideally be so natural that we don’t even think much about the technology. And 3D audio brings us a lot closer to this impression than surround sound alone.
However, there is another reason why 3D audio has more under the bonnet than just height information. The surround formats meant here are quadraphonic 4.0, 5.1 or 7.1. The numbers describe channels; for 5.1, channels 1 to 6 are: Front Left, Front Right, Centre, LFE (Low Frequency Effects), Left Surround, Right Surround.
So if you want height information, you need even more channels, such as 5.1.4 – then you have four more speakers on the ceiling. But as you can guess, that’s fairly impractical. And what if I have a 7.1.2 system – how are the channels converted? That’s why audio technology is moving away from channel-based formats and towards so-called NGA, next-generation audio formats.
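To illustrate the conversion problem: with channel-based formats, every translation between layouts is a fixed matrix. Here is a sketch of the common ITU-style 5.1-to-stereo downmix in Python (coefficients follow the widely used BS.775 convention; exact values vary by implementation) – and you would need another such matrix for every further layout, which is exactly the scaling problem NGA sidesteps.

```python
import numpy as np

# Rows = stereo out (L, R); columns = 5.1 in (FL, FR, C, LFE, Ls, Rs).
# Centre and surrounds are folded in at -3 dB (0.7071); LFE is usually dropped.
DOWNMIX_5_1_TO_STEREO = np.array([
    [1.0, 0.0, 0.7071, 0.0, 0.7071, 0.0],
    [0.0, 1.0, 0.7071, 0.0, 0.0,    0.7071],
])

def downmix(surround: np.ndarray) -> np.ndarray:
    """surround: (6, n_samples) channel-based 5.1 -> (2, n_samples) stereo."""
    return DOWNMIX_5_1_TO_STEREO @ surround
```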

Object-based audio
To understand why 3D audio can be even better than surround sound, let’s take a brief look at object-based audio. A major advantage of object-based audio is its independence from channel layouts, because the rendering only takes place on the end user’s device. Systems that want to use Next Generation Audio must therefore have a corresponding decoder integrated. The result is playback optimised for whatever setup the listener actually has.
Another exciting possibility, in addition to moving audio content through 3D space, is the personalisation of these audio objects. My favourite example is watching a football match, where I can simply mute the “commentator” audio object. Podcasts would be an equally exciting application. Let’s say we want to listen to a news podcast that is an hour long, but we only have 10 minutes: we tell our smartphone, and the podcast is automatically shortened to the most important 10 minutes using metadata.
MPEG-H can do this and is already a broadcast standard in Korea and Brazil. Its biggest competitor is AC-4, aka Dolby Atmos, which only allows such personalisation and interaction within its own specifications.
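Conceptually, an audio object is just the essence (the samples) plus metadata, with all rendering decisions deferred to the decoder. Here is a toy model in Python of the football example – not the actual MPEG-H or AC-4 metadata syntax, the field names are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One audio object: the sound travels together with its metadata;
    the renderer on the end user's device maps it to the actual speakers."""
    name: str
    position: tuple[float, float, float]   # (x, y, z) relative to the listener
    gain: float = 1.0
    user_adjustable: bool = False          # may the listener change/mute it?

scene = [
    AudioObject("stadium_atmosphere", (0.0, 0.0, 0.0)),
    AudioObject("commentator", (0.0, 1.0, 0.0), user_adjustable=True),
]

# Personalisation, as in the football example: mute the commentary object.
for obj in scene:
    if obj.user_adjustable and obj.name == "commentator":
        obj.gain = 0.0
```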

Spatial audio: 0DoF with/without picture?
Let’s take a look at what 3D audio content is actually available right now. Most of it should be familiar, because thanks to Dolby Atmos, consumer favourites such as films and music streaming services are already being supplied. Podcasts too, although in the vast majority of cases they make more sense in stereo – or even mono.

3D audio is better than 3D video?
When I tell people that I do “something with 3D audio”, they immediately ask if I work with “Dolby Atmos”. In fact, Dolby is not that relevant in my immersive audio bubble because it doesn’t enable many things that I need in my daily work. But more on that when it comes to degrees of freedom.
Nevertheless, Dolby Atmos (AC-4 is actually the audio format behind the marketing term) is particularly relevant in the film industry. Thousands of Hollywood blockbusters have already been mixed in this format and people are happy to spend a few euros more to enjoy a surround system in the cinema. Sure, the sound experience is more fun when a helicopter suddenly sounds like it’s flying over your head. But at the end of the day, I would argue that all films also work with stereo sound – or mono. You don’t look in all directions, you only look forwards.
Even though I like to take shots at Dolby Atmos, they do a good job of bringing this surround sound into the living room. Many soundbar models now support playback from the TV via streaming apps. And playback via Apple headphones is also really fun, even on an iPad: although the AirPods Pro are in-ear headphones, you feel enveloped by the sound and can almost save yourself an expensive home cinema. Now let’s look at pure audio enjoyment without visual content: music streaming has long been part of our everyday lives and is becoming increasingly popular. 3D music streaming is the industry’s latest innovation, adding a third dimension to the listening experience.

3D music
Then Dolby came up with the idea of adapting its 3D audio format for music production. Admittedly, I’m a little sceptical here too, because to my ears most songs sounded better in stereo than in the 3D audio formats. In addition to Dolby Atmos Music, Sony is trying to add 360 Reality Audio to the list of 3D formats.
However, one advantage is definitely that you have to make fewer compromises when mixing music and have more options for placing the individual audio tracks. This gives some tracks more depth and you can hear the individual instruments better. You have the feeling that the musicians are sitting around you in the studio.
We are currently in a major learning phase here, similar to the move from mono to stereo. The first Beatles songs in stereo sound interesting by today’s standards; today we know much better how to mix in stereo. The same applies to these 3D music productions. So see for yourself which streaming platforms already offer it and have a listen. I think the quality is getting better and better, and since Apple Music got involved there’s been a bit of a gold-rush atmosphere in the audio community.
And what about 8D audio? Admittedly, it’s maximally confusing to categorise – 8D doesn’t even stand for dimensions, but for directions. Let’s just accept the phenomenon and define it for what it is: music circling around our head, in mono, which works quite well through headphones. In the long run, though, it’s a bit tiring and monotonous, which is where 3D music tries to do better. There you work with the individual tracks of the various instruments and place or move them through the room wherever it suits the composition. That rarely sounds as obviously “3D” as 8D audio does. Just listen to the better and worse examples on my blog and form your own opinion.
Dolby Atmos Podcast
The last remaining audio-only format is the podcast, which Dolby Atmos is now also tackling. Here, however, my toenails curl up a little, for reasons that go beyond the scope of this article. The short version: Audible has jumped on the bandwagon, but AC-4 as a 3D format does not allow certain sound sources to be excluded from the 3D treatment. With audio dramas in particular, this means the narrator’s voice suddenly sits in the 3D scene alongside the protagonists, so the listener can no longer really tell who is actually part of the action.
But I’m already in dialogue with Dolby about this, because what’s the point of only ever complaining? 🙂 There are also other courageous productions, mixed with or without Dolby, trying to answer the question of how well audio dramas work as 3D audio productions. Just listen for yourself and form your own opinion. Head tracking comes later.
Most productions have “the problem” of being produced too classically: record the voice actors in the studio, then move them around virtually with 3D spatialisers and add an atmo on top. That rarely sounds convincing or immersive. May I cite my own work as a positive example? Then I’ll throw my radio play for BKW Engineering into the ring. A more classically produced counterpart is the “Erdsee” audio drama from WDR. I’m curious to hear your opinions.
Advantages and disadvantages of 3D sound
Productions with 3D audio can be fun if it suits the content. But it is also often used as a marketing gimmick, simply to offer listeners something new, to have a unique selling point. That is also a disadvantage, because 3D audio is not a seal of quality. Apart from matters of taste, though, you can usually rely on your ears to decide whether a production works for you or not.
That’s why productions are often made in 3D that would probably have worked just as well in stereo. Especially in music, there are genres where the spatialised version has less punch than the stereo version. In addition, the rendering to headphones is not yet perfect: the mix often sounds somehow duller than you are used to from stereo played directly into your ears.
Here, too, we are in a transitional phase. Users first have to get used to this spatial sound again. It usually works better via loudspeaker systems and I have to admit that I had a lot of fun with Dolby Atmos Music in a demo car.
However, as you can imagine, 3D audio mixes have even more parameters to set per sound. Such mixes are therefore usually more complex, more time-consuming and more expensive than a stereo mix. And, as mentioned, you usually need special equipment to really get the full benefit.

Formats and object-based audio
As mentioned, all of these formats are somehow based on the assumption that the listener is facing the front, where there is usually a centre speaker or a screen. You don’t actually want people looking around the room, because there is usually a TV picture in front of them while they consume 3D audio content.
As I said, Dolby Atmos in the form of AC-4 has established itself with a large marketing budget and is spreading from films to music, podcasts and gaming. The alternative is MPEG-H, made in Germany, which is particularly suitable for live streaming in the broadcast sector. Dolby Atmos Music’s competitor, Sony 360 Reality Audio, is an adaptation of MPEG-H that is supposed to give the Sony Music label in particular a boost. Both formats can even be found on Amazon Music, although Dolby already has around four times as many tracks, so that battle seems decided.
One format that failed to establish itself for over 30 years, but is currently experiencing a renaissance, is Ambisonics. This sound-field-based format has audio channels too, but instead of loudspeakers they map spatial axes. It all sounds a little unusual, and via loudspeakers it only has a very small sweet spot. With headphones, however, this disadvantage disappears, because you carry the perfect playback position directly on your ears. The format can also easily be rotated around the X, Y and Z axes, which is why it has established itself mainly for 360° videos – and thus takes us into the world of three degrees of freedom.
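That easy rotation is worth a quick sketch: in first-order Ambisonics, turning the whole sound field around the vertical axis only mixes the X and Y channels. Sign and channel-order conventions differ between AmbiX and FuMa, so treat the signs below as an assumption and check your toolchain.

```python
import numpy as np

def rotate_foa_yaw(w, x, y, z, yaw_rad):
    """Rotate a first-order Ambisonics scene around the vertical axis.

    Only X and Y mix; W (omni) and Z (height) are untouched. This is why
    head tracking is so cheap with Ambisonics: one small matrix per head pose.
    """
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    x_r = c * x + s * y
    y_r = -s * x + c * y
    return w, x_r, y_r, z
```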
New audio freedoms in three degrees
With advances in technology, new 3D audio techniques are opening up new possibilities for media production. 3D audio with three degrees of freedom (3DoF) is one such concept, enabling dynamic, immersive sound experiences that benefit from the fact that you’re not just staring straight ahead.
Below we look at the pros and cons of head-tracked headphones, applications where the technique works well, and the formats available for bringing it into a production workflow.
360° videos from a sound perspective
Probably the best-known representatives of this genre are 360° videos. These spherical moving images triggered a real hype half a decade ago, when you could suddenly watch such videos on the largest video platforms, YouTube and Facebook. Strictly speaking, there are several types of user experience:
- On desktop devices, using the mouse to turn your gaze while still looking straight ahead at the screen.
- On smartphones, holding the device in front of your nose – not turning your head, but rotating your whole body on its own axis.
- Head-mounted displays (HMDs) are the top class, because the image really does adapt to your head movement in real time.
Such 360° videos are also available as stereoscopic videos – 3D videos, if you like – where each eye gets its own 360° panorama, creating a more vivid image in the brain. And of course there are also 360° videos with 3D sound. In this context, I like to call it 360° sound, because you immediately understand that the sound is spherical, just like the image.
Audio head tracking
If you remove the image component but still want the sound to react to head movements, you end up in the world of audio head tracking. Apple is already building this technology into ALL AirPods, from the entry-level AirPods to the AirPods Pro and, of course, the AirPods Max. However, the use cases are currently limited.
Technically, you can now listen to Dolby Atmos Music tracks with head tracking via Apple Music, but, as I said, these tracks were never mixed with the intention of “looking around” inside the music. What’s more, Dolby metadata is even bypassed so that this feature can be activated at all. As a result, Dolby Atmos Music tracks sound different on Amazon than on Apple Music – not exactly what you want to hear as an audio engineer. But Apple needs content to be able to use its technology as a selling point.

3D sound is more than entertainment
But there are applications where 3D audio and head tracking actually solve problems and are not just a fun factor. The medium is also becoming increasingly relevant for communication: MS Teams has added a “spatial audio” feature to its video calls. The good thing is that you don’t need any additional hardware – simple headphones are enough, and the microphone signals are automatically spatialised in the cloud for the other participants, even without head tracking.
In a video conference with several people, things can quickly become chaotic, because the current mono approach has trouble keeping the different voices apart: our brain struggles to differentiate voices that all come from the same direction. 3D sound makes the conversation natural and measurably easier to follow, because it takes strain off our brain – think of the cocktail party effect. Spatialising the voices in 3D makes them much easier to tell apart, and noise becomes less noticeable.
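Here is a sketch of the core idea, with a made-up helper that simply spreads the participants across an arc in front of the listener; each voice would then be binauralised at its own angle (as in the HRIR sketch further up).

```python
import numpy as np

def seat_participants(n: int, spread_deg: float = 120.0) -> list[float]:
    """Spread n conference voices across an arc in front of the listener.

    Giving each voice its own direction is what re-enables the cocktail
    party effect: the brain can use spatial cues to separate the talkers.
    """
    if n == 1:
        return [0.0]
    return list(np.linspace(-spread_deg / 2, spread_deg / 2, n))

# e.g. 4 callers -> azimuths [-60.0, -20.0, 20.0, 60.0] degrees.
print(seat_participants(4))
```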
In-head localisation is being fought – wrongly
Films can also be watched with head tracking, but there is a fundamental problem: the usual approach is to make the entire soundtrack 3D. That means even elements such as narration or background music become part of the scene, where they don’t belong. A very brief note on diegesis, which 360° videos illustrate best: everything you can see should also be three-dimensional in sound, because you are inside the scene (diegetic). But a narrator you cannot see (non-diegetic) shouldn’t be – otherwise you hear a ghost coming from somewhere and wonder who is talking to you. If that voice is played in mono instead, it doesn’t change with your head movement, and you immediately understand that someone is talking to you who isn’t part of the scene.
In other words, in the world of three degrees of freedom, not everything should be 3D audio just to make it feel like it’s coming “from outside”. What matters is the ability to combine the soundtrack with mono or stereo signals on headphones, so that listeners understand which sound belongs to which narrative level. This is impossible with loudspeakers, because they are always perceived as external – whereas with headphones you can, and should, make deliberate use of in-head localisation.
Confused? Then let’s go round in circles again
Apple does a good job of working out whether I’m listening to stereo or multi-channel content. That’s all the more interesting when you know that Bose burnt its fingers on this very topic years ago. In Cupertino, however, they believe in the technology and have built it into every pair of in-house AirPods. This also helps with the market launch of the Apple Vision Pro, but more on that later. For pure audio playback, the distinction is not just between those two options but a total of five.
Input: Stereo.
You hear the sound as normal stereo – nothing special at first. If you don’t like this in-head localisation, you can activate “Spatialize Stereo” (stereo to 3D audio), which adds a reverb algorithm to the signal to make it sound more “natural”.
On top of this spatialisation, you can also activate head tracking, which makes the signal sound as if you were listening to two speakers standing in the room (3DoF).
Input: Multi-channel
This is mostly Dolby Atmos from film platforms such as Disney+ or music streaming such as Apple Music. The multi-channel audio is automatically converted to binaural stereo so that the sound appears to happen around you. This setting makes the most sense if you’re moving around and don’t want the sound to shift every time you move your head.
You can also activate head tracking, which is particularly useful for films where you have a visual reference point, or if you want to distinguish sounds in front from those behind (always difficult with binaural audio). I would call this level 3DoF.
Freedom comes with pitfalls – advantages of head tracking
A big advantage of this technology for 360° videos is that you can now hear when something is happening behind you, for example – something you can’t see, because our field of view is limited, while our hearing always covers the full 360°. I have often seen VR experiences that plaster arrows over their beautifully designed visual scenes so that people look in the right direction at the right moment. Cleverly placed 3D sound solves this problem intuitively.
It also solves one of the biggest obstacles of binaural audio: you often get the feeling of spaciousness, of sound happening around you, but you can rarely distinguish front from back. If you can turn your head even slightly, though, you immediately understand which sound is where. Head tracking doesn’t mean you have to spin through 360° – but you can. It also resolves the aforementioned problem of narrative levels. With films, you no longer have to wonder whether an element is diegetic or not: when we listen to epic music in the cinema, we don’t ask ourselves “where is Hans Zimmer sitting right now?”. But because this clear separation now exists, you have to question how you actually use voice-over and music. In most cases, a scene with well-designed sound effects works better than desperately trying to keep people entertained with music and speech. The brain is usually well occupied with the 360° images anyway, so three layers of sound (speech, effects, music) are more likely to cause confusion.
Disadvantages of head tracking
As mentioned, the whole thing is not so easy to implement with loudspeakers. In theory, you can project 360° videos in planetariums, for example, and surround the audience with loudspeakers for the sound. But the narrator’s voice still somehow comes from one direction. What works well in the cinema suddenly becomes a problem with spherical videos, and a series of workarounds and compromises becomes necessary.
Unfortunately, as a listener you rarely know whether a mix is meant to be heard with head tracking or not. There are mixes that really fall apart once you are able to turn your head, while other productions don’t even make sense with head tracking deactivated.
Admittedly, I always talk so cleverly about what you should and shouldn’t do. The reality is that there is usually neither the time, the money nor the knowledge to make an immersive media production really good. If anything, an Ambisonics microphone is set up and the result labelled immersive audio – only for the whole thing to be mixed in stereo in the end for budget reasons, “because nobody can hear it anyway”. Listeners encountering immersive media for the first time may not scrutinise the sound. But the more points of contact they have, the greater the appetite for good sound becomes. All the top podcasts are now produced in a studio, even if they started out as a hobby – there must be some quality-related reason for that 😉
Formats for audio head tracking
I have already mentioned the best-known representatives: Dolby Atmos, 360 Reality Audio (based on MPEG-H) and Ambisonics. However, these are all 3D formats that were not primarily developed for audio head tracking.
That’s why I don’t want to go into the technical details here, but rather briefly explain why Apple is once again showing a very good approach here. Even if Dolby likes to describe its format as future-proof, at some point it will reach its limits.
As mentioned, in the world of degrees of freedom it’s not just 3D sounds that matter, but precisely those non-diegetic sounds that are not part of the scene yet make sense in 0D – music and voice-overs. Of course there is more than black-and-white thinking, i.e. 0D versus 3D; there has to be something in between, often referred to as a bed. Apple calls the three layers:
- 3D Audio Objects
- Ambience Bed
- Head-locked audio
The 3 layers for immersive soundtracks
We have already looked at 3D audio objects: the objects I can place in the room. You usually have the option of setting distance parameters or the size of an object so that it doesn’t stick out of the scene. Take a single cheering fan in a stadium as an example. I’d like a reverb that gives you the feeling of being in the same place. But if I simply put a reverb on the object, the reverb will only come from that one corner – yet sound spreads in all directions, so the reverb should surround us. In theory I could send the reverb to our head-locked audio track, but then the reverb would no longer be 3D.
This is where the aforementioned bed comes into play: all signals that should be spatial but may be diffuse can be sent here. So if you have not just one fan but hundreds, you would otherwise have to fill a hundred audio tracks with objects. Instead, you simply send the group to the bed and need only a fraction of the tracks.
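Put as toy Python, the three layers are just three summed signal paths. The “binauralisation” below is only a crude azimuth pan so the sketch stays self-contained; a real renderer would use HRTFs (see the HRIR sketch further up).

```python
import numpy as np

def render_block(objects, bed_lr, head_locked_lr, head_yaw):
    """Sum the three layers into one binaural block (toy sketch).

    objects:        list of (mono_block, azimuth_rad) -- head-tracked
    bed_lr:         (2, n) ambience bed -- diffuse crowd, rain, room tone
    head_locked_lr: (2, n) voice-over/music -- glued to the ears, no tracking
    """
    out = np.zeros_like(bed_lr)
    for mono, azimuth in objects:
        rel = azimuth - head_yaw                    # direction relative to head
        out[0] += 0.5 * (1.0 - np.sin(rel)) * mono  # crude left weighting
        out[1] += 0.5 * (1.0 + np.sin(rel)) * mono  # crude right weighting
    out += bed_lr           # spatial but diffuse
    out += head_locked_lr   # ignores head_yaw entirely
    return out
```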
What do the other formats do differently?
Dolby Atmos, for example, works with a channel-based 7.1.2 bed, and you can add up to 128 mono objects. However, it doesn’t actually have a head-locked stereo track, because the format is based on loudspeakers – so for me it is not suitable for podcasts. In principle, the Dolby Atmos renderer offers the option of marking an audio object as “disable binauralisation”, meaning it is not played back spatially. However, if you activate head tracking, the Apple renderer bypasses this metadata and only reads out where the object sits in the scene. And since no Dolby Atmos mix was ever made with head tracking in mind, such mixes rarely exploit the advantages of the technology.
Ambisonics, on the other hand, has 4, 9, 16 or more channels, depending on the order. So it has a bed, and I can even work with head-locked audio, but it has no objects – which is why the sound is always a bit diffuse, unless I spend a lot of audio channels to approach the resolution of object-based formats. Strictly speaking it supports head-locked audio only in mono, not stereo. An optional head-locked stereo track is, however, standard with Facebook 360 and YouTube VR: an Ambisonics file is delivered, which rotates with the viewing direction, plus, if required, an additional stereo file that always plays the same way, no matter where you look in the 360° video. This gives you the best of both worlds and a good compromise between resolution and quality.
6DoF – anything but dumb
Let’s move on to the premier class: six degrees of freedom. Here you not only have the three possible rotations around the X, Y and Z axes, but also three translations along them. Put less cleverly: you can also move towards or away from a sound, making it louder or quieter, for example.
All the applications discussed so far assumed that the listener is at the centre of the action, in the so-called sweet spot – the optimum listening position. Now we can suddenly move away from this point, and you can already guess that this makes sound design even more complex: there must be no acoustic holes in the 3D scene, and you shouldn’t be distracted by too many sound sources either.

Games (more than a gimmick)
Here you inevitably end up in game engines, which is why games are the best-known representatives of this category. But not every game uses 3D audio – once again, the genre question is a legitimate one.
A 2D game needs sounds from behind/above/below just as little as a strategy game in which I look down on my units from above like a god. Left/right is perfectly adequate there. 3D audio could be used at most in the ambience, similar to films, so that you feel more like part of the scene. “Enveloped by sound” is the keyword again.
All games that take place in 3D worlds, especially first-person games, benefit from spatial sound. Fans of shooters have long known that being able to hear the enemies behind you before you get them in front of your virtual “weapon” can make all the difference in the game. That’s why gaming headsets are popular in this respect, so that your ears can tell your eyes where to look as well as possible.
However, such surround headsets are usually not even necessary for spatial hearing – remember, our brain can do it with just two cleverly rendered audio channels. In most cases, the game automatically detects whether you are using speakers or headphones and renders the sound accordingly. Still, such headsets may be more closely tailored to the software and, above all, enable communication. The PlayStation 5, for example, advertises its Tempest 3D audio engine and its own Pulse 3D headset, which are very well matched to each other. Recently also in combination with Dolby Atmos, in order to drive multi-channel soundbars.

AR (augmented reality audio)
Pokémon GO is often cited as best practice for augmented reality applications – even though I like to say that the game didn’t go viral because it was AR, but because it was Pokémon with a great multiplayer character. AR usually relies on smartphones’ built-in cameras; increasingly, LiDAR scanners are used as well, enabling even more precise tracking.
AR glasses have not really become socially acceptable yet. Magic Leap or HoloLens cost several thousand euros and are at best slowly establishing themselves in industry. Google Glass was way ahead of its time, and the features you would actually want from such a device are correspondingly limited. That’s why augmented reality is currently more exciting from an auditory perspective: the technology there is already very mature and, as mentioned, many headphones now have at least head tracking built in. Combined with a smartphone, experiences with six degrees of freedom are possible. Applications could include audio guides in museums, where paintings or statues come to life through sound and tell their life stories, for example.
VR (sound for virtual reality)
When it comes to VR applications, most people also think of gaming first. This is supported by the fact that such projects are almost exclusively developed with game engines such as Unity or Unreal. However, the VR bubble is far more diverse than you might think. Training and simulation in particular are thriving in the B2B sector without consumers noticing. The possibilities are virtually unlimited and, alongside the rollercoaster games the medium is mostly known for, real use cases are establishing themselves that offer added value – such as saving time and money when training employees. Nevertheless, it is not easy to transfer games, apps and the like from 2D screens to VR: the user experience with HMDs and controllers is simply fundamentally different. Too often, VR merely attempts to replicate reality and ends up a poor digital copy. The immersive medium really comes into its own when you do things you can’t do in real life.
A prime example is “Notes on Blindness” (is.gd/notes_on_blindness), a VR experience in which you slip into the role of someone slowly going blind. Automatically, even non-audio people pay much more attention to the subtle nuances of the sound. In the vast majority of cases, though, sound is neglected by developers for lack of knowledge and time, which is why most apps sold as immersive experiences sound rather sterile. We are a long way from AAA budgets and need to give this young medium some time.

Spatial computing meets spatial audio
Let’s not be confused by Apple’s new term. For me, spatial computing is the same as XR (eXtended Reality): you are in virtual reality, extending your reality, or somewhere in between. The boundaries are no longer easy to draw when even VR headsets have cameras looking out into reality. If I see reality through an HMD, is that VR or AR?
Apple doesn’t make the confusion any easier, because they didn’t want to use terms already occupied by other companies – virtual reality and the metaverse by Meta, or mixed reality by Microsoft. In the long term, the result will be that we won’t have one VR device and one AR device, but a single device that can depict all realities. I won’t comment on how AI and other buzzwords such as blockchain will play into this 😉
For me as a sound engineer, however, it is important to separate which sound belongs to which reality. In VR, I want to isolate myself, so the sound should match what the display tells me. In AR, on the other hand, I want the sound to behave as if something were happening in my living room, where I am right now. Here, too, Apple is closer to a solution than other companies: the Apple Vision Pro, for example, introduced audio ray tracing, which detects the geometry of our surroundings and renders the sound accordingly.
In addition, Apple has already built a good infrastructure for 3D audio with the AirPods. If you want to optimise the sound further, you take pictures of your ears, which generates a personalised HRTF – our hearing as a 3D model, so to speak. This allows an iPhone, for example, to tune the sound even more precisely for us, and the distinction of whether a sound is with us in VR or AR becomes even clearer.
Game audio – playful or gambled away?
Gamers know how important it is not only to see your opponents in time, but to hear them first. For this reason, many people gladly spend good money on expensive headsets that supposedly give them an advantage in the game. The sound design in AAA games is also well budgeted and correspondingly elaborate: a short beep can be enough for the player to know immediately what is happening in the scene.
Communication and collaboration with 3D audio and six degrees of freedom are crucial for the feeling of presence in social VR. I recently had the opportunity to try social VR myself and was surprised at how long I stayed in there at a stretch. Even though the room was virtual, I afterwards felt as if the other person had actually been in the same room as me. The cognitive load on our brain is lower because the sound doesn’t come from just one direction, as in video calls; instead, a natural conversation situation is depicted.
But augmented audio can also make our everyday lives easier when we’re on the move. Everyone knows the problem of emerging from the underground while using a smartphone map service: you arrive somewhere at the surface but have no idea which way to walk, because the sat nav is confused about which direction you are facing – GPS is simply too imprecise. But with a second reference system – headphones that know the direction your head is facing – you could simply hear a voice coming from the direction you need to move in.
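The maths behind that guide voice is pleasantly small. A sketch, assuming compass-style bearings from the map service and a heading from the headphones’ tracker:

```python
def voice_azimuth(target_bearing_deg: float, head_heading_deg: float) -> float:
    """Direction the guide voice should come from, relative to the head.

    Compass bearings: 0 = north, 90 = east. The headphones' head tracker
    supplies the heading; GPS/maps supply the bearing to the next waypoint.
    """
    rel = (target_bearing_deg - head_heading_deg) % 360.0
    return rel - 360.0 if rel > 180.0 else rel   # normalise to (-180, 180]

# Facing north (0), waypoint to the west (270) -> voice at -90 (hard left).
assert voice_azimuth(270.0, 0.0) == -90.0
```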
Game over with these problems?
We humans have been used to hearing in three dimensions since birth. Now we can finally approximate this impression naturally. It all sounds very simple, but getting from stereo back to the original requires an extremely large number of parameters. Unfortunately, it is not enough to load an audio object into a game engine and tick a “3D audio” box. Here is a brief overview of what is required:
Before you insert an audio clip into a game engine, you should ask yourself a few questions. Where did the sound come from, and does it fit with the other sounds from your library, or with recordings you may have processed with EQ? What is the purpose of the sound – will it run in the background, or will it be triggered by special actions in the game? The craft of game audio has evolved considerably in recent years; without this thought, the result is usually a pile of loud sounds in a 3D world that never really work together.
Since you not only give sounds parameters describing where they are in the environment, but also move through the world as a character, there are a variety of parameters you can give your audio object. As already mentioned, a distinction is made between mono and 3D sound, and there is the size of the sound source. Another important parameter is the attenuation curve, which determines how quickly and how strongly the volume decreases with distance. Focus parameters and the factors for air absorption and occlusion let you further determine how the sound propagates through the room.
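As an example of such an attenuation curve, here is the classic inverse-distance-clamped model – the shape OpenAL and many engines expose as an editable curve, with parameter names of my own choosing:

```python
import numpy as np

def attenuation(distance, ref_dist=1.0, max_dist=50.0, rolloff=1.0):
    """Inverse-distance-clamped attenuation.

    Below ref_dist the gain stays at 1.0 (no comical loudness jumps when the
    player touches the source); above max_dist it stops changing. rolloff
    controls how quickly the level falls off in between.
    """
    d = np.clip(distance, ref_dist, max_dist)
    return ref_dist / (ref_dist + rolloff * (d - ref_dist))
```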
So far so good, but subjectively the sound still sticks very close to your face. It has a certain distance, but our brain doesn’t yet know what kind of room we are in – at the moment it’s an abstract sound source in an empty void. So it’s time for reverb. Important considerations for realistic 3D reverb are the size of the room and the material of the walls. Here, too, the calculation is usually only approximate: you would really need real-time ray tracing to simulate early reflections and reverberation realistically. With the right combination of the parameters mentioned above, though, you can get quite close without needing a render farm.
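You can get a feel for the “room size and wall material” part without any ray tracing, for instance with Sabine’s classic reverberation formula – a rough estimate, not what an engine actually computes, but it shows how strongly the material changes the tail:

```python
def rt60_sabine(volume_m3: float, surfaces: list[tuple[float, float]]) -> float:
    """Sabine's reverberation-time estimate: RT60 = 0.161 * V / A,
    with A = sum of (surface area in m^2 * absorption coefficient)."""
    absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / absorption

# A 5 x 4 x 3 m room: bare concrete (alpha ~0.02) rings far longer than
# the same room hung with heavy curtains (alpha ~0.5).
room = 5 * 4 * 3
surfaces = 2 * (5 * 3 + 4 * 3) + 2 * (5 * 4)   # four walls + floor + ceiling
print(rt60_sabine(room, [(surfaces, 0.02)]))   # ~5.1 s, very live
print(rt60_sabine(room, [(surfaces, 0.5)]))    # ~0.2 s, very dry
```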

If only the software weren’t so hardware..
Hard… you know? Anyway, in this context you usually come across the term “game audio”, which deals with designing interactive audio content. The three new degrees of freedom create two new problems: you don’t know exactly when the player will actually be where.
That’s why audio production doesn’t result in one long multi-channel audio file. Instead, many small assets are delivered. These can be loops – a forest atmosphere, for example, repeating in the background until we leave the forest. The second category is trigger sounds: if I hit a tree with an axe, the “clonk” has to come at exactly the right moment.
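Those two asset categories map neatly onto two tiny classes. A sketch with hypothetical file names; the variation pool for the trigger is a common trick so repeated hits don’t sound mechanical:

```python
import random

class LoopAsset:
    """Background atmosphere: loops seamlessly until the scene changes."""
    def __init__(self, clip: str):
        self.clip = clip

class TriggerAsset:
    """One-shot effect: fired at the exact moment of the interaction."""
    def __init__(self, variations: list[str]):
        self.variations = variations

    def fire(self) -> str:
        return random.choice(self.variations)

forest = LoopAsset("forest_atmo_loop.wav")     # plays while we are in the forest
axe = TriggerAsset(["axe_hit_01.wav", "axe_hit_02.wav", "axe_hit_03.wav"])
print(axe.fire())  # the engine would play this clip on the frame of impact
```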
The only audio format still in development that could depict all of this is Fraunhofer’s MPEG-I – but that will take a few more years. Until then, 6DoF audio mostly lives in game engines such as Unity or Unreal Engine. The former ships with only very rudimentary 3D audio features; with Epic Games’ engine you can go further. Nevertheless, both platforms quickly reach their limits, which is why it is common to integrate middleware into your project. Audiokinetic Wwise and FMOD are popular programs for this and usually provide everything you need – and if not, you can always write your own scripts. Easier said than done, because here the sound designer has to become more of a developer, with endless possibilities but also endless complexity.
Conclusion on the large 3D Audio Matrix
To summarise: 6DoF makes it possible to move freely in space and changes the way we experience sound – initially in a playful way. Even if games have somehow been doing this for decades, the added value goes far beyond entertainment. That’s also why it makes no sense to me to call 3D audio the stereo killer, as Dolby likes to claim. I start with what wouldn’t have worked in stereo, true to the motto “sound first”. In the sound community – the VDT (Association of German Sound Engineers) and the AES (Audio Engineering Society) – there is a lot of talk about “immersive audio”, but it’s mostly just about 3D music, and, it feels like, about nerdy details that users don’t understand anyway.
Audio professionals have to score the scene so that it sounds neither empty nor overloaded. Developers suddenly have to deal with very complex audio parameters, more often on their own than I would like. The bigger the project, though, the more budget there is for the respective specialists, and games can certainly be taken as a prime example of creativity and technical realisation. It’s an exciting time for the audio world, because 3D audio really is celebrating one breakthrough after another in a wide variety of areas. So if you have a project you need help with, or would like to learn more about this field with my video course, just get in touch!