AI that produces sound through analysis of a source video is impressive. Fooling humans is not. Since most of us have grown up on a steady diet of film and television, many of the sounds in our memories are the work of foley artists who add sound effects to sequences in post. The sound of horse hooves on cobblestones is likely created with a percussive technique involving no equine participation. The sound of a person being punched may be a large piece of meat being struck with a club. Similarly, crunching snow is likely not the sound of anyone walking through actual snow.
Our perception of sound in a video/film source is already deeply skewed, so the notion that this AI amounts to a Turing test of sorts is a weak analogy.
You're right, but as someone who's done a lot of sound editing/foley work, I can't help having mixed feelings about seeing yet another job skill automated away. Good part: in a few years this will be good enough for commercial use, which will save sound editors all sorts of tedious, dull work and free them up to do more exciting creative stuff. Bad part: the tedious, dull work was also what paid the bills. The easier it is to do that stuff automatically, the less people are willing to pay for good quality work.
Rather than now being able to make a living doing the fun, really creative stuff, like inventing new sounds for teleportation devices or dramatic natural phenomena, editors are more likely to be asked to work for free on the theory that they'll gain great exposure for their creativity. That's generally a very bad bargain. If past trends in the electronic dance music market are anything to go by, increasing automation will not reward true creative talent; it will just lead to an arms race to own the latest sound libraries, synthesizers, etc., and to be first to market with big splashy new sounds that offer superficial novelty.
The ability to provide high-value equipment below normal rental cost frequently trumps considerations of talent in the film industry. Similarly, there are plenty of crappy directors of photography out there who get hired regularly because they own a pile of nice lenses and related camera equipment, and hiring them plus their camera package looks economically attractive on paper because it's hard to quantify photographic talent.
I have so much respect for foley artists. Artist being the operative word. People don't appreciate the hard work and creativity that goes into making the perfect sound.
As someone who's worked in the film industry (visual fx & CG), I'm subject to the same problem, but all tedious jobs from the Industrial Revolution on have been automated away one by one. I can understand the lament for something you worked hard on, and this isn't to take that away from you, but most job skills do actually have less value in the market over time, right? The other way to look at it is that what counts as good quality work changes and improves continually over time; higher and higher quality becomes available for the same price. Jobs are continually being reinvented, and people always get to work on the interesting parts that can't be automated. Something that took many people to do one decade only takes one person the next decade. This has been true for hundreds of years, from farmers to accountants to cooks to car makers... This "problem" is here to stay, our economy hasn't crashed yet, and there are as many creative people as ever.
the tedious dull work was also what paid the bills. The easier it is to do that stuff automatically, the less people are willing to pay for good quality work.
"Vegetable Violence is an organic sound effects library for creating your own orchestrated sonic mayhem. Vegetable rips, tears, squelches, hits, punches, stabs all recorded & mastered at 96kHz for stomach churning realism, this component library of gore sound effects is available for immediate download."
Watch Dario Argento's films - Profondo Rosso, Suspiria, Inferno, Tenebrae. Those should all be quite easy to get hold of. Then start in on this list... https://en.wikipedia.org/wiki/Giallo
Suspiria in particular sticks in the mind. Great music, saturated colours, properly horrific horror. It's a bit more 'on-screen' than BSS, btw.
AI that produces sound through analysis of a source video is impressive. Fooling humans is not. Since most of us have grown up on a steady diet of film and television, many of the sounds in our memories are the work of foley artists who add sound effects to sequences in post.
Right on! The fact that most audiences seem to expect a shotgun racking sound in a scene with a shotgun that doesn't even have that mechanism, or that drawing a katana is so often accompanied by a metallic "shing" and rattling sound -- these indicate the degree to which large swathes of people are drastically disconnected from an immediate and physically connected sense of how sound relates to the world around them.
I think this is also related to the degree to which I find many people are unaware of the kinesthetic feeling of how beats are emphasized, and how this changes the feel of music. The most vibrant intelligence involves a connection to the world in realtime. You can hear how machine parts interrelate just as much as you can see them. (You can even smell how they interrelate!) This disconnection even seems to be directly correlated with a loss of self awareness and flexibility in problem solving. It's like we're raising generations of brains over-trained on the simplistic and highly abstracted world of media tropes and vastly under-prepared for the messy complexity of the natural physical world.
I'd suspect that there is a wide range of what we're willing to accept in a given situation, reflecting our incomplete model of how sound works. However, this isn't the same as us accepting any substitute sound; the TV tropes persist because their absence feels awkward. An AI that correctly mimics TV-acceptable sounds is just about as impressive (though this isn't on our list of 'hardest problems', for sure).
I'd suspect that there is a wide range of what we're willing to accept in a given situation, reflecting our incomplete model of how sound works.
Given what I was talking about, it's largely a matter of people accepting symbols or tokens of things in lieu of perceiving the actual thing. It's a form of ignorance that masquerades as culture or "sophistication." (It is the former, but it's not the latter.)
The vocabulary of sound. There is also an equivalent vocabulary of vision.
That's one reason films from the 40s seem so different to today's. I suspect a cinema goer from then would have some trouble keeping up with the narrative of a 21st-century movie.
If you see very high definition projections/videos of '40s movies, you'll find that there was sometimes an incredible humanity that came across from the actors then. The cinematography could be incredible at this. I bet a lot of modern audiences would see such a thing and be like, "Aw, man, where are the explosions!?"
I don't see why that matters. Even though the human baseline comes from what is essentially a virtual reality rather than actual reality, it's still quite a challenge to generalize sounds from associated images. The fact that the AI is an artificial foley artist rather than a model of the real world doesn't make it any less impressive to me.
More importantly, what are all of the foley artists of the world thinking right now? "Oh dear god no" would be my first guess. Suddenly the prospect of needing to beat a computer at your job is rearing its head.
Nah, if this "AI" becomes useful, then sound artists will just be expected to use this new tool. AI has taken over "tweening" (which is why 3D animators exist: they specify model movements in such a way that the computer can automatically fill in the annoying-to-do in-between frames, via physics simulations or otherwise).
But the new tools have only made 3D animation more popular, leading to even more artists and more 3D-animated content, and to bigger productions (e.g., Big Hero 6 used a lot of AI for the city, and The Lion King used flocking AI to animate all the wildebeest in the stampede scene).
AIs don't always destroy jobs; they sometimes create them. They replace the jobs no one wants to do (who wants to animate 500+ wildebeest running down a cliff? Nah, let's have the AI do that), letting the artists focus on more meaningful tasks and leading to higher production quality.
What will people do with this tool exactly, once it's a mature tool? It doesn't sound like it will require much in the way of human guidance at some point down the line.
Oh come on. Amplitude, envelopes, equalizing, balancing the audio.
If this were a professional production that needed to be matched up with two voice tracks and background music, the sound designer would use the AI to create the sound for the background events... but would still need to balance the various audio tracks so that the audience knows what to focus on.
The abstract subway sound in the background may be chosen by an AI rather than a human. But the human will still need to determine the amplitude of the various voice tracks and the background music. It's not like these films make themselves.
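For the curious, "balancing the tracks" can be sketched computationally as weighted mixing of stems. A minimal pure-Python illustration (the track contents, names, and gain values are all made up for illustration; real sound designers work in a DAW, not like this):

```python
import math

SAMPLE_RATE = 48_000  # Hz, a common film/video audio rate

def mix(tracks, gains):
    """Mix equal-length mono tracks (lists of floats) with per-track gains."""
    mixed = [sum(g * tr[i] for tr, g in zip(tracks, gains))
             for i in range(len(tracks[0]))]
    # Prevent clipping: normalize only if the mix exceeds full scale.
    peak = max(abs(s) for s in mixed)
    return [s / peak for s in mixed] if peak > 1.0 else mixed

# Hypothetical stems: dialogue should dominate, ambience sits underneath.
n = SAMPLE_RATE  # one second of audio
dialogue = [0.8 * math.sin(2 * math.pi * 220 * i / n) for i in range(n)]  # stand-in for a voice track
ambience = [0.8 * math.sin(2 * math.pi * 60 * i / n) for i in range(n)]   # stand-in for subway rumble
final = mix([dialogue, ambience], gains=[1.0, 0.25])  # voice kept 4x louder than the rumble
```

The artistic decision lives entirely in those gain numbers, which is exactly the part the AI doesn't make for you.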
Even IF somehow an AI became good enough to make all those decisions (and most of those decisions are more "style" and "art" than hard-and-fast rules)... the video editor still needs to choose the cuts, the order of the scenes, and more.
No jobs will be at risk by this tool. If successful, it'd only become one more tool in the MASSIVE toolbox that video editors / sound designers are expected to master.
-----------------------
Anyway, "tweening" AI completely eradicated one form of work for cartoonists. Humans aren't doing "tweening" work anymore. Big studios are making 3D productions where the software can "tween" everything for you. Even 2D anime uses 3D animation techniques to cut down on the work and to leverage the AI.
It takes no work to command the AI to "tween" frames. But picking the right algorithm, deciding when to use the "smear" animation style (stylized tweening), or changing algorithms to switch things up for the audience?
Yeah, those things will always just be straight up work.
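For anyone unfamiliar, "tweening" at its core is just interpolating between keyframes; the work that remains is choosing the easing. A minimal sketch (the keyframe values and the quadratic ease are illustrative assumptions, not any studio's actual pipeline):

```python
def lerp(a, b, t):
    """Linear interpolation between keyframe values a and b, with t in [0, 1]."""
    return a + (b - a) * t

def tween(key_start, key_end, n_frames, ease=lambda t: t):
    """Generate the in-between frames an animator would otherwise draw by hand."""
    return [lerp(key_start, key_end, ease(i / (n_frames - 1)))
            for i in range(n_frames)]

# Two keyframes for an x-position, five frames of automatic in-betweening.
frames = tween(0.0, 100.0, 5)                          # → [0.0, 25.0, 50.0, 75.0, 100.0]
eased = tween(0.0, 100.0, 5, ease=lambda t: t * t)     # slow-in: [0.0, 6.25, 25.0, 56.25, 100.0]
```

The machine fills in the frames; the human still picks the easing curve, and that choice is the "switch things up for the audience" part.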
You are looking at this from such a simplistic perspective.
We are not at the limits of what machine learning can do; we are barely at the beginning.
Tweening, or "making sounds that can fool humans", is not what's important here; it's the underlying "mechanics" that allow them to do these things, which can be applied to so many other areas.
What makes humans extraordinary is that we can combine our various mechanical and intellectual abilities to adapt to our surroundings. What we are witnessing is another "species" that can do this, and it is only at the beginning.
I spent two years of my life writing an automated logic system for a professor. Trust me, I know what AI can and can't do.
In my experience, when AI reaches the critical mass of usefulness, it becomes a tool within the industry. Automatic solving of logical puzzles? Yeah, electrical engineers use PSpice to optimally lay out logic gates in CPU Chips.
Automated logic can be used to verify extremely complicated mathematical proofs, or even search for new mathematical truths! So what happens? Well... some company creates a product with the AI, then sells it as a tool.
We exist in an age where AIs are responsible for searching and coalescing information. (Erm, how often do you use Google's database?) It wasn't very long ago that search was considered an AI problem... but as soon as computers did it better than humans, it became a "tool" and "not AI" anymore.
The last 50 years of AI history have taught me one thing: when AIs are successful, humans change and stop thinking the task was "intelligent". Chess as a measure of intelligence? No longer, once chess AIs got good.
Database search? No longer, now that Google is faster than humans.
Automated driving? Was an AI task; now it isn't one. People are already discussing how it's a tool for truckers or Uber to use to make more money.
-------------
"Intelligent tasks" become "tasks for tools" because that's how stuff sells on the market. You wouldn't believe what people thought was "intelligence" in the '80s: chess, database search / natural language processing, automated logic, symbolic mathematical solvers, chip layouts, compiler optimizations... everything we just take for granted today.
Similarly, the tasks we consider "intelligence" today will simply turn into tools for the next generation once the AIs are written that solve that problem.
You are IMO making the same mistake Searle made in his "Chinese Room" argument.
Your digestive system is a tool for you to get rid of garbage your system does not need, your neurons are tools for allowing you to ultimately think, your legs are tools for allowing you to move around.
It's the entire system that's relevant here, not any of the individual subparts.
And we are not talking about what it can or can't do, but about its potential.
You brush this off with "humans will always find a way", which is what I am objecting to.
Whether you spend 2 or 20 years writing automated logic systems for a professor is unimportant.
As a sound editor with two decades of first doing it for fun and then as a career, I don't think that balancing the tracks is uniquely human and immune from automation.
This isn't going to lead to some new golden age of well-produced soundtracks, it's just going to make big bombastic soundtracks cheaper and more common. For a few years everything is going to sound like a disaster movie. Some would argue that we've already got that problem and I can't entirely disagree.
In short: quality won't go up, prices will go down, and oversupply will result in excess.
Good! I can't abide foley. Nature documentaries are ruined by it. A tiny ant eating a leaf, accompanied by horrible sounds of plastic wrap being twisted. Why not record the actual sounds of an ant eating? It's supposed to be a documentary. And if the ant doesn't make any sound, then just leave some silence.
The answer definitely comes down to how you trained it. If you trained it by showing it tons of movies and television, you'd get a modern foley artist in a box, right? If you wanted it to do something else, you'd need a bunch of stock footage with stock audio. Both are doable, but which was done here?
I didn't know that, but I suppose even then they can claim that their artistry lies in the choices they make. If a machine can make equally viable choices from the audience's and critics' perspectives...
Everyone knows that if you fall while dying you make a Wilhelm scream. It's a clearly documented scientific phenomenon with earliest records dating back to long, long ago.
The library used by Doom is REALLY common. The sound DSBOSPIT (sound of boss demon spitting a telecube in Doom 2) in particular is so overused it's not even funny. You hear it everywhere: in budget movies when a house or plane explodes, sometimes in other video games.
The Turing test in this case might consist of feeding the algorithm a mute video of the scene from Monty Python and the Holy Grail in which coconut halves are used to simulate the sound of galloping horses.
All of your examples sound a lot like what these things sound like in real life. Yes, there's hack foley work, but the reality is that these aren't arbitrary sounds. You don't need an actual horse to get horse sounds.
I think you're shooting for some idealized-authenticity argument here that just doesn't work. I work in a tourist area where horses walk on cobblestones. Yeah, it sounds exactly like what the foley guys do with coconuts; it's uncanny. Also, in the digital age, a lot of foley work is samples of real sounds. Outside of edge cases, we don't have guys in sound booths making new sounds with gadgets and old shoes anymore.
I know everyone likes to feel clever when they identify a popular sound, but most of the time those are intentional homages, and you have to consider the millions of sounds you don't recognize. It's not all from some static library of 1930s foley artists punching meat and knocking coconuts together anymore.
Fooling humans is fairly impressive: they didn't just ask "is this a real sound?", they played the real sound and the synthesised sound and asked "which one is the real sound?"
It looks like they only used a sample of 3 people, though, which is pretty small. I imagine the parametrically synthesised sound would have fooled no one.
Also, the clips in the video fooled all _three_ of the participants tested. I couldn't find anything in the paper about sample sizes... but hopefully it was more than three...
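Back-of-the-envelope, a sample of three really is too small to conclude much: if listeners in the two-alternative test were guessing at 50/50 (the null hypothesis; the binomial model is my assumption about their setup), all three would be "fooled" by pure chance 12.5% of the time.

```python
from math import comb

def p_fool_at_least(k, n, p=0.5):
    """Probability that at least k of n listeners pick the synthesised clip
    under pure guessing (binomial tail with success probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Fooling all 3 of 3 listeners happens by chance 12.5% of the time...
print(p_fool_at_least(3, 3))    # 0.125
# ...whereas fooling all of, say, 30 listeners essentially never does.
print(p_fool_at_least(30, 30))  # ~9.3e-10
```

So if it really was three people, "fooled everyone" is weak evidence; with a few dozen listeners the same result would be decisive.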
The sound of the hamburger rain in "Cloudy with a Chance of Meatballs" was wet brown paper towels (you know, from school?) being flopped against a floor.
Our perception of sound in a video/film source is already deeply skewed, so the notion that this AI amounts to a Turing test of sorts is a weak analogy.