Voc: a physical model of the vocal tract, written in ANSI C (pbat.ch)
161 points by adamnemecek on June 12, 2017 | 46 comments


Hey! I'm the author of this thing. Let me know if you have any questions about it.

FYI, the generated C code is now part of the dev branch of Soundpipe, my music DSP library: http://pbat.ch/proj/soundpipe.html. It has also made its way into the develop branch of AudioKit: http://audiokit.io.

Also, check out the original implementation Pink Trombone: https://dood.al/pinktrombone/. It's the perfect interface for this kind of model.


Funnily enough, I also made a C90 port of the same code, and also named it "voc". Here is an oldish version:

http://ultra-premium.com/scratch/voc.zip

I would completely expect yours to be better in all ways (mine was basically a line-for-line transcription of the audio code with the UI stuff stripped out, whereas you actually went to the effort of understanding the thing being ported).


I found your code last night, and it really freaked me out that there would be another person in the world who would want to make a C port of pink trombone and call it Voc as well.

Voc is pretty much a line-for-line port of PT as well, but I removed some bits like the simplex noise. I also wrote some small utilities to go with Voc, like small plotting programs and some plugins for my audio language for both the whole source-filter model and just the filter.

Still sinking my teeth into the literature. Voice synthesis has a very rich history!


What do you mean, oldish version? It looks like it's from 2017, or is it from an old version of Pink Trombone? Do you have the latest code in source control anywhere online?


Very cool (Soundpipe). Since I was a wee lad, I've been thinking about colonizing all major open source DAW/sequencer/audio editor programs with a single processing core, to the extent that would be acceptable. I think a lot more could be done if these individual projects (like Audacity, Hydrogen, LMMS, etc.) were basically just "shells" around the audio processing pipeline.


Do it! I think it's a rewarding exercise.

Also, build a language. Mine is Sporth: http://pbat.ch/proj/sporth.html

More recently, I've been building UIs on top of Sporth called Spigot: http://pbat.ch/proj/spigot


Thanks to JACK, they're kinda already able to operate together decently.


Is there a way to make it say words?

I think Pink Trombone would make an excellent resource for learning how to make certain sounds when learning a foreign language.


In theory, I think it's doable, but you'd have to build some sort of interface for it. Right now, all that exists are the low level foundations.

The KL model does sing! Max Mathews and Bell Labs produced "Daisy Bell" using a very similar model in 1960:

http://www.cs.princeton.edu/~prc/Daisy.mp3

This was the inspiration for HAL to sing Daisy in 2001: A Space Odyssey.


I'm interested in doing the reverse actually - take a voice recording, reverse-map it to the parameters of the model, then use those parameters and a different physical vocal tract to change your voice!


This would have a large audience of voice and speech training professionals if it were available "ready to use" with no coding or compiling knowledge required. I think it's awesome!


This would get a lot of attention from the music production community if you could port it to a VST or AU plugin for use in a DAW!


Interesting use of Knuth's literate programming concept. I remember reading about it and the source code of TeX in PDF format a long time ago, but now that I have a chance to read another piece of code written in it, I'm finding it less readable because of the proportional fonts than if it were equally-commented but in a more conventional monospace programming font.

Also, some samples of making it talk would be good, like: https://en.wikipedia.org/wiki/Voder


> I'm finding it less readable because of the proportional fonts than if it were equally-commented but in a more conventional monospace programming font.

My initial motivation for using literate programming with Voc was to take advantage of TeX's math mode to express what was happening numerically in the code, as well as the ability to use BibTeX inside the code.

> Also, some samples of making it talk would be good, like: https://en.wikipedia.org/wiki/Voder

At the bottom of the page, there are music examples on Vimeo, with plots of the 44 vocal tract diameters being manipulated in realtime:

https://vimeo.com/220091107

https://vimeo.com/220091290

https://vimeo.com/220091487

My goal was really to build vocalizations, and not necessarily to produce speech. This engine is a bit more low level than that. It could be possible to build a speech engine on top of Voc though... next steps perhaps?


The font issue should be fixable with some configuration. I personally do literate programming in org mode and generate HTML files from it for my colleagues and myself. With it, code blocks output to HTML use a fixed width font. I haven't checked the tex output (low priority, the HTML output is good enough for me).


Great! Also inspired by Pink Trombone, we ported the same model to our maker platform and then to a modular synth:

https://www.youtube.com/watch?v=bo5ZEgBEapk

https://twitter.com/BelaPlatform/status/856110345332674561

https://github.com/giuliomoro/pink-trombone


Hey yeah! I actually came across that when I first set out to make my project.

I was going to actually fork off your project, but decided it would be cleaner/faster to do it off the original code since I wanted to write it in ANSI C.


Slightly off-topic sorry, but why ANSI C and not C99 (or C11)? Is it common to come across systems where C99 isn't supported, or are there reasons why you prefer C89 to C99?


There aren't any real reasons I have for choosing C89 over C99. Both tend to be very portable, which is very nice if you aren't sure what operating system you are running on (if any, in many situations). I still write many programs using the "-std=c99" flag, but I never find myself in dire need of the extensions, so it's basically honorary ANSI C. For projects like this that just do numerical processing, C89 really isn't that much more of a hassle.
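To make the difference concrete, here's a tiny illustrative function (not from Voc) written in the C89 style described above; the same file compiles unchanged under -std=c89 and -std=c99, which is most of what the portability argument comes down to:

```c
#include <stddef.h>

/* C89 style: declarations at the top of each block, block comments
   only. C99 would additionally allow "//" comments and declaring
   the loop variable inside the for statement. */
static double mean(const double *x, size_t n)
{
    size_t i;   /* C99: "for (size_t i = 0; ...)" would be legal */
    double sum; /* all declarations must precede statements in C89 */

    sum = 0.0;
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return n > 0 ? sum / (double)n : 0.0;
}
```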


I think you probably made a good choice, since our fork of the original was worked on very quickly to get a basic demo going :)


Ha. Your README was very honest at the time :)

It was very encouraging for me to see some ports to C/C++ already in progress. At the time, it was definitely an overwhelming notion. That chunk of JS code looked impenetrable to me.


Nice, are you planning on releasing a Eurorack form-factor kit for your platform? I'd imagine this can be a good alternative to other digital modules like nw2s::b and Rebel's OWL.


Hello, yes we are going to do a limited edition run, it is actually a collaboration with Rebel Tech! From our latest Kickstarter update:

"The Bela Modular breaks out the audio, analog and digital I/Os to jacks, and handles voltage scaling for Eurorack-compatible CV levels, providing a total of 2 audio in, 2 audio out, 8 analog in, 8 analog out, 4 digital in, 4 digital out and 4 LEDs over two modules, 12HP and 10HP wide.

[...]

We are planning to do a small production run of Bela Modular units later this year. Please contact us[0] directly if you think you would like one of these, or stay tuned here and on the forum.[1]"

[0] info at bela dot io

[1] http://forum.bela.io


Would this be a good target for deep learning? It's low level in some sense, but still nicely parametric. It strikes me that this could be a good synth for some neural nets to learn how to play.


Yes, speech recognition and speech generation would be easier to implement if you used neural networks that were trained on these vocal cord inputs rather than audio samples. In either case, you'd need to solve the inverse problem to generate vocal cord parameters given an audio sample. This seems difficult but I'd imagine some commercial software packages do it to some extent.


Or you could let the neural network solve the inverse problem for you, at least for speech generation.


I would imagine a neural network that fed parameter values to Voc would be far faster (real-time?) than something like WaveNet which needs to sample the output thousands of times a second.


Correct. WaveNet is a very brute force approach to speech synthesis.


It seems like it would be, right? There are 44 tract diameters you can modify to shape the vocal tract, and these can be used to generate specific vowel formants. I can imagine you could build a system using deep learning that finds the best parameters to match a steady-state periodic pitch. It's a bit like how some speech codecs work, like LPC10.
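Not deep learning, but just to make the shape of that search problem concrete: here's a naive hill-climb over a 44-element diameter vector. `tract_error` is a stand-in (a real fit would render audio through the model and compare spectra or formants); none of these names are the actual Soundpipe/Voc API.

```c
#include <stdlib.h>

#define NDIAM 44 /* number of tract diameters, as in the model */

/* Stand-in error: squared distance from a target diameter set.
 * A real system would synthesize audio and measure spectral or
 * formant distance instead; this keeps the sketch self-contained. */
static double tract_error(const double *d, const double *target)
{
    double err;
    int i;

    err = 0.0;
    for (i = 0; i < NDIAM; i++) {
        double diff = d[i] - target[i];
        err += diff * diff;
    }
    return err;
}

/* Naive hill climbing: nudge one diameter at a time, keep the
 * change only if the error drops. Returns the final error. */
static double fit_tract(double *d, const double *target, int iters)
{
    double best;
    int k;

    best = tract_error(d, target);
    for (k = 0; k < iters; k++) {
        int i = rand() % NDIAM;
        double old = d[i];
        double e;

        d[i] = old + ((double)rand() / RAND_MAX - 0.5) * 0.2;
        e = tract_error(d, target);
        if (e < best) {
            best = e;
        } else {
            d[i] = old; /* revert the nudge */
        }
    }
    return best;
}
```

A learned model would replace the random nudging with gradient steps or a trained inverse mapping, but the objective, "parameters in, acoustic error out", stays the same.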


Problem is, the output would sound pretty metallic, just like LPC10.


Maybe, maybe not. LPC10 is an 8 kHz speech codec optimized for low-bandwidth signals. The Kelly-Lochbaum model is a full-blown physical model of the tract.

What you put into the filter is important. The LF glottal pulse model used here is a pretty good excitation signal... aspiration noise REALLY makes a difference. It would still sound artificial, but it definitely wouldn't sound metallic.


This is quite cool!

On a sidenote, can this be used to train or obtain voice parameters of oneself for using it in software programs like Espeak?


Not directly, no. IIRC, programs like Espeak and Festival use formant synthesis, which would require explicit formant values. Voc models the tract itself... the main parameters are diameters in the vocal tract (which implicitly produce vowel sounds).

It may be possible to go the other way around and analytically derive parameters for Voc that match target formant frequencies. Not sure though...


There are other ways, though. For example, if features like the formants produced by the model can be differentiated with respect to the vocal tract parameters, the latter could perhaps be estimated from real data.


Here's an example of it in action https://vimeo.com/221310975


That was... very strange. I'm not sure how else you'd demonstrate a physical vocal tract model though.


This is super cool. I seem to remember a JS version of this floating around earlier this year. Does anyone have it?


It's the first link on the page


Is there something that does the opposite? Feed vocals, get jaw movements? I've got a weird little project, think "talking skeleton".


There are numerous programs that translate spoken words to mouth/jaw movements -- usually for animation (to match the 3D/2D model with the right movements for what it's supposed to say).

IIRC, Adobe Character Animator does that too.


Thanks. Unfortunately, I need something suitable for embedded hardware.



Wouldn't you also need to model tongue and mouth chamber (with teeth) and all of their movements?


Yes and no. Perceptually, you don't really need to model everything to get convincing speech sounds. Most of the realism actually comes from performance, and not the mathematical model.

In a way, the lips and mouth are accounted for here, but in a more abstract way. The KL model approximates the vocal tract as a series of cylindrical tubes with varying diameters. Segments of the tubes somewhat correspond to things like the tongue and mouth. In this model there is a really neat tongue control that manipulates these segments. It's quite expressive!

This model is a 1D waveguide, so it doesn't account for things like the curvature of the tract. More modern vocal modelling techniques implement a 2-dimensional waveguide, which does allow for that.
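The "series of tubes" idea boils down to a scattering junction between adjacent segments. Here's a minimal sketch of one Kelly-Lochbaum junction as used in Pink-Trombone-style 1D waveguides (illustrative only, not Voc's actual code):

```c
/* One Kelly-Lochbaum scattering junction between two cylindrical
 * tube segments. Areas are derived from the diameters; the
 * reflection coefficient k determines how much of the travelling
 * wave bounces back at the discontinuity between sections. */
static void kl_junction(double d1, double d2,
                        double right_in, double left_in,
                        double *right_out, double *left_out)
{
    double a1 = d1 * d1;              /* area is proportional to d^2 */
    double a2 = d2 * d2;
    double k = (a1 - a2) / (a1 + a2); /* reflection coefficient */
    double w = k * (right_in + left_in);

    *right_out = right_in - w; /* wave transmitted to the right */
    *left_out  = left_in + w;  /* wave reflected back to the left */
}
```

Running this update at every junction, each sample, is essentially the whole filter; "moving the tongue" just means changing some of the diameters, which changes the reflection coefficients.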


Title says "Vox". Page says "Voc".


fixed



