The problem is that thread-local storage is slow: it's an extra layer of indirection on every access. Also, thread safety is just one concern in the GIL debate. One should be able to run many low-overhead, independent interpreter instances on a single thread rather than relying on the arcane interpreter context-switching scheme that presently exists in Python 2 and 3.
One use case, totally independent of parallelism: if someone embeds Python inside Postgres or some other large system, I cannot also embed Python in my extension. First one wins, and now our systems have to agree. In Lua, you allocate an interpreter context and use that. There can be any number of Lua interpreters embedded in a large system without conflicting.
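To make the Lua comparison concrete, here is a minimal sketch in C of the explicit-context style. Everything in it (`interp`, `interp_new`, `interp_set`, `interp_get`, `interp_free`) is a made-up toy for illustration, not Lua's or CPython's actual API; in real Lua the corresponding allocation and teardown calls are `luaL_newstate` and `lua_close`. The only point is that all state hangs off a handle the caller owns, so nothing is shared through globals.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy "interpreter": all of its state lives in this struct,
 * none of it in process-wide globals or thread-local storage. */
#define INTERP_MAX_VARS 16
#define INTERP_NAME_LEN 32

typedef struct {
    char name[INTERP_NAME_LEN];
    long value;
} interp_var;

typedef struct {
    interp_var vars[INTERP_MAX_VARS];
    size_t count;
} interp;

/* Allocate a fresh, empty context. Returns NULL if allocation fails. */
interp *interp_new(void) {
    return calloc(1, sizeof(interp));
}

/* Release a context. Other contexts are unaffected. */
void interp_free(interp *it) {
    free(it);
}

/* Bind name to value inside this context only.
 * Returns 0 on success, -1 if the name is too long or the table is full. */
int interp_set(interp *it, const char *name, long value) {
    size_t i;
    if (strlen(name) >= INTERP_NAME_LEN) {
        return -1;
    }
    for (i = 0; i < it->count; i++) {
        if (strcmp(it->vars[i].name, name) == 0) {
            it->vars[i].value = value;
            return 0;
        }
    }
    if (it->count == INTERP_MAX_VARS) {
        return -1;
    }
    strcpy(it->vars[it->count].name, name);
    it->vars[it->count].value = value;
    it->count++;
    return 0;
}

/* Look up name in this context only.
 * Returns 0 and stores the value in *out if found, -1 otherwise. */
int interp_get(const interp *it, const char *name, long *out) {
    size_t i;
    for (i = 0; i < it->count; i++) {
        if (strcmp(it->vars[i].name, name) == 0) {
            *out = it->vars[i].value;
            return 0;
        }
    }
    return -1;
}
```

With this shape, a host application and an extension loaded into it can each call `interp_new()` on the same thread and set the same variable name to different values without ever seeing each other. Neither has to know the other exists, which is exactly what a design built on global interpreter state rules out.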
The use of globals in the CPython interpreter is a fairly large design mistake that has prevented Python from having as much reach as other systems. The whole embedding vs. extending debate exists because of those globals and because CPython has historically been difficult to embed properly. Lua, on the other hand, is easy to both embed and extend.
I'd love to use `cffi` to embed Python within itself, but I cannot do that.
E.g. https://bitbucket.org/tpn/pyparallel/src/3be2954508f9938b85a...