
>The fallacy here is assuming this is a necessary trade off. You can have good single threaded performance without a GIL.

It depends what you're doing. Classic example: I have a huge dataframe, 10s-100s of GB. I want to process it in multiple threads, with each thread handling a different part of it. If I just use multiprocessing, it has to copy the memory to the new processes and then copy the results back, which is super slow. Sure there are hacky work-arounds and alternative approaches, but in a language without the GIL I don't need to fiddle around with those, I can just do the equivalent of Pool.map() and it doesn't need to copy anything.
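As a sketch of what the parent comment is asking for: with threads, the data lives in one shared heap, so nothing is pickled or copied. Under a free-threaded build of CPython (PEP 703, available as an experimental build from 3.13) the worker threads actually run in parallel; on a standard GIL build the same code runs correctly but pure-Python work is serialized. The names here (`data`, `chunk_sum`) are illustrative, standing in for the dataframe case:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))  # stand-in for the huge dataframe

def chunk_sum(i):
    # Each thread reads its own slice of `data` in place;
    # threads share the process heap, so nothing is copied.
    n = len(data) // 4
    return sum(data[i * n:(i + 1) * n])

with ThreadPoolExecutor(max_workers=4) as ex:
    total = sum(ex.map(chunk_sum, range(4)))
```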



1) GIL or no GIL makes no visible difference to your Python code: how it works, what it can or cannot do. It's only a matter of whether the interpreter internally uses locks or not.

2) There is no more copying in multiprocessing than in multithreading (on platforms where worker processes are forked), with the minor exception of the reference-counting fields used internally by the interpreter, which get copied on write.

3) The problem you describe is not about multiprocessing vs. multithreading; it comes from a misunderstanding of the Pool.map() API. What you pass to Pool.map() is sent over a queue (and thus pickled) to worker threads - or processes. You don't _have_ to use this queue, as long as your function has some other way to access the variable you want to use. That is, the following code will have your subprocesses share, without copying, your 100GB dataframe:

    import multiprocessing
    import pandas as pd
    
    DF = pd.read_parquet('100GB.pq')
    
    def worker_process(id):
        # Do something with the shared DF
        return 42
    
    with multiprocessing.Pool(10) as pool:
        pool.map(worker_process, [1, 2, 3, 4, 5])



