
>The fallacy here is assuming this is a necessary trade off. You can have good single threaded performance without a GIL.

It depends what you're doing. Classic example: I have a huge dataframe, 10s-100s of GB. I want to process it in multiple threads, with each thread handling a different part of it. If I just use multiprocessing, it has to copy the memory to the new processes and then copy the results back, which is super slow. Sure there are hacky work-arounds and alternative approaches, but in a language without the GIL I don't need to fiddle around with those, I can just do the equivalent of Pool.map() and it doesn't need to copy anything.
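As a sketch of what the parent comment is asking for: with threads, the data lives in one shared heap, so nothing is pickled or copied. Under a free-threaded build of CPython (PEP 703, available as an experimental build from 3.13) the worker threads actually run in parallel; on a standard GIL build the same code runs correctly but pure-Python work is serialized. The names here (`data`, `chunk_sum`) are illustrative, standing in for the dataframe case:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))  # stand-in for the huge dataframe

def chunk_sum(i):
    # Each thread reads its own slice of `data` in place;
    # threads share the process heap, so nothing is copied.
    n = len(data) // 4
    return sum(data[i * n:(i + 1) * n])

with ThreadPoolExecutor(max_workers=4) as ex:
    total = sum(ex.map(chunk_sum, range(4)))
```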



1) GIL or no GIL makes no visible difference to your Python code: how it works, what it can or cannot do. It's only a matter of whether the interpreter internally uses locks or not.

2) There is no more copying in multiprocessing than in multithreading (on platforms where worker processes are forked), with the minor exception of the reference-counting fields used internally by the interpreter, which get copied on write.

3) The problem you describe is not about multiprocessing vs. multithreading; it comes from a misunderstanding of the Pool.map() API. What you pass to Pool.map() is sent over a queue (and thus pickled) to worker threads - or processes. You don't _have_ to use this queue, as long as your function has some other way to access the variable you want to use. That is, the following code will have your subprocesses share, without copying, your 100GB dataframe:

    import multiprocessing
    import pandas as pd
    
    DF = pd.read_parquet('100GB.pq')
    
    def worker_process(id):
        # Do something with the shared DF
        return 42
    
    with multiprocessing.Pool(10) as pool:
        pool.map(worker_process, [1, 2, 3, 4, 5])



