Multi-Core by Default

(rfleury.com)

50 points | by kruuuder 7 hours ago

10 comments

jerf 44 minutes ago
If the author has not already, I would commend to them a search of the literature (or the relevant blog summaries) for the term "implicit parallelism". This was an academic topic from a few years back (my brain does not do this sort of thing very well but I want to say 10-20 years ago) where the hope was that we could just fire some sort of optimization technique at normal code which would automatically extract all the implicit parallelism in the code and parallelize it, resulting in massive essentially-free gains.
In order to do this, the first thing that was done was to analyze existing source code and determine the maximum amount of implicit parallelism in the code, assuming it was free. The attempt basically failed right there. Intuitively we all expect that our code has tons of implicit parallelism that can be exploited. It turns out our intuition is wrong, and the maximum amount of parallelism that was extracted was often in the 2x range, which, even if the parallelization were free, was only a marginal improvement.
Moreover, it is also often not something terribly amenable to human optimization either.
A game engine might be the best case scenario for this sort of code, but once you start putting the coordination costs back into the charts those charts start looking a lot less impressive in practice. I have a sort of rule of thumb that the key to high-performance multithreading is that the cost of the payload of a given bit of coordination overhead needs to be substantially greater than the cost of the coordination, and a game engine will not necessarily have that characteristic... it may have lots of tasks to be done in parallel, but if they
samuelknight 13 minutes ago
There is a deep literature on this in the High Performance Computing (HPC) field, where researchers traditionally needed to design simulations to run on hundreds to thousands of nodes with up to hundreds of CPU threads each. Computation can be defined as dependency graphs at the function or even variable level (depending on how granular you can make your threads). Languages built on top of LLVM or interpreters that expose AST can get you a long way there.
MontyCarloHall 2 hours ago
This post underscores how traditional imperative language syntax just isn't that well-suited to elegantly expressing parallelism. On the other hand, this is exactly where array languages like APL/J/etc. or array-based frameworks like NumPy/PyTorch/etc. really shine.
The list summation task in the post is just a list reduction, and a reduction can automatically be parallelized for any associative operator. The gory parallelization details in the post are only something the user needs to care about in a purely imperative language that lacks native array operations like reduction. In an array language, the `reduce` function can detect whether the reduction operator is associative and if so, automatically handle the parallelization logic behind-the-scenes. Thus `reduce(values, +)` and `reduce(values, *)` would execute seamlessly without the user needing to explicitly implement the exact subdivision of work. On the other hand, `reduce(values, /)` would run in serial, since division is not associative. Custom binary operators would just need to declare whether they're associative (and possibly commutative, depending on how the parallel scheduler works internally), and they'd be parallelized out-of-the-box.
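A minimal sketch of the idea above, in Python rather than an array language. The name `parallel_reduce` and its `associative` flag are hypothetical, invented for illustration; a real array language would know this property of its built-in operators instead of being told. Threads are used only to show the structure of the split; under CPython's GIL they will not speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def parallel_reduce(values, op, associative, workers=4):
    """Reduce `values` with `op`, chunking the work only if `op` is declared associative.

    `associative` is a hypothetical flag supplied by the caller.
    """
    values = list(values)
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    # Non-associative operators (or tiny inputs) fall back to a serial left fold.
    if not associative or len(values) < 2 * workers:
        return reduce(op, values)
    chunk = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so only associativity is required,
        # not commutativity.
        partials = list(pool.map(lambda c: reduce(op, c), chunks))
    return reduce(op, partials)


total = parallel_reduce(range(1, 101), operator.add, associative=True)
product = parallel_reduce(range(1, 11), operator.mul, associative=True)
quotient = parallel_reduce([64, 2, 2, 2], operator.truediv, associative=False)
```

Because the partial results are combined in their original order, the chunked path needs only associativity; a scheduler that combined partials in completion order would also need commutativity, which is the distinction the comment hints at.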
jongjong 2 minutes ago
Some code should be single core; like for example, a frontend UI for a web application... You don't want to be hoarding all of the user's CPU capacity with your frontend.
But I do like implementing my backends as multi-core by default because it forces me to architect the system in a simple way. In many cases, I find it easier to implement a multi-core approach. The code is often more maintainable and secure when you don't assume that state is always available in the current process. It forces a more FP/stateless approach. Or at least it makes you think really hard about what kind of state you want to keep in memory.
zelphirkalt 2 hours ago
I think the actually interesting non-obvious part starts at "Redesigning Algorithms For Uniform Work Distribution". All the prior stuff you basically get for free in a functional language that has some thread pool or futures built in, doing FP. The real question is how you write algorithms or parts of programs in a way that they lend themselves to being run in parallel as small units with results that easily merge again (parallel map reduce) or maybe don't even need to be merged again. That is the really difficult part, aside from transforming some mutating program into FP style and having appropriate data structures.
And then of course the heuristics start to become important. How much parallelism, before overhead eats the speedup?
Another question is energy efficiency. Is it more important to finish calculation as quickly as possible, or would it be OK to need some longer time, but in total calculate less, due to less overhead and no/less merging?
bob1029 3 hours ago
The thing I struggle with is that most userland applications simply don't need multiple physical cores from a capacity standpoint.
Proper use of concepts like async/await for IO bound activity is probably the most important thing. There are very few tasks that are truly CPU bound that a typical user is doing all day. Even in the case of gaming you are often GPU bound. You need to fire up things like Factorio, Cities Skylines, etc., to max out a multicore CPU.
Even when I'm writing web backends I am not really thinking about how I can spread my workload across the cores. I just use the same async/await interface and let the compiler, runtime and scheduler figure the annoying shit out for me.
Task.WhenAll tends to be much more applicable than Parallel.ForEach. If the information your code is interacting with doesn't currently reside in the same physical host, use of the latter is almost certainly incorrect.
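For readers outside .NET, a rough Python analogue of the `Task.WhenAll` pattern is `asyncio.gather`. The sketch below is an illustration, not the commenter's code: `fetch` is a hypothetical stand-in for a network call, and the delays are made up. It shows several IO-bound waits overlapping on a single thread, so the total time is close to the longest wait rather than the sum.

```python
import asyncio
import time


async def fetch(name, delay):
    # Hypothetical stand-in for a network call; asyncio.sleep yields to the event loop.
    await asyncio.sleep(delay)
    return f"{name}:done"


async def fetch_all():
    # All three waits are in flight at once; no extra threads or cores are involved.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))


def run():
    start = time.perf_counter()
    results = asyncio.run(fetch_all())
    return results, time.perf_counter() - start
```

Run serially, the three calls would take about 0.3 s; gathered, they finish in roughly the time of one. Nothing here uses a second core, which is the point being made about IO-bound work.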
- CJefferson 2 hours ago
  I find async a terrible way to write interactive apps, because eventually something will take too long, and then suddenly your app jerks. So I have to keep figuring out manually which tasks need sending to a thread pool, or splitting my tasks into smaller and smaller pieces.
  I’m obviously doing something wrong, as the rest of the world seems to love async. Do their programs just do no interesting CPU intensive work?
  - pmontra 1 hour ago
    Probably. In web development it's usually get data, transform data, send data. That's in both directions, client to server and vice versa. Transformations are almost always simple. Native apps maybe do something more client side on average but I won't bet anything more than a cup of coffee on that.
  - CaptainOfCoit 2 hours ago
    Are you using multiple threads or just a single one? Not sure why your application would "jerk" because something takes a long time? If it's in a separate thread, it being async or not shouldn't matter, or if it's doing CPU intensive work or just sleeping.
    - CJefferson 1 hour ago
      If I’m using threads for each of my tasks, then why do I need async at all? I find mixing async and threads is messy, because it’s hard to take a lock in async code, as that blocks other async code from running. I’m sure this can be done well, but I failed when I tried.
  - dboreham 1 hour ago
    Rest of the world doesn't love async. Just the loud opinionated people.
    - hombre_fatal 1 hour ago
      Depends on the abstraction.
      Promise/future/async/await is pretty good compared to the code it's replacing.
      Meanwhile I worked on a Netty (async Java web server) app that I never quite understood. Not even when it was "simplified" to use the CompletionStage API[1]. I could see someone swearing off async for life after that.
      [1]: https://docs.oracle.com/javase/8/docs/api/java/util/concurre...
gethly 1 hour ago
Everyone wants parallelism until mutexes enter the room.
jillesvangurp 4 hours ago
The main challenge here is that a lot of languages have historically treated threading as an afterthought. Python is a good example where support was so limited (due to the GIL, which they are only now in the process of removing) that people mostly just didn't bother with it and just tried to orchestrate processes instead. Languages like Go and JavaScript are really good at async stuff but that's mostly all happening on 1 core. You can of course run these with multiple cores but you have only limited control over which core does what.
Java has had threading from v1. Fun fact, it was all green threads in 1.0. Real threads that were able to use a second CPU (if you had one) did not come until 1.1. And they've now come full circle with a way to use "virtual" threads. Which technically is what they started with 30 years ago. Java also went on a journey of doing blocking IO on threads, jumping through a lot of hoops (nio) to introduce non blocking io, and lately rearchitecting the blocking io such that you can (mostly) pretend your blocking io is non blocking via virtual threads.
That's essentially what project Loom enables. Pretty impressive from a technical point of view but it's a bit of a leaky abstraction with some ugly failure modes (e.g. deadlocks if something happens to use the synchronized keyword deep down in some library). If that happens on a single real thread running a lot of virtual threads, the whole thread and all the virtual threads on it are blocked.
There are other languages on the JVM that use a bit higher level abstractions here. I'm mainly familiar with Kotlin's coroutines. But Scala went there before them of course. What I like in Kotlin's take on this is the notion of structured concurrency where jobs fork and join in a context and can be scheduled via dispatchers as a light weight co-routine, a thread pool, or a virtual thread pool (same API, that kind of was the point of Loom). So, it kind of mixes parallelism and concurrency and treats them as conceptually similar.
Structured concurrency is also on the roadmap for Java as I understand it. But a lot of mainstream languages are stuck with more low level or primitive mechanisms; or use completely different approaches for concurrency and parallelism. That's fine for experts using this for systems programming stuff but not necessarily ideal if we are all going to do multi core by default.
IMHO structured concurrency would be a good match for python as well. It's early days with the GIL removal but the threading and multiprocess modules are a bit dated/primitive. Async was added at some point in the 3.x cycle. But doing both async & threading is going to require something beyond what's there currently.
- pjmlp 3 hours ago
  On the .NET side, there is a dataflow framework based on TPL for structured concurrency, but few people are even aware it exists; it is async/await all over the place nowadays.
stephc_int13 2 hours ago
I think this is less innovative than it seems.
The approach described in this article is to reverse the good old fork/join, but it would only be practical for simple sub tasks or basic CLI tools, not entire programs.
In the end, using this style is almost the same as doing fork/join, except the setup is somewhat hidden.
- rfleury 2 hours ago
  Those interested can go look at all of the actual code I’ve written using these techniques, and decide for themselves whether or not it’s practical only for “simple sub tasks or basic CLI tools”:
  https://github.com/EpicGamesExt/raddebugger/blob/c738768e411...
  https://github.com/EpicGamesExt/raddebugger/blob/master/src/...
  - jt2190 2 hours ago
    Posting a tl;dr here might stave off some of the dismissive comments based only on reading the headline.
- imtringued 56 minutes ago
  Based on the title I would have assumed that the programming model would be inverted, but it wasn't. What is needed is something akin to the Haskell model where the evaluation order is unspecified while simultaneously allowing mutation. The way to do this would be a Rust style linear type system where you are allowed to acquire exclusive write access to a region in memory, but not be allowed to perform any side effect and all modifications must be returned as if the function was referentially transparent. This is parallel by default, because you actively have to opt into a sequential execution order if you want to perform side effects.
  The barriers to this approach are the same old problems with automatic parallelization.
  Current hardware assumes a sequential instruction stream with hardware threads and cores and no hardware primitive in the microsecond range to rapidly schedule code to be executed on another core. This means you must split your program into two identical programs that then are managed by the operating system. This kills performance due to excessive amount of synchronization overhead.
  The other problem is that even if you have low latency scheduling, you still need to gather a sufficient amount of work for each thread. Too fine grained and you run into synchronization overhead (no matter how good your hardware is), too coarse grained and you won't be able to spread the load onto all the processors.
  There is also a third problem that is lurking in the dark and many developers with the exception of the Haskell community are underestimating: Running programs in a suboptimal order can lead to a massive increase in the instantaneous memory usage to the point where the program can no longer run. Think of a program allocating memory for each request, processing it and then deallocating, then allocating again. What if it accepts all requests in parallel? It will first allocate everything, then process everything and then deallocate everything.
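The memory concern above can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses concurrent `asyncio` tasks rather than true parallel threads, counts "live" requests instead of measuring real memory, and the names `handle`, `run` and `peak_live` are hypothetical. It compares accepting every request at once against capping the number in flight with a semaphore.

```python
import asyncio


async def handle(stats, limiter=None):
    async def body():
        stats["live"] += 1                              # "allocate" a request buffer
        stats["peak"] = max(stats["peak"], stats["live"])
        await asyncio.sleep(0.001)                      # "process" the request
        stats["live"] -= 1                              # "deallocate"

    if limiter is None:
        await body()
    else:
        # At most `limit` requests may hold a buffer at the same time.
        async with limiter:
            await body()


async def run(n, limit=None):
    stats = {"live": 0, "peak": 0}
    limiter = asyncio.Semaphore(limit) if limit else None
    await asyncio.gather(*(handle(stats, limiter) for _ in range(n)))
    return stats["peak"]


def peak_live(n, limit=None):
    """Return the largest number of simultaneously 'allocated' requests."""
    return asyncio.run(run(n, limit))
```

With no limit, every request allocates before any finishes, so the peak equals the number of requests; with a limit, the peak is bounded by that limit at the cost of less overlap. Choosing the limit is the same granularity trade-off described in the preceding paragraph.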
kelsolaar 5 hours ago
Ryan is the author of the RAD Debugger: https://github.com/EpicGamesExt/raddebugger