Recursive divide-&-conquer algorithms are known for solving dynamic programming (DP) problems efficiently on shared-memory multicore machines. In this work, we extend them to run efficiently also on manycore GPUs and distributed-memory machines without changing their basic structure.
Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive $r$-way divide and conquer, where r (>= 2) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds.
We also report empirical results for our algorithms.