MUDA is μ-CUDA, yet another painless CUDA programming paradigm.
COVER THE LAST MILE OF CUDA
Introduction
What is muda?
- Header-only library right out of the box
- Depends only on CUDA and standard libraries
- Improves the readability, maintainability, safety, and debugging efficiency of CUDA code.
- Reduces the difficulty of building CUDA Graphs and increases automation.
Why muda?
Suppose you want to try a new idea or build a quick demo with CUDA, and, if the demo works out, embed it into your project. You will soon find that using raw CUDA directly is a disaster.

Coding against the CUDA C API buries you in irrelevant details, and the information density of the code is low: you write a lot of redundant code to achieve a simple task. Less is more, right?

Debugging GPU code is a nightmare. How much time have you spent chasing weird CUDA error codes, only to find out in the end that it was an illegal memory access?

Reaching for a GPU debugger when something goes wrong is one approach, but the best way is to prevent bugs from happening in the first place. Most of the time, automatic range checking is all we want. It would be fantastic if someone told you: "Hey, at `Block 103, Thread 45`, in the kernel named `set_up_array`, the buffer named `array` went out of range: your index is `100` while the array size is `96`." And then trapped the kernel and stopped the program, so that later chained bugs can't produce confusing debug information.
That is muda!

If you access memory through a MUDA viewer, muda will tell you all of that.

MUDA also provides an elegant way to create and update CUDA Graphs, called the MUDA ComputeGraph:
- Switching from stream-based launch mode to graph launch mode takes almost no effort.
- Updating node parameters and shared resources in CUDA Graphs becomes intuitive, safe, and efficient.

MUDA is also simple to extend:
- Users can follow the primary muda interfaces to define their own objects and reuse the muda facilities.
- Almost all "resource view" types can be used directly in the MUDA ComputeGraph.
A substitution of thrust?
Nope! MUDA is a supplement to thrust!
Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high-performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C.
Using iterators to prevent range errors is a high-level approach. However, we still need to access memory manually in our own kernels, whether using a raw CUDA kernel launch (<<<>>>) or a thrust agent kernel (most of the time, a thrust::counting_iterator in a thrust::for_each algorithm).

So MUDA is a mid-level approach: the same purpose, but a different level, aimed at different problems. Feel free to use them together!
Here is an example of using Thrust and MUDA together.

using namespace muda;
using namespace thrust;

constexpr auto N = 10;
DeviceBuffer<int> x(N); // a muda buffer, accessed via its viewer from a thrust algorithm
for_each(thrust::cuda::par_nosync,
         make_counting_iterator(0), make_counting_iterator(N),
         [x = x.viewer().name("x")] __device__(int i) mutable
         {
             x(i) = i;
         });
Overview
This is a quick overview of some muda APIs. Check it to find what is useful for you; a comprehensive description of MUDA can be found in the Tutorial.
Launch
Simple, self-explanatory, intellisense-friendly Launcher.
#include <muda/muda.h>
using namespace muda;
__global__ void raw_kernel()
{
printf("hello muda!\n");
}
int main()
{
    // single-thread lambda launch
    Launch(1, 1)
        .apply(
            [] __device__()
            {
                print("hello muda!\n");
            }).wait();

    constexpr int N = 8;

    // dynamic grid: the grid size is computed from N and the block size
    ParallelFor(256 /*blockDim*/)
        .apply(N,
               [] __device__(int i)
               {
                   print("hello muda %d!\n", i);
               }).wait();

    // grid-stride loop: a fixed grid walks over all N indices
    ParallelFor(8  /*gridDim*/,
                32 /*blockDim*/)
        .apply(N,
               [] __device__(int i)
               {
                   print("hello muda %d!\n", i);
               }).wait();

    // wrapper of a raw cuda kernel launch, no <<<>>> needed
    Kernel{raw_kernel}();
    Kernel{32 /*gridDim*/, 64 /*blockDim*/, 0 /*sharedMem*/, stream, other_kernel}(...);
}
Logger
A std::cout-like output stream with formatting overloads.
Logger logger;
Launch(2, 1)
    .apply(
        [logger = logger.viewer()] __device__() mutable
        {
            logger << "int2: " << make_int2(1, 2) << "\n";
            logger << "float3: " << make_float3(1.0f, 2.0f, 3.0f) << "\n";
        })
    .wait();
// download the buffered output to a host stream
logger.retrieve(std::cout);
You can define a global __device__ LoggerViewer cout and construct the logger with the overloaded constructor Logger(LoggerViewer* global_viewer) to use it without any capturing. This is useful when you need a logger in some function but don't want to pass the LoggerViewer as a function parameter.
namespace foo
{
    __device__ LoggerViewer cout; // global logger viewer
    __device__ void say_hello() { cout << "hello global logger!\n"; }
}

int main()
{
    LoggerViewer* viewer_ptr = nullptr;
    checkCudaErrors(cudaGetSymbolAddress((void**)&viewer_ptr, foo::cout));
    Logger logger(viewer_ptr); // proxy the global viewer
    Launch()
        .apply([] __device__() mutable
               {
                   foo::say_hello();
               })
        .wait();
    logger.retrieve(std::cout);
}
Further, you can use muda::Debug::set_sync_callback() to retrieve the output whenever wait() is called:
int main()
{
    LoggerViewer* viewer_ptr = nullptr;
    checkCudaErrors(cudaGetSymbolAddress((void**)&viewer_ptr, foo::cout));
    Logger logger(viewer_ptr);
    // flush the logger every time a muda wait() happens
    muda::Debug::set_sync_callback([&] { logger.retrieve(std::cout); });
    Launch()
        .apply([] __device__() mutable
               {
                   foo::cout << "hello\n";
               })
        .wait();
}
Buffer
A lightweight std::vector-like CUDA device memory container. In addition, 2D/3D aligned buffers are also provided.
DeviceBuffer<int> buffer{8};
DeviceBuffer<int> dst_buffer{8};

// copy from a host vector
std::vector<int> host(8);
buffer.copy_from(host);

// copy to raw host memory
int host_array[8];
buffer.copy_to(host_array);

// copy through sub-views
buffer.view(0, 4).copy_from(host.data());
buffer.view(0, 4).copy_to(dst_buffer.view());

// 2D/3D buffers copy the same way
DeviceBuffer2D<int> buffer2d;
DeviceBuffer3D<int> buffer3d;
buffer2d.copy_to(host);
buffer3d.copy_to(host);
The old data is safely kept when you resize a 2D or 3D buffer. If you don't want to keep the old data, call .clear() before your .resize().
Viewer In Kernel
MUDA viewers provide safe in-kernel memory access: they check every input to ensure an access does not go out of range and does not dereference a null pointer. If something goes wrong, they report as much debug information as possible and trap the kernel to prevent further errors. Don't forget to fill in kernel_name() and name().
Launch()
    .kernel_name(__FUNCTION__) // kernel name for debug info
    .apply(
        [
            single  = single.viewer().name("single"),     // DeviceVar<int>
            array   = buffer.viewer().name("array"),      // DeviceBuffer<int>
            array2d = buffer_2d.viewer().name("array2d"), // DeviceBuffer2D<int>
            array3d = buffer_3d.viewer().name("array3d"), // DeviceBuffer3D<int>
            logger  = logger.viewer(),
            ...
        ] __device__() mutable
        {
            single = 1;
            array(i) = 1;
            array2d(offset_in_height, offset_in_width) = 1;
            array3d(offset_in_depth, offset_in_height, offset_in_width) = 1;
            logger << 1;
        });
Event And Stream
If you don't want to launch work on the default stream, use Stream to create asynchronous streams, and use Event to synchronize between streams.
Stream         s1, s2;
Event          set_value_done;
DeviceVar<int> v = 1;

on(s1)
    .next<Launch>(1, 1)
    .apply(
        [v = v.viewer()] __device__() mutable
        {
            int next = 2;
            v = next;
        })
    .record(set_value_done) // record the event on s1
    .apply(
        [] __device__()
        {
            some_work();
        });

on(s2)
    .when(set_value_done) // wait for the event before working on s2
    .next<Launch>(1, 1)
    .apply([v = v.viewer()] __device__()
           { int res = v; });
Asynchronous Operation
MUDA launchers' operations are asynchronous, meaning you need to call .wait() to synchronize on the host. A launcher keeps its operations as asynchronous as possible, so you must synchronize the stream yourself; other APIs, such as BufferView::copy_from, synchronize by themselves.
Launch(stream).apply(...).wait();
Memory(stream).copy(...).wait();
Memory(stream).set(...).wait();
Feel free to ignore the Memory and BufferLaunch launchers when building a quick demo; directly use the convenient synchronous APIs of DeviceBuffer/BufferView until you find they are the performance hotspot. That will save you a lot of time.
It's good practice to keep launchers asynchronous while keeping the other APIs synchronous, which obeys the 80/20 rule.
[Extension] Linear System Support
[We are still working on this part]
MUDA supports basic linear system operations. e.g.:
- Sparse Matrix Format Conversion
- Sparse Matrix Assembly
- Linear System Solving
The only thing you need to do is declare a muda::LinearSystemContext.
LinearSystemContext           ctx;
DeviceTripletMatrix<float, 3> A_triplet; // 3x3-block triplets
DeviceBCOOMatrix<float, 3>    A_bcoo;
DeviceBSRMatrix<float, 3>     A_bsr;
DeviceCSRMatrix<float>        A_csr;
DeviceDenseMatrix<float>      A_dense;

A_triplet.reshape(block_rows, block_cols);
A_triplet.resize_triplets(hessian_count);

// sparse matrix format conversions
ctx.convert(A_triplet, A_bcoo);
ctx.convert(A_bcoo, A_bsr);
ctx.convert(A_bsr, A_dense);
ctx.convert(A_bsr, A_csr);
ctx.convert(A_bcoo, A_dense);
We only allow users to assemble a sparse matrix from a triplet matrix, and to read entries from a BCOO matrix. To assemble a triplet matrix, use the viewer of the triplet matrix.
LinearSystemContext           ctx;
DeviceTripletMatrix<float, 3> A_triplet;
DeviceBCOOMatrix<float, 3>    A_bcoo;
DeviceBSRMatrix<float, 3>     A_bsr;
DeviceDenseVector<float>      x, b;

A_triplet.resize(block_rows, block_cols, hessian_count);
x.resize(block_rows * 3);
b.resize(block_rows * 3);

ParallelFor(256)
    .apply(hessian_count,
           [
               H = A_triplet.viewer().name("Hessian"),
               g = x.viewer().name("gradient")
           ] __device__(int i)
           {
               int row, col;
               Eigen::Matrix3f hessian;
               Eigen::Vector3f gradient;
               // ... compute row, col, hessian and gradient for the i-th contribution ...
               H(i).write(row, col, hessian);
               g.segment<3>(row * 3).atomic_add(gradient);
           }).wait();

ctx.convert(A_triplet, A_bcoo);
ctx.convert(A_bcoo, A_bsr);
// b = A * x
ctx.spmv(A_bsr.cview(), x.cview(), b.view());
[TODO] Later, we may introduce expression templates to strengthen the linear algebra calculation and use lazy evaluation to pack as many element-wise/scattering operations as possible into a single kernel.
[Extension] Field Layout
MUDA now supports the SoA/AoS/AoSoA layouts. Users can switch between them seamlessly (with different builder parameters). Copy operations in all directions and layouts are well supported.
Most of the time, AoSoA is the best layout for vectors and matrices, giving better memory coalescing and more compact storage. With the help of Eigen::Map<>, reading and writing a vector/matrix is as trivial as accessing a struct-version vector/matrix.
Here is a simple example of muda::Field.
#include <muda/ext/field.h>

void field_example(FieldEntryLayout layout)
{
    using namespace muda;
    using namespace Eigen;

    Field  field;
    Logger logger;

    auto& particle = field["particle"];
    float dt       = 0.01f;

    // build the field with the given layout (SoA/AoS/AoSoA)
    auto builder = particle.builder(layout); // e.g. FieldEntryLayout::AoSoA
    auto& m       = builder.entry("mass").scalar<float>();
    auto& pos     = builder.entry("position").vector3<float>();
    auto& pos_old = builder.entry("position_old").vector3<float>();
    auto& vel     = builder.entry("velocity").vector3<float>();
    auto& force   = builder.entry("force").vector3<float>();
    auto& I       = builder.entry("inertia").matrix3x3<float>();
    builder.build();

    // resize all particle attributes at once
    constexpr int N = 10;
    particle.resize(N);

    ParallelFor(256)
        .kernel_name("setup_vars")
        .apply(N,
               [logger = logger.viewer(),
                m   = m.viewer(),
                pos = pos.viewer(),
                vel = vel.viewer(),
                f   = force.viewer()] __device__(int i) mutable
               {
                   m(i)   = 1.0f;
                   pos(i) = Vector3f::UnitY() + Vector3f::UnitX() * i;
                   vel(i) = Vector3f::Zero();
                   f(i)   = Vector3f{0.0f, -9.8f, 0.0f};
                   logger << "--------------------------------\n"
                          << "i=" << i << "\n"
                          << "m=" << m(i) << "\n"
                          << "pos=" << pos(i) << "\n"
                          << "vel=" << vel(i) << "\n"
                          << "f=" << f(i) << "\n";
               })
        .wait();
    logger.retrieve();

    ParallelFor(256)
        .kernel_name("integration")
        .apply(N,
               [logger = logger.viewer(),
                m   = m.cviewer(),
                pos = pos.viewer(),
                vel = vel.viewer(),
                f   = force.cviewer(),
                dt] __device__(int i) mutable
               {
                   auto     x = pos(i);
                   auto     v = vel(i);
                   Vector3f a = f(i) / m(i);
                   v = v + a * dt;
                   x = x + v * dt;
                   pos(i) = x;
                   vel(i) = v;
                   logger << "--------------------------------\n"
                          << "i=" << i << "\n"
                          << "pos=" << pos(i) << "\n"
                          << "vel=" << vel(i) << "\n";
               })
        .wait();
    logger.retrieve();

    // copy between a field entry and host memory
    std::vector<Vector3f> positions;
    pos.copy_to(positions);
    pos.copy_from(positions);
    // copy between entries
    pos_old.copy_from(pos);
    // copy between an entry and a device buffer
    DeviceBuffer<Vector3f> pos_buf;
    pos.copy_to(pos_buf);
    pos.copy_from(pos_buf);
}
Additionally, resizing a subfield resizes all its entries, and the resizing is safe: all entries are copied to a new buffer if the new size exceeds the capacity.
This is useful when the topology changes during your simulation. For example, a fracture simulation keeps changing its tetrahedral mesh, and all attributes of particles, edges, triangles, and tetrahedra change correspondingly. Resizing all those attributes yourself would be terrible.
Note that every FieldEntry has a view called FieldEntryView. A FieldEntryView can be regarded as a ComputeGraphVar (see below), which means a FieldEntry can also be used in a ComputeGraph.
Compute Graph
Define MUDA_WITH_COMPUTE_GRAPH to turn on Compute Graph support.
MUDA can generate cudaGraph nodes and dependencies from your eval() calls, and the cudaGraphExec is automatically (and minimally) updated when you update a muda::ComputeGraphVar. More details in zhihu_ZH.
Define a muda compute graph:
void compute_graph_simple()
{
    ComputeGraphVarManager manager;
    ComputeGraph           graph{manager};

    // graph vars: updating them later minimally updates the cudaGraphExec
    auto& N   = manager.create_var<size_t>("N");
    auto& x_0 = manager.create_var<BufferView<Vector3>>("x_0");
    auto& x   = manager.create_var<BufferView<Vector3>>("x");
    auto& y   = manager.create_var<BufferView<Vector3>>("y");

    graph.create_node("cal_x_0") << [&]
    {
        ParallelFor(256).apply(N.eval(),
                               [x_0 = x_0.eval().viewer()] __device__(int i) mutable
                               { x_0(i) = Vector3::Ones(); });
    };

    graph.create_node("copy_to_x") << [&]
    { BufferLaunch().copy(x.eval(), x_0.ceval()); };

    graph.create_node("copy_to_y") << [&]
    { BufferLaunch().copy(y.eval(), x_0.ceval()); };

    graph.create_node("print_x_y") << [&]
    {
        ParallelFor(256).apply(N.eval(),
                               [x = x.ceval().cviewer(),
                                y = y.ceval().cviewer(),
                                N = N.eval()] __device__(int i) mutable
                               {
                                   if(N <= 10)
                                       print("[%d] x = (%f,%f,%f) y = (%f,%f,%f)\n",
                                             i,
                                             x(i).x(), x(i).y(), x(i).z(),
                                             y(i).x(), y(i).y(), y(i).z());
                               });
    };

    // dump the generated graph structure
    graph.graphviz(std::cout);
}
Launch a muda compute graph:
void compute_graph_simple()
{
    // ... graph definition as above ...

    auto N_value    = 4;
    auto x_0_buffer = DeviceVector<Vector3>(N_value);
    auto x_buffer   = DeviceVector<Vector3>(N_value);
    auto y_buffer   = DeviceVector<Vector3>(N_value);

    // update the graph vars before launching
    N.update(N_value);
    x_0.update(x_0_buffer);
    x.update(x_buffer);
    y.update(y_buffer);

    // launch the cudaGraph on a stream
    graph.launch(stream);
    // or launch all nodes serially on the stream (useful for debugging)
    graph.launch(true, stream);
}
Dynamic Parallelism
MUDA supports dynamic parallelism based on the MUDA ComputeGraph.
void dynamic_parallelism_graph()
{
    ComputeGraphVarManager manager;

    std::vector<int> host(16);
    std::iota(host.begin(), host.end(), 0);
    DeviceBuffer<int> src = host;
    DeviceBuffer<int> dst(host.size());

    // this graph can be launched from the device
    ComputeGraph graph{manager,
                       "graph", ComputeGraphFlag::DeviceLaunch};
    auto& src_var = manager.create_var("src", src.view());
    auto& dst_var = manager.create_var("dst", dst.view());
    graph.$node("copy")
    {
        BufferLaunch().copy(dst_var, src_var); // the node's work
    };
    graph.build();

    ComputeGraph launch_graph{manager,
                              "launch_graph", ComputeGraphFlag::DeviceLaunch};
    auto& graph_var = manager.create_var("graph", graph.viewer());
    launch_graph.$node("launch")
    {
        Launch().apply([graph = graph_var.ceval()] $()
                       {
                           // launch the inner graph from the device
                           graph.tail_launch();
                       });
    };

    manager.graphviz(std::cout);
    launch_graph.launch().wait();
}
MUDA vs. CUDA
void muda()
{
    DeviceVector<int> dv(64);
    dv.fill(1);
    ParallelFor(256)
        .kernel_name("my_kernel")
        .apply(64,
               [dv = dv.viewer().name("dv")] __device__(int i) mutable
               {
                   dv(i) *= 2;
               })
        .wait();
}
__global__ void times2(int* i, int N)
{
auto tid = threadIdx.x;
if(tid < N)
{
i[tid] *= 2;
}
}
void cuda()
{
thrust::device_vector<int> dv(64, 1);
auto dvptr = thrust::raw_pointer_cast(dv.data());
cudaStream_t s;
checkCudaErrors(cudaStreamCreate(&s));
times2<<<1, 64, 0, s>>>(dvptr, dv.size());
checkCudaErrors(cudaStreamSynchronize(s));
}
Build
Xmake
Run example:
$ xmake f --example=true
$ xmake
$ xmake run muda_example hello_muda
To show all examples:
$ xmake run muda_example -l
Play all examples:
Cmake
$ mkdir CMakeBuild
$ cd CMakeBuild
$ cmake -S ..
$ cmake --build .
Copy Headers
Because muda is header-only, just copy the src/muda/ folder into your project and set the include directory; that's all.
Macro
| Macro | Value | Details |
| --- | --- | --- |
| MUDA_CHECK_ON | 1 (default) or 0 | MUDA_CHECK_ON=1 turns on all muda runtime checks (for safety) |
| MUDA_WITH_COMPUTE_GRAPH | 1 or 0 (default) | MUDA_WITH_COMPUTE_GRAPH=1 turns on the muda compute graph feature |
If you manually copy the header files, don't forget to define the macros yourself. If you use CMake or xmake, just add muda as a project dependency.
Tutorial
[TODO]
Examples
All examples in muda/example are self-explanatory. Enjoy!
Contributing
Contributions are welcome. We are looking for or are working on:
- muda development
- fancy simulation demos using muda
- better documentation of muda
Related Work