TExpr Compiler Part 1 - Motivation, introduction and theory.
Motivation
I have worked with several tensor languages / GPU domain-specific languages (Triton, Hidet, Futhark, XLA/JAX), and read the papers on several more (Tiramisu, TVM, Tensor Comprehensions). However, every one of them has limitations that, I think, fail to address the problems faced by performance engineers at ML companies.
- I need a tool that is stand-alone and entirely ahead-of-time compiled. It should generate a CUDA (or equivalent) program that can be easily integrated into a C++/Rust codebase that uses a mish-mash of other libraries like libtorch and TensorRT.
  - This is especially relevant for companies deploying models in latency-sensitive settings where JIT compilation is not acceptable.
  - It also matters when we need a static artifact that can be more thoroughly verified for correctness, as in safety-critical situations.
- I need a tool that simplifies CUDA programming, not one that tries to replace CUDA programming or performance engineers. Current approaches that attempt this are too high-level, and are therefore restricted in the programs that can be written and in the control they offer over optimizations.
  - All of the tools mentioned either operate at a tile level, where shared memory is not exposed as an abstraction, or at an even higher level, where it is not clear what constitutes a single kernel or which parts live in on-chip memory or in registers.
- All of these tools, whether using polyhedral techniques or traditional compiler passes, can do some amount of optimization: automatic parallelization, memory promotion, layout selection, pipelining, using tensor cores. But Nvidia has gone crazy and added new features to PTX that I think will be extremely difficult to schedule from such high-level representations. Some of the features I think are useful:
  - CTA clusters. Another level of parallelism, so now we have [grids, clusters, blocks, warps, threads].
  - mbarrier. Arrive/wait semantics on barrier objects (tokens) in shared memory to synchronize arbitrary sets of threads.
  - More async instructions. cp.async and st.async are asynchronous transactions between global and shared memory. (There are also TMA, which does bulk asynchronous tensor copies, and the tcgen async MMA instructions on Blackwell, but I haven't looked too much into those.)
  - Special red/atomic instructions.
- Some resources for advanced Nvidia GPU features:
  - https://research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/
  - https://research.colfax-intl.com/cutlass-tutorial-writing-gemm-kernels-using-tensor-memory-for-nvidia-blackwell-gpus/
  - https://research.colfax-intl.com/flashattention-3-fast-and-accurate-attention-with-asynchrony-and-low-precision/
Introduction
Contrary to Tensor Comprehensions and TVM, I have opted for a lower-level representation that exposes mutable memory semantics through the introduction of buffers. This IR is similar to MLIR's affine and memref dialects.
The TE representation. N represents a concrete natural, $i the symbolic name of an index, and $v the symbolic name of a tensor.
tstmt s = par N $i [s]
        | seq N $i [s]
        | bind $v iexpr texpr -- equivalent to v[iexpr] = texpr
        | new $v N
        | anchor texpr
texpr e = view $v iexpr -- equivalent to v[iexpr]
        | op [e]
iexpr i = I $i -- the iteration index
        | C Z -- constant integer
        | Neg i
        | Add i i
        | Mul i i
        | Div i i -- floor division
For the sake of simplicity I've omitted the representation of arithmetic operations on tensor values, and currently this representation does not admit indirect tensor accesses like A[I[i]].
Here’s an example program:
par M i {
  par N j {
    seq K k {
      C[i * N + j] += A[i * K + k] * B[k * N + j]
    }
  }
}
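For reference, the += above is surface syntax; in terms of the grammar it desugars into a bind whose right-hand side re-reads the same location. A sketch, where add and mul are placeholder names for the arithmetic ops omitted from the grammar, and index expressions are kept in infix form for readability:

bind C (i * N + j)
  (op add [ view C (i * N + j)
          , op mul [ view A (i * K + k), view B (k * N + j) ] ])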
I've chosen this representation for its compositionality, which is enforced by its (informal) semantics:
- All 'threads' are assumed to be synced at the beginning of a par block.
- Every thread in a par block is assumed to operate entirely independently, and is joined at the end of the block, so all memory effects are assumed to be visible at the conclusion of a par block.
- There is no guarantee that threads are fully parallel within a par block.
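To illustrate the first two points, here is a small sketch (the buffers and sizes are made up): the writes into tmp by the first par block are all visible to every thread of the second one, because of the implicit join in between.

new tmp 128
par 128 |i| {
  tmp[i] = A[i] * 2              -- each logical thread writes one element
}
-- implicit join: all writes to tmp are visible below
par 128 |i| {
  B[i] = tmp[i] + tmp[127 - i]   -- safe: reads elements written by other threads
}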
This representation is expressive enough (I think) to support the kind of "inline assembly" that would enable the latest Nvidia features, yet at the same time simplify aspects of GPU programming like memory access and planning, async operations, and, more generally, scheduling. I'm envisioning an extension system like:
fn mma(a: f16[M, K], b: f16[K, N]): f16[M, N] {
  fully par 32 |i| #align(32) #block { -- forces the code to execute in a block, as a warp
    reg a = ...
    reg b = ...
    reg c = ...
    asm(
      "mma.m16n8k16...", a, b, c
    ) -- warp-level instruction
  }
}
Currently, the design of the language is aimed at improving compositionality as well as helping with scheduling, tuning, and layout selection, and thus the par directive only guarantees concurrency, not full parallelism. Parts of a parallel loop may be simulated with sequential loops, either because hardware threads are unavailable or to improve instruction-level parallelism. This is similar to how CUDA blocks are not guaranteed to be fully parallel (fully resident on SMs), which can pose a problem for algorithms that depend on this guarantee (typically algorithms that use atomics as a lock). Additionally, due to the semantics of par, the compiler is free to reorder iterations of par, so consecutive indices may not map to consecutive hardware thread ids; this also poses a problem for certain warp-level primitives. But since these are rare use cases, I think it is sufficient to introduce a fully par directive with additional pragmas to force the block underneath to be executed as one would expect, with the restriction that no other fully par blocks can be nested under it.
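As a concrete example of code that needs these guarantees, consider a warp shuffle: it is only correct if logical index i actually maps to hardware lane i of a single warp, which plain par does not promise, since iterations may be reordered or partly sequentialized. A sketch, reusing the hypothetical #align / #block pragmas and asm form from the mma example above (base and the buffers are made up):

fully par 32 |i| #align(32) #block {
  reg r = A[base + i]
  reg other = 0
  -- other = value of r held by lane (i xor 16); only meaningful if i is the lane id
  asm("shfl.sync.bfly.b32 ...", other, r, 16)
  B[i] = r + other
}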
With this specification, the problem is to map physical hardware thread / block ids to logical par dimensions, insert synchronization directives, and figure out the memory locations of buffers (whether they can be promoted to shared memory or registers).
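To make the target of this mapping concrete, here is one possible hand-written lowering of the matmul example (not compiler output): the two outer par loops map onto a 2D grid of blocks plus the threads within a block, and the accumulator is promoted to a register, as in the reg declarations of the mma sketch. The 16x16 tile and 256 threads per block are arbitrary choices, and M and N are assumed to be multiples of 16.

-- i = blockIdx.x * 16 + threadIdx.x / 16
-- j = blockIdx.y * 16 + threadIdx.x % 16
reg acc = 0                  -- C[i * N + j] promoted to a register
seq K |k| {
  acc += A[i * K + k] * B[k * N + j]
}
C[i * N + j] = acc           -- single write-back

Each thread owns exactly one (i, j) pair, which is why no synchronization directives are needed here; a tiled version that stages A and B in shared memory would need them.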
Theory and implementation
To figure out how to map physical threads to logical threads, I think it is possible to encode the problem as a quadratic integer program and use tools from polyhedral compilation. Suppose at level \(i\) we have \(T_i\) logical threads. For a par N |i| {e(i)} statement at that level, we must find a sequential factor \(S\) such that \(NT_{i+1} = T_iS\), and a permutation \(\sigma_\theta: \mathbb Z_{T_i} \times \mathbb Z_S \to \mathbb Z_N \times \mathbb Z_{T_{i+1}}\), with \((i, t') = \sigma_\theta(t, s)\), so we transform the program into
-- 0 <= t < T_i from upper levels
seq S |s| {
  i, t' = sigma(t, s)
  e(i)
}
where \(t'\) is used inductively for any inner levels of par in e(i). At level 0, where there are no more outer par blocks, \(T_0\) is the number of hardware threads. Alternatively, we can find parallel bands and apply this algorithm to each band, one band per level of parallelism offered by the hardware:
-- first parallel band start: map to CUDA blocks
par N0 |i| {
  par M0 |j| {
    -- first parallel band end
    -- second parallel band start: map to CUDA threads
    par TN |it| {
      par TM |jt| {
        -- second parallel band end
      }
    }
  }
}
We can decide these bands through compiler pragmas initially, or with heuristics that use polyhedral techniques to determine the dependence volume.
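To make the \(\sigma\) construction concrete, take \(T_0 = 128\) hardware threads and a single par 256 |i| { A[i] } loop. Then with \(T_1 = 1\) we need \(S = 2\) (since \(256 \cdot 1 = 128 \cdot 2\)), and two obvious permutations are the cyclic distribution \(i = 128s + t\) and the blocked distribution \(i = 2t + s\):

-- T_0 = 128 hardware threads, 0 <= t < 128
seq 2 |s| {
  i = 128 * s + t    -- cyclic sigma: thread t touches A[t] and A[128 + t]
  -- i = 2 * t + s   -- blocked sigma: thread t touches A[2t] and A[2t + 1]
  A[i]
}

Under the cyclic choice consecutive hardware threads touch consecutive addresses, under the blocked choice they are 2 apart; this is exactly the kind of difference the coalescing objective below is meant to distinguish.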
Once we have decided on the initial \(T_0\), for each par instance we have a family of parameterized permutations \(\sigma_{\theta_i}(t_i, s_i)\), which express the loop induction indices \(\mathbf i = (i_0, \dots, i_n, j_0, \dots, j_m)\) in terms of sequential induction variables \(0 \leq s_i < S_i\) and the initial parallel index variable \(0 \leq t < T_0\), where the \(i_k\) are the original parallel indices and the \(j_k\) are user-introduced sequential indices. Currently, seq loops are never parallelized automatically. In our program we have access functions of the form \(f_i(\mathbf i) \in \mathbb Z_{m}\), which relate the loop induction variables to the addresses accessed in each buffer and capture both reads and writes.
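For the matmul example from the introduction, with \(\mathbf i = (i, j, k)\) (where \(k\) is the user-introduced sequential index), the access functions are simply
\[f_C(i, j, k) = iN + j, \qquad f_A(i, j, k) = iK + k, \qquad f_B(i, j, k) = kN + j,\]
where \(f_C\) captures both a read and a write (the accumulation) and \(f_A\), \(f_B\) capture reads.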
If the parametric permutations \(\sigma_{\theta_i}\) and access functions \(f_i\) are quasi-affine in their index variables, then we can express the relations between their inputs and outputs as a Presburger relation of the form \(D = \{ \mathbf i \to t : A(\theta)(\mathbf i, t) + b \geq 0 \}\), noting the dependence on the parameters \(\theta\). Similarly, we can express optimization objectives like coalesced memory access as Presburger relations. For example, for a loop of the form
par N0 |i0| {
  seq M0 |j0| {
    ...
    par Nk |ik| {
      A[f(i0, ..., ik, j0, ..., jm)]
    }
  }
}
which is transformed into
-- t = threadIdx.x
seq S0 |s0| {
  i0, t0 = sigma0(t, s0)
  seq M0 |j0| {
    ...
    seq Sk |sk| {
      ik, tk = sigmak(t_{k-1}, sk)
      A[f(i0, ..., ik, j0, ..., jm)]
    }
  }
}
we can refactor the permutations into a relation of the form
\[\Sigma = \{ (t, s_0, \dots, s_k) \to (i_0, \dots, i_k) : A(\theta) [t, \mathbf s, \mathbf i]^T + b \geq 0 \},\]
since the composition of quasi-affine functions is quasi-affine. Similarly, we can compose this with the access relation \(A = \{ (\mathbf i, \mathbf j) \to z : z = f(\mathbf i, \mathbf j) \}\) to get
\[\mathbf A = \{ (t, \mathbf s, \mathbf j) \to z : \exists \mathbf v : A(\theta) [t, \mathbf s, \mathbf j, \mathbf v, z]^T + b \geq 0 \},\]
where \(\mathbf v\) are the intermediate relation variables (here, the eliminated loop indices \(\mathbf i\)). Then the function that maps the iteration variables \((t, \mathbf s, \mathbf j)\) to the access index is a simple affine form:
\[f'(t, \mathbf s, \mathbf j) = z \quad \text{where} \quad \big((t, \mathbf s, \mathbf j) \to z\big) \in \mathbf A.\]
To encode the memory coalescing objective we write another quasi-affine form:
\[\phi(\mathbf s, \mathbf j) = \sum_{w=0}^{T_0/32-1} \sum_{l=0}^{30} \left|f'(32w + l + 1, \mathbf s, \mathbf j) - f'(32w + l, \mathbf s, \mathbf j)\right|,\]
which measures the distance between addresses accessed by consecutive hardware threads within each warp; minimizing it encourages coalesced access. We would like to state something like \(\min_{\theta} \sum_{(\mathbf s, \mathbf j) \in D} \phi(\mathbf s, \mathbf j)\), but this is too expensive for large iteration domains. Another approach is \(\min d : \forall (\mathbf s, \mathbf j) \in D, \phi(\mathbf s, \mathbf j) \leq d\), which is the direction proposed by Feautrier [1]. First we turn the quasi-affine form \(d - \phi(\mathbf s, \mathbf j)\) into a linear form \(\psi\) over the relation \(\{ (\mathbf s, \mathbf j, d) \to y : d - \phi(\mathbf s, \mathbf j) = y \} = \{ (\mathbf s, \mathbf j, \mathbf e, d, y) : A(\theta)(\mathbf s, \mathbf j, \mathbf e, d, y) + b \geq 0 \} = D'\), since we can convert a quasi-affine form into a relation with extra existential variables (\(\mathbf e\)). Then \(\psi(\mathbf s, \mathbf j, \mathbf e, d, y) = y\), simple.
The minimization problem turns into \(\min d : \forall (\mathbf s, \mathbf j, \mathbf e, d, y) \in D', \psi(\mathbf s, \mathbf j, \mathbf e, d, y) \geq 0\). By Farkas' lemma, \(\exists \lambda, \lambda_0 \geq 0 : \psi(\mathbf s, \mathbf j, \mathbf e, d, y) = \lambda^T(A(\theta)(\mathbf s, \mathbf j, \mathbf e, d, y) + b) + \lambda_0\). Equating the two sides we get \(\lambda^T(A(\theta)(\mathbf s, \mathbf j, \mathbf e, d, y) + b) + \lambda_0 = y\), with \(\lambda, \lambda_0 \geq 0\) as constraints. Since there is an implicit universal quantification over the iteration variables \((\mathbf s, \mathbf j, \mathbf e, d, y)\), we can match terms on the left and right hand sides: the component of \(\lambda^TA(\theta)\) corresponding to \(y\) must equal 1, every other component must equal 0, and \(\lambda^Tb + \lambda_0 = 0\). Depending on what \(A(\theta)\) is, this is at worst a quadratic integer program.
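As a toy illustration of the term matching (much smaller than the real \(D'\), but mechanically identical): minimize \(d\) subject to \(d - cs \geq 0\) for all \(0 \leq s \leq S - 1\), where \(c\) stands in for a parameter-dependent coefficient. The domain constraints are \(s \geq 0\) and \(S - 1 - s \geq 0\), so Farkas' lemma gives
\[d - cs = \lambda_1 s + \lambda_2 (S - 1 - s) + \lambda_0, \qquad \lambda_1, \lambda_2, \lambda_0 \geq 0.\]
Matching the coefficients of \(s\) gives \(-c = \lambda_1 - \lambda_2\), and matching the constant terms gives \(d = \lambda_2(S - 1) + \lambda_0\). Hence \(\lambda_2 \geq \max(c, 0)\), and the smallest feasible \(d\) is \(\max(c, 0)\,(S - 1)\), which is exactly \(\max_{0 \leq s \leq S-1} cs\), as expected.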
Parameterization of \(\sigma\)
In the formulation above, \(\sigma\) must be a quasi-affine function of its inputs, but it can be a polynomial of arbitrary degree in the parameters, and the resulting optimization problem will still be at worst a quadratic integer program (each monomial in \(\theta\) can be replaced by a fresh parameter, with quadratic constraints tying it to its factors, so only bilinear \(\lambda\)-parameter products remain). How well this works in practice I'm not too sure, since quadratic integer programs can be very difficult to solve, and an over-parameterized \(\sigma\) may introduce a lot of existential variables (their number growing linearly with the number of operations).
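As an example of what a parameterization might look like, in the simple \(T_{i+1} = 1\) setting from earlier (with \(T\) hardware threads and \(S\) sequential steps), a two-point family that merely selects between the cyclic and blocked distributions is already enough for the solver to make the coalescing trade-off:
\[i = \theta_0 (Ts + t) + \theta_1 (St + s), \qquad \theta_0 + \theta_1 = 1, \quad \theta_0, \theta_1 \in \{0, 1\},\]
which is affine in \((t, s)\) for any fixed \(\theta\) and affine in \(\theta\), so the Farkas constraints only pick up bilinear \(\lambda\)-\(\theta\) products. Richer, higher-degree families buy more layouts at the cost of more parameters and, as noted above, more existential variables.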
References
[1] Paul Feautrier. Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time. International Journal of Parallel Programming, 21(5), 1992. doi:10.1007/BF01407835.