Deep Learning Recommendation Model (DLRM) is a deep neural network for personalization and recommendation developed by Facebook. Due to its importance for large internet companies, it is one of the benchmarks in MLPerf. It is a simple network with 26 Embedding layers, 8 Linear layers and a Batched Matrix Multiply for calculating pairwise interactions. Even so, a single training step (forward and backward propagation) invokes about 800 GPU CUDA kernels (the exact number depends on the batch size). The best way to understand a DL network and its GPU performance is to understand every single CUDA kernel, i.e. which layer of the network invoked the kernel, with what arguments (tensor shapes and datatypes) and in which direction (forward propagation or backward propagation).
In this blog post, I will categorize every kernel used in the training of
DLRM. All the information in the tables below was obtained using Nvidia's
PyTorch Profiler, PyProf, on a Turing T4 GPU. The information below
is only a subset of what is provided by PyProf. The code and instructions
for obtaining a detailed profile are here. Note that different GPUs
will have slightly different kernel names, e.g. volta_* as opposed to
turing_*.
The parameters in the profiled code are listed below. These parameters
were obtained from Nvidia Deep Learning Examples. The only
constraint is that the output size of the Bottom MLP must equal the
embedding size, because the Bottom MLP output is combined with the
embedding vectors in the pairwise interaction layer. The vocabulary sizes
of the categorical embeddings are
for a synthetic dataset and not the real Criteo Terabyte Dataset.
For the purpose of this article, I modified the shape of the categorical
embeddings from [100000]*26 to [100000 + 100*x for x in range(26)],
so that it's easier to differentiate and understand the kernels
associated with each of the 26 embeddings, both in forward and backward
propagation. With the Criteo dataset, the categorical embedding tables
become really big. With an embedding size of 128, the embedding tables have
roughly 24 billion parameters (about 96 GB at 4 bytes per parameter), and
it is not possible to store them on
a single GPU.
Bottom MLP = [512, 256, 128]
Top MLP = [1024, 1024, 512, 256, 1]
Numerical Features (N) = 13
Categorical Features (M) = 26
Categorical Embeddings = [100000 + 100*x for x in range(26)]
Embedding Size = 128
Batch Size = 32768
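To make the memory argument concrete, here is a minimal sketch of the arithmetic. It assumes every parameter is stored as fp32 (4 bytes), which is what the 24 billion parameters ~ 96 GB figure implies. The helper name `table_bytes` is only illustrative and is not part of the profiled code.

```python
# Assumption: parameters are stored as fp32, i.e. 4 bytes each.
BYTES_PER_PARAM = 4
EMBEDDING_SIZE = 128


def table_bytes(vocab_sizes, embedding_size=EMBEDDING_SIZE):
    """Total bytes needed to store all embedding tables."""
    num_params = sum(vocab_sizes) * embedding_size
    return num_params * BYTES_PER_PARAM


# Synthetic vocabulary sizes used in this post.
vocab_sizes = [100000 + 100 * x for x in range(26)]
synthetic_params = sum(vocab_sizes) * EMBEDDING_SIZE
synthetic_gb = table_bytes(vocab_sizes) / 1e9

# Criteo-scale estimate quoted in the text: ~24 billion parameters.
criteo_gb = 24e9 * BYTES_PER_PARAM / 1e9

print(f"synthetic: {synthetic_params:,} params, {synthetic_gb:.2f} GB")
print(f"criteo:    ~{criteo_gb:.0f} GB")
```

The synthetic tables need a little over 1 GB, which is why they fit on a single GPU while the Criteo-sized tables do not.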
The tables below show the GPU kernels invoked in 1 training step. For every GPU kernel we show the direction (fprop, bprop), name of the layer, name of the operation, and the input tensor shapes / matrix dimensions for the operation. PyProf provides a lot of additional information for every GPU kernel e.g. grid dimensions, block dimensions, silicon time, datatypes, flops, bytes, tensor core usage and so on.
Kernels 1 through 11 correspond to the forward propagation through the 3 Linear layers in the Bottom MLP. Kernels 2, 3 and 4 correspond to bias, linear and relu operations of the first Linear layer respectively. Kernels 5-7, 8-10 correspond to the second and third Linear layers respectively. Kernels 1 and 11 correspond to data type conversions from fp32 to fp16 and back.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 1 | fprop | Bottom_MLP | to | T=(32768,13) | legacy::elementwise_kernel |
| 2 | fprop | Bottom_MLP | bias | M=512,N=32768 | legacy::elementwise_kernel |
| 3 | fprop | Bottom_MLP | linear | M=512,N=32768,K=13 | volta_fp16_sgemm_fp16_128x128_tn |
| 4 | fprop | Bottom_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
| 5 | fprop | Bottom_MLP | bias | M=256,N=32768 | legacy::elementwise_kernel |
| 6 | fprop | Bottom_MLP | linear | M=256,N=32768,K=512 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_tn |
| 7 | fprop | Bottom_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
| 8 | fprop | Bottom_MLP | bias | M=128,N=32768 | legacy::elementwise_kernel |
| 9 | fprop | Bottom_MLP | linear | M=128,N=32768,K=256 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
| 10 | fprop | Bottom_MLP | relu | T=(32768,128) | modern::elementwise_kernel |
| 11 | fprop | Bottom_MLP | to | T=(32768,128) | legacy::elementwise_kernel |
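The M, N, K values in the linear rows follow directly from the layer sizes. The sketch below reproduces them under the convention that the table appears to use (M = output features, N = batch size, K = input features). The function name `fprop_gemm_dims` is hypothetical.

```python
def fprop_gemm_dims(in_features, layer_sizes, batch_size):
    """Return (M, N, K) for the forward GEMM of each Linear layer.

    Convention read off the table above: M = output features,
    N = batch size, K = input features.
    """
    dims = []
    k = in_features
    for m in layer_sizes:
        dims.append((m, batch_size, k))
        k = m  # the output of this layer feeds the next one
    return dims


bottom = fprop_gemm_dims(13, [512, 256, 128], 32768)
print(bottom)  # [(512, 32768, 13), (256, 32768, 512), (128, 32768, 256)]
```

Calling the same helper with `in_features=480` and the Top MLP sizes yields the GEMM shapes of the Top MLP.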
Kernel 12 corresponds to a data type cast operation on the index tensor. Kernels 13 through 38 correspond to the forward propagation through the 26 Embedding layers. In the Params column, I is the shape of the index tensor (the batch size) and E is the shape of the embedding table (vocabulary size x embedding size).
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 12 | fprop | - | to | T=(32768,26) | legacy::elementwise_kernel |
| 13 | fprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | indexSelectLargeIndex |
| 14 | fprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | indexSelectLargeIndex |
| 15 | fprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | indexSelectLargeIndex |
| 16 | fprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | indexSelectLargeIndex |
| 17 | fprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | indexSelectLargeIndex |
| 18 | fprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | indexSelectLargeIndex |
| 19 | fprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | indexSelectLargeIndex |
| 20 | fprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | indexSelectLargeIndex |
| 21 | fprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | indexSelectLargeIndex |
| 22 | fprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | indexSelectLargeIndex |
| 23 | fprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | indexSelectLargeIndex |
| 24 | fprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | indexSelectLargeIndex |
| 25 | fprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | indexSelectLargeIndex |
| 26 | fprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | indexSelectLargeIndex |
| 27 | fprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | indexSelectLargeIndex |
| 28 | fprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | indexSelectLargeIndex |
| 29 | fprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | indexSelectLargeIndex |
| 30 | fprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | indexSelectLargeIndex |
| 31 | fprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | indexSelectLargeIndex |
| 32 | fprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | indexSelectLargeIndex |
| 33 | fprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | indexSelectLargeIndex |
| 34 | fprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | indexSelectLargeIndex |
| 35 | fprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | indexSelectLargeIndex |
| 36 | fprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | indexSelectLargeIndex |
| 37 | fprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | indexSelectLargeIndex |
| 38 | fprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | indexSelectLargeIndex |
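Conceptually, each embedding lookup gathers one row of the table per index, which is consistent with the index-select kernel name in the table. The following is a pure-Python sketch with toy sizes (hypothetical, chosen only to keep the output readable). It is not the GPU implementation.

```python
def embedding_lookup(table, indices):
    """Gather one row of `table` per index (a plain index-select)."""
    return [table[i] for i in indices]


# Toy sizes: vocabulary of 5, embedding size of 3, batch of 4.
table = [[float(10 * row + col) for col in range(3)] for row in range(5)]
indices = [4, 0, 4, 2]

out = embedding_lookup(table, indices)
print(out)  # one row of length 3 per index
```

With the parameters in this post, `indices` would have 32768 entries and each row would have 128 elements.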
Kernels 39 through 44 correspond to the pairwise interaction layer. Pairwise interaction is implemented as a dot product of every vector with all other vectors. If there are $N$ vectors (of the same length), then we will have ${N \choose 2} = \frac{N\times(N-1)}{2}$ interactions (scalar outputs). This can be implemented by creating a matrix $A$ whose rows are the $N$ vectors, calculating $A A^T$ and then taking the lower (or upper) triangular part, excluding the diagonal, since the output is symmetric. In some sense, this is similar to attention. When this operation is done for every sample in the batch (32768 in this example), it is implemented as a Batched Matrix Multiplication (BMM).
Kernel 39 concatenates the outputs of the 26 embedding lookups and the Bottom MLP, resulting in 27 vectors. Kernel 40 computes the pairwise interaction of the 27 vectors, each of length 128, using a BMM operation. Kernel 41 extracts the lower (or upper) triangular part, which gives us ${27 \choose 2} = 351$ values.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 39 | fprop | Interaction | cat | T=[(32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128)] | CatArrayBatchedCopy |
| 40 | fprop | Interaction | bmm | B=32768,M=27,N=27,K=128 | volta_sgemm_128x64_tn |
| 41 | fprop | Interaction | index | na=na | legacy::elementwise_kernel |
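The interaction for a single sample can be sketched in a few lines of plain Python. This is only an illustration of the math (the product of the matrix with its transpose, followed by a triangular selection), not the batched GPU implementation, and the function name `pairwise_interactions` is an assumption.

```python
from math import comb


def pairwise_interactions(vectors):
    """Dot product of every vector with every other vector.

    Computes Z = A @ A^T for one sample and keeps the strictly lower
    triangle (row > col), which holds each unordered pair exactly once.
    """
    n = len(vectors)
    z = [[sum(a * b for a, b in zip(vectors[i], vectors[j]))
          for j in range(n)] for i in range(n)]
    return [z[i][j] for i in range(n) for j in range(i)]


# Toy example: 3 vectors of length 2 give C(3, 2) = 3 interactions.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
print(pairwise_interactions(toy))  # [0.0, 3.0, 8.0]

# In this post there are 27 vectors per sample.
print(comb(27, 2))  # 351
```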
Kernels 42 through 44 concatenate the output of the Pairwise Interaction layer (351 values) with the output of the Bottom MLP (128 values) and pad it with 1 element before feeding it to the Top MLP. This gives the Top MLP an input width of 128 + 351 + 1 = 480.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 42 | fprop | Concat | cat | T=[(32768,128),(32768,351),(32768,1)] | legacy::elementwise_kernel |
| 43 | fprop | Concat | cat | T=[(32768,128),(32768,351),(32768,1)] | legacy::elementwise_kernel |
| 44 | fprop | Concat | cat | T=[(32768,128),(32768,351),(32768,1)] | legacy::elementwise_kernel |
Kernels 45 through 60 correspond to the forward propagation through the 5 Linear layers in the Top MLP. Kernels 46, 47 and 48 correspond to bias, linear and relu operations of the first Linear layer respectively. Kernels 49-51, 52-54, 55-57 and 58-59 correspond to the second, third, fourth and fifth Linear layers respectively; the fifth Linear layer has no relu. Kernels 45 and 60 correspond to data type conversions from fp32 to fp16 and back.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 45 | fprop | Top_MLP | to | T=(32768,480) | legacy::elementwise_kernel |
| 46 | fprop | Top_MLP | bias | M=1024,N=32768 | legacy::elementwise_kernel |
| 47 | fprop | Top_MLP | linear | M=1024,N=32768,K=480 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
| 48 | fprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
| 49 | fprop | Top_MLP | bias | M=1024,N=32768 | legacy::elementwise_kernel |
| 50 | fprop | Top_MLP | linear | M=1024,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
| 51 | fprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
| 52 | fprop | Top_MLP | bias | M=512,N=32768 | legacy::elementwise_kernel |
| 53 | fprop | Top_MLP | linear | M=512,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
| 54 | fprop | Top_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
| 55 | fprop | Top_MLP | bias | M=256,N=32768 | legacy::elementwise_kernel |
| 56 | fprop | Top_MLP | linear | M=256,N=32768,K=512 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_tn |
| 57 | fprop | Top_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
| 58 | fprop | Top_MLP | bias | M=1,N=32768 | legacy::elementwise_kernel |
| 59 | fprop | Top_MLP | linear | M=1,N=32768,K=256 | gemv2T_kernel_val |
| 60 | fprop | Top_MLP | to | T=(32768,1) | legacy::elementwise_kernel |
Kernels 61 through 75 compute the BCE loss.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 61 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 62 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | kernelPointwiseApply1 |
| 63 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 64 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | legacy::elementwise_kernel |
| 65 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 66 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 67 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 68 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | kernelPointwiseApply1 |
| 69 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 70 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 71 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | kernelPointwiseApply1 |
| 72 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 73 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 74 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 75 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | reduce_kernel |
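For reference, here is one common numerically stable way to write binary cross-entropy on logits with a mean reduction. It is a sketch of the math only. I am not claiming that the 15 kernels above map one-to-one onto these steps, and the mean reduction is an assumption.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which is algebraically
    equal to -[y*log(sigmoid(x)) + (1-y)*log(1 - sigmoid(x))] but does
    not overflow for large |x|.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


loss = bce_with_logits([0.0, 2.0, -3.0], [1.0, 1.0, 0.0])
print(loss)
```

In the profiled step, both input tensors have 32768 elements, one per sample in the batch.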
In this profile, we are not accumulating gradients, and hence kernels 76 through 117 zero out the buffers used for storing the gradients during back propagation. Note that there are separate buffers for storing the gradients of the weights, biases and embeddings. Kernels 76-81, 82-91, 92-117 correspond to the Bottom MLP, Top MLP and the Embedding layers respectively.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 76 | fprop | - | zero | T=[(512,13)] | modern::elementwise_kernel |
| 77 | fprop | - | zero | T=[(512)] | modern::elementwise_kernel |
| 78 | fprop | - | zero | T=[(256,512)] | modern::elementwise_kernel |
| 79 | fprop | - | zero | T=[(256)] | modern::elementwise_kernel |
| 80 | fprop | - | zero | T=[(128,256)] | modern::elementwise_kernel |
| 81 | fprop | - | zero | T=[(128)] | modern::elementwise_kernel |
| 82 | fprop | - | zero | T=[(1024,480)] | modern::elementwise_kernel |
| 83 | fprop | - | zero | T=[(1024)] | modern::elementwise_kernel |
| 84 | fprop | - | zero | T=[(1024,1024)] | modern::elementwise_kernel |
| 85 | fprop | - | zero | T=[(1024)] | modern::elementwise_kernel |
| 86 | fprop | - | zero | T=[(512,1024)] | modern::elementwise_kernel |
| 87 | fprop | - | zero | T=[(512)] | modern::elementwise_kernel |
| 88 | fprop | - | zero | T=[(256,512)] | modern::elementwise_kernel |
| 89 | fprop | - | zero | T=[(256)] | modern::elementwise_kernel |
| 90 | fprop | - | zero | T=[(1,256)] | modern::elementwise_kernel |
| 91 | fprop | - | zero | T=[(1)] | modern::elementwise_kernel |
| 92 | fprop | - | zero | T=[(100000,128)] | modern::elementwise_kernel |
| 93 | fprop | - | zero | T=[(100100,128)] | modern::elementwise_kernel |
| 94 | fprop | - | zero | T=[(100200,128)] | modern::elementwise_kernel |
| 95 | fprop | - | zero | T=[(100300,128)] | modern::elementwise_kernel |
| 96 | fprop | - | zero | T=[(100400,128)] | modern::elementwise_kernel |
| 97 | fprop | - | zero | T=[(100500,128)] | modern::elementwise_kernel |
| 98 | fprop | - | zero | T=[(100600,128)] | modern::elementwise_kernel |
| 99 | fprop | - | zero | T=[(100700,128)] | modern::elementwise_kernel |
| 100 | fprop | - | zero | T=[(100800,128)] | modern::elementwise_kernel |
| 101 | fprop | - | zero | T=[(100900,128)] | modern::elementwise_kernel |
| 102 | fprop | - | zero | T=[(101000,128)] | modern::elementwise_kernel |
| 103 | fprop | - | zero | T=[(101100,128)] | modern::elementwise_kernel |
| 104 | fprop | - | zero | T=[(101200,128)] | modern::elementwise_kernel |
| 105 | fprop | - | zero | T=[(101300,128)] | modern::elementwise_kernel |
| 106 | fprop | - | zero | T=[(101400,128)] | modern::elementwise_kernel |
| 107 | fprop | - | zero | T=[(101500,128)] | modern::elementwise_kernel |
| 108 | fprop | - | zero | T=[(101600,128)] | modern::elementwise_kernel |
| 109 | fprop | - | zero | T=[(101700,128)] | modern::elementwise_kernel |
| 110 | fprop | - | zero | T=[(101800,128)] | modern::elementwise_kernel |
| 111 | fprop | - | zero | T=[(101900,128)] | modern::elementwise_kernel |
| 112 | fprop | - | zero | T=[(102000,128)] | modern::elementwise_kernel |
| 113 | fprop | - | zero | T=[(102100,128)] | modern::elementwise_kernel |
| 114 | fprop | - | zero | T=[(102200,128)] | modern::elementwise_kernel |
| 115 | fprop | - | zero | T=[(102300,128)] | modern::elementwise_kernel |
| 116 | fprop | - | zero | T=[(102400,128)] | modern::elementwise_kernel |
| 117 | fprop | - | zero | T=[(102500,128)] | modern::elementwise_kernel |
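The 42 zeroing kernels line up with the number of gradient buffers in the model. A small sketch that enumerates the buffer shapes from the parameters of this post (the helper name `linear_grad_shapes` is hypothetical):

```python
def linear_grad_shapes(in_features, layer_sizes):
    """Weight and bias gradient shapes for a stack of Linear layers."""
    shapes = []
    for out_features in layer_sizes:
        shapes.append((out_features, in_features))  # weight gradient
        shapes.append((out_features,))              # bias gradient
        in_features = out_features
    return shapes


bottom = linear_grad_shapes(13, [512, 256, 128])
top = linear_grad_shapes(480, [1024, 1024, 512, 256, 1])
embeddings = [(100000 + 100 * x, 128) for x in range(26)]

buffers = bottom + top + embeddings
print(len(bottom), len(top), len(embeddings), len(buffers))  # 6 10 26 42
```

The counts 6, 10 and 26 match the kernel ranges 76-81, 82-91 and 92-117.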
Kernels 118 through 122 most likely correspond to loss scaling and the beginning of back propagation.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 118 | fprop | - | __imul__ | T=[(1)] | legacy::elementwise_kernel |
| 119 | fprop | - | __mul__ | T=[(1)] | legacy::elementwise_kernel |
| 120 | fprop | - | backward | - | legacy::elementwise_kernel |
| 121 | bprop | - | __mul__ | T=[(1)] | legacy::elementwise_kernel |
| 122 | bprop | - | __imul__ | T=[(1)] | legacy::elementwise_kernel |
Kernels 123 through 126 correspond to the backward propagation through the Loss layer.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 123 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
| 124 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | legacy::elementwise_kernel |
| 125 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | legacy::elementwise_kernel |
| 126 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
Kernels 127 through 149 correspond to the backward propagation through the Top MLP layer. Kernels 128-130 calculate the data gradient and weight gradient through the last Linear layer. Kernel 131 calculates the bias gradient. Likewise kernels 132-136, 137-140, 141-144, 145-148 correspond to the backward propagation through the fourth, third, second, and the first Linear layer of the Top MLP respectively.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 127 | bprop | - | to | na=na | legacy::elementwise_kernel |
| 128 | bprop | Top_MLP | linear | M=256,N=32768,K=1 | gemmk1_kernel |
| 129 | bprop | Top_MLP | linear | M=256,N=1,K=32768 | gemv2N_kernel |
| 130 | bprop | Top_MLP | linear | X=(32768,256),W=(1,256) | splitKreduce_kernel |
| 131 | bprop | - | sum | na=na | reduce_kernel |
| 132 | bprop | Top_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
| 133 | bprop | Top_MLP | linear | M=512,N=32768,K=256 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
| 134 | bprop | Top_MLP | linear | M=512,N=256,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
| 135 | bprop | Top_MLP | linear | X=(32768,512),W=(256,512) | splitKreduce_kernel |
| 136 | bprop | - | sum | na=na | reduce_kernel |
| 137 | bprop | Top_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
| 138 | bprop | Top_MLP | linear | M=1024,N=32768,K=512 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
| 139 | bprop | Top_MLP | linear | M=1024,N=512,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
| 140 | bprop | - | sum | na=na | reduce_kernel |
| 141 | bprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
| 142 | bprop | Top_MLP | linear | M=1024,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
| 143 | bprop | Top_MLP | linear | M=1024,N=1024,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
| 144 | bprop | - | sum | na=na | reduce_kernel |
| 145 | bprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
| 146 | bprop | Top_MLP | linear | M=480,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
| 147 | bprop | Top_MLP | linear | M=480,N=1024,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
| 148 | bprop | - | sum | na=na | reduce_kernel |
| 149 | bprop | - | to | na=na | legacy::elementwise_kernel |
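The GEMM shapes in the backward pass can be derived from the layer sizes in the same way as in the forward pass. The sketch below assumes the convention that the table appears to follow: the data gradient uses M = input features, N = batch size, K = output features, and the weight gradient uses M = input features, N = output features, K = batch size. The function name is illustrative.

```python
def bprop_gemm_dims(in_features, layer_sizes, batch_size):
    """Return (data_grad, weight_grad) GEMM dims, last layer first."""
    dims = []
    k = in_features
    for out_features in layer_sizes:
        data_grad = (k, batch_size, out_features)    # M=in, N=batch, K=out
        weight_grad = (k, out_features, batch_size)  # M=in, N=out, K=batch
        dims.append((data_grad, weight_grad))
        k = out_features
    return dims[::-1]  # back propagation visits the layers in reverse


top_bprop = bprop_gemm_dims(480, [1024, 1024, 512, 256, 1], 32768)
for data_grad, weight_grad in top_bprop:
    print(data_grad, weight_grad)
```

The first pair printed corresponds to kernels 128 and 129, and the last pair to kernels 146 and 147.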
Kernels 150 through 165 correspond to the backward propagation through the pairwise interaction layer. Kernels 150-161 probably correspond to the lower (or upper) triangular matrix indexing. Kernels 162 and 163 correspond to the two gradients through the BMM operation. The gradients correspond to the same input and therefore kernels 164 and 165 are probably adding the gradients.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 150 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
| 151 | bprop | Interaction | index | na=na | legacy::elementwise_kernel |
| 152 | bprop | Interaction | index | na=na | kernelPointwiseApply2 |
| 153 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
| 154 | bprop | Interaction | index | na=na | kernelPointwiseApply2 |
| 155 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
| 156 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
| 157 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
| 158 | bprop | Interaction | index | na=na | thrust::cuda_cub::core::_kernel_agent |
| 159 | bprop | Interaction | index | na=na | thrust::cuda_cub::core::_kernel_agent |
| 160 | bprop | Interaction | index | na=na | indexing_backward_kernel |
| 161 | bprop | - | Slice | na=na | modern::elementwise_kernel |
| 162 | bprop | Interaction | bmm | B=32768,M=27,N=27,K=128 | volta_sgemm_128x64_nt |
| 163 | bprop | Interaction | bmm | B=32768,M=27,N=27,K=128 | volta_sgemm_128x64_nn |
| 164 | bprop | - | add | na=na | legacy::elementwise_kernel |
| 165 | bprop | - | add | na=na | legacy::elementwise_kernel |
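Why two BMM gradients that are then added? In the product of $A$ with its own transpose, the matrix $A$ appears as both operands, so it receives one gradient through each of them. For $Z = A A^T$ with upstream gradient $G$, the total is $G A + G^T A$. A minimal single-sample sketch in plain Python (helper names are assumptions, and I do not claim which of the two kernels computes which term):

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def interaction_backward(a, grad_z):
    """Gradient of Z = A @ A^T with respect to A, given dL/dZ."""
    left = matmul(grad_z, a)              # contribution via the left operand
    right = matmul(transpose(grad_z), a)  # contribution via the right operand
    return [[l + r for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]


# Toy case: only Z[1][0] (the dot product of rows 1 and 0) receives gradient.
a = [[1.0, 2.0], [3.0, 4.0]]
grad_z = [[0.0, 0.0], [1.0, 0.0]]
grad_a = interaction_backward(a, grad_z)
print(grad_a)  # [[3.0, 4.0], [1.0, 2.0]]
```

In the toy case each row receives the other row as its gradient, as expected for a dot product.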
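Kernels 166 through 711 correspond to the backward propagation through the 26 Embedding layers. The layers are processed in reverse order, from Embedding_25 down to Embedding_0, and each one invokes the same sequence of 21 kernels (26 x 21 = 546 kernels in total).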
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 166 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | legacy::elementwise_kernel |
| 167 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | legacy::elementwise_kernel |
| 168 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 169 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 170 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 171 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 172 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 173 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 174 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 175 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 176 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 177 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 178 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | modern::elementwise_kernel |
| 179 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 180 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 181 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | krn_partials_per_segment |
| 182 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 183 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
| 184 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | krn_partial_segment_offset |
| 185 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | compute_grad_weight |
| 186 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | sum_and_scatter |
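The kernel names above (the `thrust` agents, `krn_partials_per_segment`, `compute_grad_weight`, `sum_and_scatter`) hint at a sort-and-segment approach: indices are sorted so that duplicate lookups become adjacent, and each run of equal indices is summed into one row of the weight gradient. That reading is my assumption. The sketch below only illustrates the idea in plain Python with toy sizes.

```python
from itertools import groupby


def embedding_backward(indices, grad_out, vocab_size):
    """Accumulate output gradients into the rows of the weight gradient."""
    emb = len(grad_out[0])
    grad_weight = [[0.0] * emb for _ in range(vocab_size)]
    # Sort batch positions by the table row they looked up.
    order = sorted(range(len(indices)), key=lambda pos: indices[pos])
    # Each group is one segment: all batch positions that hit the same row.
    for row, segment in groupby(order, key=lambda pos: indices[pos]):
        for pos in segment:
            for col in range(emb):
                grad_weight[row][col] += grad_out[pos][col]
    return grad_weight


# Toy sizes: vocabulary of 5, embedding size of 2, batch of 4.
indices = [4, 0, 4, 2]
grad_out = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
grad_weight = embedding_backward(indices, grad_out, vocab_size=5)
print(grad_weight)
```

Row 4 was looked up twice, so its gradient is the sum of the two corresponding output gradients. Rows that were never looked up keep a zero gradient.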
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 187 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | legacy::elementwise_kernel |
| 188 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | legacy::elementwise_kernel |
| 189 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 190 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 191 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 192 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 193 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 194 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 195 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 196 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 197 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 198 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 199 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | modern::elementwise_kernel |
| 200 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 201 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 202 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | krn_partials_per_segment |
| 203 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 204 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
| 205 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | krn_partial_segment_offset |
| 206 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | compute_grad_weight |
| 207 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 208 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | legacy::elementwise_kernel |
| 209 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | legacy::elementwise_kernel |
| 210 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 211 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 212 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 213 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 214 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 215 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 216 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 217 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 218 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 219 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 220 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | modern::elementwise_kernel |
| 221 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 222 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 223 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | krn_partials_per_segment |
| 224 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 225 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
| 226 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | krn_partial_segment_offset |
| 227 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | compute_grad_weight |
| 228 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 229 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | legacy::elementwise_kernel |
| 230 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | legacy::elementwise_kernel |
| 231 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 232 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 233 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 234 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 235 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 236 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 237 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 238 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 239 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 240 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 241 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | modern::elementwise_kernel |
| 242 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 243 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 244 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | krn_partials_per_segment |
| 245 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 246 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
| 247 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | krn_partial_segment_offset |
| 248 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | compute_grad_weight |
| 249 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 250 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | legacy::elementwise_kernel |
| 251 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | legacy::elementwise_kernel |
| 252 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 253 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 254 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 255 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 256 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 257 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 258 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 259 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 260 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 261 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 262 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | modern::elementwise_kernel |
| 263 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 264 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 265 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | krn_partials_per_segment |
| 266 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 267 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
| 268 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | krn_partial_segment_offset |
| 269 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | compute_grad_weight |
| 270 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 271 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | legacy::elementwise_kernel |
| 272 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | legacy::elementwise_kernel |
| 273 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 274 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 275 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 276 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 277 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 278 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 279 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 280 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 281 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 282 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 283 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | modern::elementwise_kernel |
| 284 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 285 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 286 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | krn_partials_per_segment |
| 287 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 288 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
| 289 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | krn_partial_segment_offset |
| 290 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | compute_grad_weight |
| 291 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 292 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | legacy::elementwise_kernel |
| 293 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | legacy::elementwise_kernel |
| 294 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 295 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 296 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 297 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 298 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 299 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 300 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 301 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 302 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 303 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 304 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | modern::elementwise_kernel |
| 305 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 306 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 307 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | krn_partials_per_segment |
| 308 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 309 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
| 310 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | krn_partial_segment_offset |
| 311 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | compute_grad_weight |
| 312 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 313 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | legacy::elementwise_kernel |
| 314 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | legacy::elementwise_kernel |
| 315 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 316 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 317 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 318 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 319 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 320 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 321 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 322 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 323 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 324 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 325 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | modern::elementwise_kernel |
| 326 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 327 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 328 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | krn_partials_per_segment |
| 329 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 330 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
| 331 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | krn_partial_segment_offset |
| 332 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | compute_grad_weight |
| 333 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 334 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | legacy::elementwise_kernel |
| 335 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | legacy::elementwise_kernel |
| 336 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 337 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 338 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 339 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 340 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 341 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 342 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 343 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 344 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 345 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 346 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | modern::elementwise_kernel |
| 347 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 348 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 349 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | krn_partials_per_segment |
| 350 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 351 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
| 352 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | krn_partial_segment_offset |
| 353 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | compute_grad_weight |
| 354 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 355 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | legacy::elementwise_kernel |
| 356 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | legacy::elementwise_kernel |
| 357 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 358 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 359 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 360 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 361 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 362 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 363 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 364 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 365 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 366 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 367 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | modern::elementwise_kernel |
| 368 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 369 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 370 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | krn_partials_per_segment |
| 371 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 372 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
| 373 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | krn_partial_segment_offset |
| 374 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | compute_grad_weight |
| 375 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 376 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | legacy::elementwise_kernel |
| 377 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | legacy::elementwise_kernel |
| 378 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 379 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 380 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 381 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 382 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 383 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 384 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 385 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 386 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 387 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 388 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | modern::elementwise_kernel |
| 389 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 390 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 391 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | krn_partials_per_segment |
| 392 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 393 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
| 394 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | krn_partial_segment_offset |
| 395 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | compute_grad_weight |
| 396 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 397 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | legacy::elementwise_kernel |
| 398 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | legacy::elementwise_kernel |
| 399 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 400 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 401 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 402 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 403 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 404 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 405 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 406 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 407 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 408 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 409 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | modern::elementwise_kernel |
| 410 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 411 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 412 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | krn_partials_per_segment |
| 413 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 414 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
| 415 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | krn_partial_segment_offset |
| 416 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | compute_grad_weight |
| 417 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 418 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | legacy::elementwise_kernel |
| 419 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | legacy::elementwise_kernel |
| 420 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 421 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 422 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 423 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 424 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 425 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 426 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 427 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 428 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 429 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 430 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | modern::elementwise_kernel |
| 431 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 432 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 433 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | krn_partials_per_segment |
| 434 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 435 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
| 436 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | krn_partial_segment_offset |
| 437 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | compute_grad_weight |
| 438 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 439 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | legacy::elementwise_kernel |
| 440 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | legacy::elementwise_kernel |
| 441 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 442 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 443 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 444 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 445 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 446 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 447 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 448 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 449 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 450 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 451 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | modern::elementwise_kernel |
| 452 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 453 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 454 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | krn_partials_per_segment |
| 455 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 456 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
| 457 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | krn_partial_segment_offset |
| 458 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | compute_grad_weight |
| 459 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 460 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | legacy::elementwise_kernel |
| 461 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | legacy::elementwise_kernel |
| 462 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 463 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 464 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 465 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 466 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 467 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 468 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 469 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 470 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 471 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 472 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | modern::elementwise_kernel |
| 473 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 474 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 475 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | krn_partials_per_segment |
| 476 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 477 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
| 478 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | krn_partial_segment_offset |
| 479 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | compute_grad_weight |
| 480 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 481 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | legacy::elementwise_kernel |
| 482 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | legacy::elementwise_kernel |
| 483 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 484 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 485 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 486 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 487 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 488 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 489 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 490 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 491 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 492 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 493 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | modern::elementwise_kernel |
| 494 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 495 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 496 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | krn_partials_per_segment |
| 497 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 498 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
| 499 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | krn_partial_segment_offset |
| 500 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | compute_grad_weight |
| 501 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 502 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | legacy::elementwise_kernel |
| 503 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | legacy::elementwise_kernel |
| 504 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 505 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 506 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 507 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 508 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 509 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 510 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 511 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 512 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 513 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 514 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | modern::elementwise_kernel |
| 515 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 516 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 517 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | krn_partials_per_segment |
| 518 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 519 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
| 520 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | krn_partial_segment_offset |
| 521 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | compute_grad_weight |
| 522 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 523 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | legacy::elementwise_kernel |
| 524 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | legacy::elementwise_kernel |
| 525 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 526 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 527 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 528 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 529 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 530 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 531 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 532 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 533 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 534 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 535 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | modern::elementwise_kernel |
| 536 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 537 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 538 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | krn_partials_per_segment |
| 539 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 540 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
| 541 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | krn_partial_segment_offset |
| 542 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | compute_grad_weight |
| 543 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | sum_and_scatter |

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 544 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | legacy::elementwise_kernel |
| 545 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | legacy::elementwise_kernel |
| 546 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 547 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 548 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 549 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 550 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 551 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 552 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 553 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 554 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 555 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 556 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | modern::elementwise_kernel |
| 557 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 558 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 559 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | krn_partials_per_segment |
| 560 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 561 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
| 562 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | krn_partial_segment_offset |
| 563 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | compute_grad_weight |
| 564 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 565 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | legacy::elementwise_kernel |
| 566 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | legacy::elementwise_kernel |
| 567 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 568 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 569 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 570 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 571 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 572 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 573 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 574 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 575 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 576 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 577 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | modern::elementwise_kernel |
| 578 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 579 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 580 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | krn_partials_per_segment |
| 581 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 582 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
| 583 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | krn_partial_segment_offset |
| 584 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | compute_grad_weight |
| 585 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 586 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | legacy::elementwise_kernel |
| 587 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | legacy::elementwise_kernel |
| 588 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 589 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 590 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 591 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 592 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 593 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 594 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 595 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 596 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 597 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 598 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | modern::elementwise_kernel |
| 599 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 600 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 601 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | krn_partials_per_segment |
| 602 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 603 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
| 604 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | krn_partial_segment_offset |
| 605 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | compute_grad_weight |
| 606 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 607 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | legacy::elementwise_kernel |
| 608 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | legacy::elementwise_kernel |
| 609 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 610 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 611 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 612 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 613 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 614 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 615 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 616 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 617 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 618 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 619 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | modern::elementwise_kernel |
| 620 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 621 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 622 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | krn_partials_per_segment |
| 623 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 624 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
| 625 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | krn_partial_segment_offset |
| 626 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | compute_grad_weight |
| 627 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 628 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | legacy::elementwise_kernel |
| 629 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | legacy::elementwise_kernel |
| 630 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 631 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 632 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 633 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 634 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 635 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 636 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 637 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 638 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 639 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 640 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | modern::elementwise_kernel |
| 641 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 642 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 643 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | krn_partials_per_segment |
| 644 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 645 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
| 646 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | krn_partial_segment_offset |
| 647 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | compute_grad_weight |
| 648 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 649 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | legacy::elementwise_kernel |
| 650 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | legacy::elementwise_kernel |
| 651 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 652 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 653 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 654 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 655 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 656 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 657 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 658 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 659 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 660 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 661 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | modern::elementwise_kernel |
| 662 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 663 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 664 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | krn_partials_per_segment |
| 665 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 666 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
| 667 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | krn_partial_segment_offset |
| 668 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | compute_grad_weight |
| 669 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 670 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | legacy::elementwise_kernel |
| 671 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | legacy::elementwise_kernel |
| 672 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 673 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 674 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 675 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 676 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 677 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 678 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 679 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 680 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 681 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 682 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | modern::elementwise_kernel |
| 683 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 684 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 685 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | krn_partials_per_segment |
| 686 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 687 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
| 688 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | krn_partial_segment_offset |
| 689 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | compute_grad_weight |
| 690 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | sum_and_scatter |
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 691 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | legacy::elementwise_kernel |
| 692 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | legacy::elementwise_kernel |
| 693 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 694 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 695 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 696 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 697 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 698 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 699 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 700 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 701 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 702 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 703 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | modern::elementwise_kernel |
| 704 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 705 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 706 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | krn_partials_per_segment |
| 707 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 708 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
| 709 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | krn_partial_segment_offset |
| 710 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | compute_grad_weight |
| 711 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | sum_and_scatter |
Kernels 712 through 726 correspond to the backward propagation through the Bottom MLP. Kernel 712 is a `to` operation (a tensor conversion) that is not attributed to any layer. Kernel 713 performs the backward propagation through the ReLU layer, and kernels 714-716 calculate the data gradient and the weight gradient of the last Linear layer. Kernel 717 calculates the bias gradient. Likewise, kernels 718-722 and 723-726 correspond to the backward propagation through the second and first Linear layers of the Bottom MLP, respectively.
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 712 | bprop | - | to | na=na | legacy::elementwise_kernel |
| 713 | bprop | Bottom_MLP | relu | T=(32768,128) | modern::elementwise_kernel |
| 714 | bprop | Bottom_MLP | linear | M=256,N=32768,K=128 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
| 715 | bprop | Bottom_MLP | linear | M=256,N=128,K=32768 | turing_fp16_s1688gemm_fp16_256x128_ldg8_f2f_nt |
| 716 | bprop | Bottom_MLP | linear | X=(32768,256),W=(128,256) | splitKreduce_kernel |
| 717 | bprop | - | sum | na=na | reduce_kernel |
| 718 | bprop | Bottom_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
| 719 | bprop | Bottom_MLP | linear | M=512,N=32768,K=256 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
| 720 | bprop | Bottom_MLP | linear | M=512,N=256,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
| 721 | bprop | Bottom_MLP | linear | X=(32768,512),W=(256,512) | splitKreduce_kernel |
| 722 | bprop | - | sum | na=na | reduce_kernel |
| 723 | bprop | Bottom_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
| 724 | bprop | Bottom_MLP | linear | M=13,N=32768,K=512 | volta_sgemm_fp16_32x32_sliced1x4_nt |
| 725 | bprop | Bottom_MLP | linear | M=13,N=512,K=32768 | splitKreduce_kernel |
| 726 | bprop | - | sum | na=na | reduce_kernel |
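
To make the grouping of kernels above concrete, here is a minimal NumPy sketch of the backward pass through one Linear + ReLU pair. It is only an illustration, not the code PyTorch or cuBLAS actually runs: the function name `linear_relu_backward` and the small shapes are hypothetical stand-ins for the profiled layer with X=(32768,256) and W=(128,256). It shows why each layer needs a ReLU backward step, one matrix multiply for the data gradient, one for the weight gradient, and a reduction for the bias gradient.

```python
import numpy as np


def linear_relu_backward(x, w, y, grad_y):
    """Backward pass of y = relu(x @ w.T + b) for one Linear + ReLU pair.

    x: (batch, in_features), w: (out_features, in_features),
    y: (batch, out_features) forward output, grad_y: gradient w.r.t. y.
    """
    # ReLU backward: pass the gradient only where the forward output was positive.
    grad_z = grad_y * (y > 0)
    # Data gradient: (batch, out) @ (out, in) -> (batch, in).
    grad_x = grad_z @ w
    # Weight gradient: (out, batch) @ (batch, in) -> (out, in).
    grad_w = grad_z.T @ x
    # Bias gradient: reduce over the batch dimension.
    grad_b = grad_z.sum(axis=0)
    return grad_x, grad_w, grad_b


# Small stand-in shapes; the profiled layer uses batch=32768, in=256, out=128.
rng = np.random.default_rng(0)
batch, in_features, out_features = 8, 6, 4
x = rng.standard_normal((batch, in_features))
w = rng.standard_normal((out_features, in_features))
b = rng.standard_normal(out_features)
y = np.maximum(x @ w.T + b, 0.0)
grad_y = rng.standard_normal(y.shape)

grad_x, grad_w, grad_b = linear_relu_backward(x, w, y, grad_y)
print(grad_x.shape, grad_w.shape, grad_b.shape)
```

With the profiled shapes, the data gradient reduces over the 128 output features (K=128), while the weight gradient reduces over the whole batch (K=32768). A reduction dimension that large is a plausible reason for the weight-gradient GEMM to be followed by a `splitKreduce_kernel`.
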
At the end of backward propagation, we have all the weight gradients. Kernels 727 through 788 correspond to reversing the loss scaling and applying the SGD optimizer to all parameters.
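
As a rough illustration of what "reversing the loss scaling and applying SGD" means, here is a minimal NumPy sketch. It is not the multi-tensor implementation that produces the kernels listed here: the function name `unscale_and_sgd_step`, the learning rate, the loss scale, and the toy tensors are all hypothetical. It also assumes plain SGD without momentum or weight decay.

```python
import numpy as np


def unscale_and_sgd_step(params, scaled_grads, loss_scale, lr):
    """Unscale gradients, skip the step on overflow, otherwise apply plain SGD."""
    inv_scale = 1.0 / loss_scale
    # Undo the loss scaling in FP32 so small FP16 gradients keep their precision.
    grads = [g.astype(np.float32) * inv_scale for g in scaled_grads]
    # An inf/NaN anywhere means the scaled FP16 gradients overflowed.
    overflow = any(not np.all(np.isfinite(g)) for g in grads)
    if overflow:
        return params, True
    # Plain SGD: p <- p - lr * g, applied to every parameter tensor.
    new_params = [p - lr * g for p, g in zip(params, grads)]
    return new_params, False


# Hypothetical toy parameters standing in for the MLP weights and embedding tables.
params = [np.ones((2, 3), dtype=np.float32), np.ones(3, dtype=np.float32)]
loss_scale = 1024.0
true_grads = [np.full((2, 3), 0.5, dtype=np.float32), np.full(3, 0.25, dtype=np.float32)]
scaled_grads = [(g * loss_scale).astype(np.float16) for g in true_grads]

new_params, skipped = unscale_and_sgd_step(params, scaled_grads, loss_scale, lr=0.1)
print(skipped, new_params[0][0, 0], new_params[1][0])
```

The sketch loops over the tensors one at a time in Python. The profiled code instead passes long lists of tensors to a single `multi_tensor_apply_kernel` launch, which is why each row below carries so many shapes in its Params column.
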
| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 727 | bprop | - | zero_ | T=[(1,)] | modern::elementwise_kernel |
| 728 | bprop | - | multi_tensor_scale | T=[(512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1), (512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1,)] | multi_tensor_apply_kernel |
| 729 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 730 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 731 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 732 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 733 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 734 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 735 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 736 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 737 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 738 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 739 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 740 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 741 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 742 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 743 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 744 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
| 745 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
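
Each `multi_tensor_axpby` row above lists 78 shapes, which is the 26 embedding-table shapes repeated three times. A plausible reading, assuming the usual axpby convention of two inputs and one output (`out = a*x + b*y`), is that the three groups are the `x`, `y` and `out` tensor lists. The sketch below is a pure-Python reference and not the fused kernel itself; the helper names `split_param_groups` and `multi_tensor_axpby_reference` are hypothetical and exist only for illustration.

```python
def split_param_groups(shapes, num_lists):
    """Split the flat shape list from a Params column into equal-sized groups."""
    assert len(shapes) % num_lists == 0
    n = len(shapes) // num_lists
    return [shapes[i * n:(i + 1) * n] for i in range(num_lists)]


def multi_tensor_axpby_reference(a, b, xs, ys):
    """Reference semantics (assumed): out[i] = a * xs[i] + b * ys[i], elementwise.

    xs and ys are lists of flat Python lists standing in for tensors.
    """
    outs = []
    for x, y in zip(xs, ys):
        assert len(x) == len(y)
        outs.append([a * xv + b * yv for xv, yv in zip(x, y)])
    return outs


# Shapes as they appear in the rows above: 26 embedding tables, listed three times.
embedding_shapes = [(100000 + 100 * i, 128) for i in range(26)]
params = embedding_shapes * 3

# Assumption: the three groups are the x, y and out tensor lists.
x_shapes, y_shapes, out_shapes = split_param_groups(params, 3)
```

The same split with `num_lists=2` applies to a list made of one input group and one output group.
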

| Idx | Direction | Layer | Op | Params | GPU Kernel |
|---|---|---|---|---|---|
| 746 | bprop | - | add_ | T=[(100000,128), (100000,128)] | modern::elementwise_kernel |
| 747 | bprop | - | add_ | T=[(100100,128), (100100,128)] | modern::elementwise_kernel |
| 748 | bprop | - | add_ | T=[(100200,128), (100200,128)] | modern::elementwise_kernel |
| 749 | bprop | - | add_ | T=[(100300,128), (100300,128)] | modern::elementwise_kernel |
| 750 | bprop | - | add_ | T=[(100400,128), (100400,128)] | modern::elementwise_kernel |
| 751 | bprop | - | add_ | T=[(100500,128), (100500,128)] | modern::elementwise_kernel |
| 752 | bprop | - | add_ | T=[(100600,128), (100600,128)] | modern::elementwise_kernel |
| 753 | bprop | - | add_ | T=[(100700,128), (100700,128)] | modern::elementwise_kernel |
| 754 | bprop | - | add_ | T=[(100800,128), (100800,128)] | modern::elementwise_kernel |
| 755 | bprop | - | add_ | T=[(100900,128), (100900,128)] | modern::elementwise_kernel |
| 756 | bprop | - | add_ | T=[(101000,128), (101000,128)] | modern::elementwise_kernel |
| 757 | bprop | - | add_ | T=[(101100,128), (101100,128)] | modern::elementwise_kernel |
| 758 | bprop | - | add_ | T=[(101200,128), (101200,128)] | modern::elementwise_kernel |
| 759 | bprop | - | add_ | T=[(101300,128), (101300,128)] | modern::elementwise_kernel |
| 760 | bprop | - | add_ | T=[(101400,128), (101400,128)] | modern::elementwise_kernel |
| 761 | bprop | - | add_ | T=[(101500,128), (101500,128)] | modern::elementwise_kernel |
| 762 | bprop | - | add_ | T=[(101600,128), (101600,128)] | modern::elementwise_kernel |
| 763 | bprop | - | add_ | T=[(101700,128), (101700,128)] | modern::elementwise_kernel |
| 764 | bprop | - | add_ | T=[(101800,128), (101800,128)] | modern::elementwise_kernel |
| 765 | bprop | - | add_ | T=[(101900,128), (101900,128)] | modern::elementwise_kernel |
| 766 | bprop | - | add_ | T=[(102000,128), (102000,128)] | modern::elementwise_kernel |
| 767 | bprop | - | add_ | T=[(102100,128), (102100,128)] | modern::elementwise_kernel |
| 768 | bprop | - | add_ | T=[(102200,128), (102200,128)] | modern::elementwise_kernel |
| 769 | bprop | - | add_ | T=[(102300,128), (102300,128)] | modern::elementwise_kernel |
| 770 | bprop | - | add_ | T=[(102400,128), (102400,128)] | modern::elementwise_kernel |
| 771 | bprop | - | add_ | T=[(102500,128), (102500,128)] | modern::elementwise_kernel |
| 772 | bprop | - | add_ | T=[(512,13), (512,13)] | modern::elementwise_kernel |
| 773 | bprop | - | add_ | T=[(512,), (512,)] | modern::elementwise_kernel |
| 774 | bprop | - | add_ | T=[(256,512), (256,512)] | modern::elementwise_kernel |
| 775 | bprop | - | add_ | T=[(256,), (256,)] | modern::elementwise_kernel |
| 776 | bprop | - | add_ | T=[(128,256), (128,256)] | modern::elementwise_kernel |
| 777 | bprop | - | add_ | T=[(128,), (128,)] | modern::elementwise_kernel |
| 778 | bprop | - | add_ | T=[(1024,480), (1024,480)] | modern::elementwise_kernel |
| 779 | bprop | - | add_ | T=[(1024,), (1024,)] | modern::elementwise_kernel |
| 780 | bprop | - | add_ | T=[(1024,1024), (1024,1024)] | modern::elementwise_kernel |
| 781 | bprop | - | add_ | T=[(1024,), (1024,)] | modern::elementwise_kernel |
| 782 | bprop | - | add_ | T=[(512,1024), (512,1024)] | modern::elementwise_kernel |
| 783 | bprop | - | add_ | T=[(512,), (512,)] | modern::elementwise_kernel |
| 784 | bprop | - | add_ | T=[(256,512), (256,512)] | modern::elementwise_kernel |
| 785 | bprop | - | add_ | T=[(256,), (256,)] | modern::elementwise_kernel |
| 786 | bprop | - | add_ | T=[(1,256), (1,256)] | modern::elementwise_kernel |
| 787 | bprop | - | add_ | T=[(1,), (1,)] | modern::elementwise_kernel |
| 788 | bprop | - | multi_tensor_scale | T=[(512,13), (512,), (256,512), (256,), (128,256), (128,), (1024,480), (1024,), (1024,1024), (1024,), (512,1024), (512,), (256,512), (256,), (1,256), (1,), (512,13), (512,), (256,512), (256,), (128,256), (128,), (1024,480), (1024,), (1024,1024), (1024,), (512,1024), (512,), (256,512), (256,), (1,256), (1,)] | multi_tensor_apply_kernel |