Now that we’ve introduced multiple PEs into our program, instead of duplicating our GEMV problem between them, let’s actually distribute the work for computing a single GEMV.Documentation Index
Fetch the complete documentation index at: https://sdk.cerebras.ai/llms.txt
Use this file to discover all available pages before exploring further.
Learning objectives
After completing this tutorial, you should know how to:- Use fabric DSDs
fabout_dsdandfabin_dsdto send and receive data between PEs - Utilize asynchronous builtin operations on fabric DSDs
- Define a local task which is activated by a
local_task_id
Example overview
Our program will run on two processing elements (PE). We will demonstrate the program with a simulated fabric consisting of a 9 x 3 block of PEs. The program will first copyb into the left PE’s y array.
Then, it will copy the left half of A’s columns into the left PE,
and the right half of A’s columns into the right PE.
Similarly, it will copy the the first N/2 elements of x
into the left PE, and the last N/2 elements of x into the right PE.
Each PE will then compute A*x for its local pieces of A and x.
Thus, both PEs perform a matrix-vector product for an M x N/2 matrix.
The PEs will increment their local y arrays by this result.
The left PE then sends its y array to the right PE, and the right PE
increments its local y array by the received values.
Because the left y array contained the contribution from b,
the final summed y on the right PE is our GEMV result.
The host then copies y off of the right PE.
Problem Steps
Visually, this program consists of the following steps: 1. Host copies b into y array of left PE.






Writing the CSL
What do we need to modify in our layout to distribute our GEMV between two PEs?- We need to define several new parameters for our two PE programs.
This includes a
pe_id, used to differentiate between the left and right PEs, and a color, which will be used to route data between the PEs. - We need to set the color configuration on both PEs for the color
that will be used to send the left PE’s
yarray to the right PE.
layout.csl, included below.
@set_tile_code calls, one for the left PE (0, 0),
and one for the right PE (1, 0).
Both PEs take a new parameter, N_per_PE, equal to N / 2.
This is the number of columns of A that each PE will receive
and operate on.
Both PEs also receive as a parameter a pe_id: the left PE
has pe_id 0, and the right PE has pe_id 1.
We’ll see in pe_program.csl how we use pe_id to parameterize
the behavior of the program.
We also have two @set_color_config calls, to set the configuration
of send_color on each PE:
RAMP, NORTH, SOUTH, EAST, WEST.
The cardinal directions refer to the routers of neighboring PEs:
NORTH is the PE directly above our PE, and so on.
RAMP refers to the connection between our PE’s router and its compute element (CE).
When setting a route for a color on a given PE, the receive rx and transmit tx fields
are from the perspective of the router.
Thus, receiving form the RAMP means that our compute element is sending data up to the
fabric, where it can then be transmitted across the fabric.
For the left PE (0, 0), send_color will send up the PE’s RAMP to the fabric, and then
transmit data to the EAST.
For the right PE (1, 0), send_color will receive data from the WEST on the fabric
(i.e., from the left PE), and then transmit it down the RAMP to its compute element.
Now let’s take a look at our new pe_program.csl, included below.
N_per_PE, pe_id, and send_color,
we also introduce exit_task_id, our first value of type local_task_id.
We’ll talk about its use a bit later.
The A array now has size M*N_per_PE instead of M*N, since each PE
only stores half the columns.
To make our data transfer easier, we also now store A column-major instead
of row-major.
Notice that A_dsd now accesses M contiguous elements, instead of M
elements strided by the row size, since we now store column-major.
Our gemv function operates almost identically to before, except we only loop
over N_per_PE columns instead of N columns.
Since A is now column-major, @increment_dsd_offset must increment
by the length of an entire column instead of by one element.
Note that on the left PE, y already contains the elements of b before
gemv executes.
Fabric DSDs and async operations
Thecompute function, which is called from the host, first calls gemv
to compute the local contribution to y on each PE.
Then, the left PE calls send_right, while the right PE calls recv_left.
send_left defines a fabout_dsd, which is used to send wavelets to the
fabric along the color send_color.
Note that we give this fabout_dsd the extent M, since we intend to send
the M elements of y along the fabric.
The @fmovs operation copies the M elements accessed by y_dsd
into out_dsd.
The .async = true field makes this operation asynchronous.
The .activate field specifies a local_task_id to activate when this
operation completes.
When this operation completes, exit_task_id will be activated.
recv_right defines a fabin_dsd to receive the wavelets sent along
send_color.
The @fadds operation here increments the right PE’s y_dsd by the
elements received in in_dsd.
Thus, after this operation, y_dsd contains our final GEMV result.
This builtin also executes asynchronously, and actives exit_task_id
when complete.
Tasks and activatable task IDs
Now, what does activatingexit_task_id do?
In the comptime block, the @bind_local_task builtin binds exit_task_id
to the task exit_task.
When exit_task_id is activated, exit_task, which unblocks the memcpy
command stream, executes.
This task must execute on both PEs before control is returned to the host.
Writing the host code
Our new host code must:- Copy
binto the left PE’syarray - Copy the left halves of
Aandxto the left PE, and the right halves to the right PE - After the device kernel completes, copy
yback from the right PE
run.py below.
Copying b into y of left PE
We copy b into y of the left PE here:
memcpy call.
Copying A and x
We copy A and x to the device as follows:
memcpy calls copy data into
both the left and right PE.
Because we now store A column-major on the PEs, we transpose our A
matrix, and then flatten it to a 1D array with ravel().
Each PE gets M*N_per_PE elements, so each PE gets N_per_PE columms
of A.
Similarly, each PE gets N_per_PE elements of x.
Copying back result
We copy backy from the right PE as follows:
memcpy call copies back the M elements of y
only from the right PE.
Once this call is complete, we then, as in our previous tutorials,
check that the received result is correct.
Compiling and running the program
Since this program only uses two PEs, we adjust our simulated fabric dimensions accordingly:SUCCESS! message at the end of execution.

