> ## Documentation Index
> Fetch the complete documentation index at: https://sdk.cerebras.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# GEMV Tutorial 6: Routes and Fabric DSDs

> Use fabric DSDs and routes to distribute a GEMV computation across multiple PEs and pass data between them via the WSE fabric.

Now that we’ve introduced multiple PEs into our program,
instead of duplicating our GEMV problem between them,
let’s actually distribute the work for computing a single GEMV.

## Learning Objectives

After completing this tutorial, you should know how to:

* Use fabric DSDs `fabout_dsd` and `fabin_dsd` to send and receive data between PEs
* Utilize asynchronous builtin operations on fabric DSDs
* Define a local task which is activated by a `local_task_id`

## Example Overview

Our program will run on two processing elements (PE).

We will demonstrate the program with a simulated fabric
consisting of a 9 x 3 block of PEs.

The program will first copy `b` into the left PE’s `y` array.
Then, it will copy the left half of `A`’s columns into the left PE,
and the right half of `A`’s columns into the right PE.
Similarly, it will copy the first `N/2` elements of `x`
into the left PE, and the last `N/2` elements of `x` into the right PE.

Each PE will then compute `A*x` for its local pieces of `A` and `x`.
Thus, both PEs perform a matrix-vector product for an `M x N/2` matrix.
The PEs will increment their local `y` arrays by this result.

The left PE then sends its `y` array to the right PE, and the right PE
increments its local `y` array by the received values.
Because the left `y` array contained the contribution from `b`,
the final summed `y` on the right PE is our GEMV result.

The host then copies `y` off of the right PE.

### Problem Steps

Visually, this program consists of the following steps:

**1. Host copies b into y array of left PE.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_1.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=78d6068eb855ba5c596b0de427632d06" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_1.png" />
</Frame>

**2. Host copies left N/2 columns of A to left PE, right N/2 columns to right PE.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_2.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=a4fc7f5c6bddadaffafe44c9f26d480f" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_2.png" />
</Frame>

**3. Host copies first N/2 elements of x to left PE, last N/2 elements to right PE.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_3.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=ea29265793d15529aff40452ecb372e0" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_3.png" />
</Frame>

**4. Host launches function to compute GEMV.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_4.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=f91ebd02241d3489a15d43377cf1f9a6" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_4.png" />
</Frame>

**5. Each PE increments local y by local portion of matrix-vector product Ax.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_5.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=8971a8b09587ba6e5e70aaa54af9b612" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_5.png" />
</Frame>

**6. Left PE sends local y to right PE, and right PE increments y by received values.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_6.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=660091c159cf2efbd7dbb538229a49b9" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_6.png" />
</Frame>

**7. Right PE now contains final result y. Host copies back y from right PE.**

<Frame>
  <img src="https://mintcdn.com/cluster-docs/elz9acWFq5HQwBOV/images/tutorial_gemv_6_7.png?fit=max&auto=format&n=elz9acWFq5HQwBOV&q=85&s=7b5da6e68c84fb652e5cf9400974db97" alt="" width="4230" height="1269" data-path="images/tutorial_gemv_6_7.png" />
</Frame>

## Writing the CSL

What do we need to modify in our layout to distribute our GEMV
between two PEs?

1. We need to define several new parameters for our two PE programs.
   This includes a `pe_id`, used to differentiate between the
   left and right PEs, and a color, which will be used to route
   data between the PEs.
2. We need to set the color configuration on both PEs for the color
   that will be used to send the left PE’s `y` array to the
   right PE.

Let’s take a look at our new `layout.csl`, included below.

```csl theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
// matrix dimensions on each PE
param M: i16;
param N: i16;

// Colors
const send_color: color = @get_color(0); // Color used to send/recv data between PEs

// This example only uses 2 PEs
const memcpy = @import_module("<memcpy/get_params>", .{
  .width = 2,
  .height = 1,
});

layout {
  // PE coordinates are (column, row)
  @set_rectangle(2, 1);

  // Left PE (0, 0)
  @set_tile_code(0, 0, "pe_program.csl", .{
    .memcpy_params = memcpy.get_params(0),
    .M = M,
    .N_per_PE = N / 2,
    .pe_id = 0,
    .send_color = send_color
  });

  // Left PE sends its result to the right
  @set_color_config(0, 0, send_color, .{.routes = .{ .rx = .{RAMP}, .tx = .{EAST} }});

  // Right PE (1, 0)
  @set_tile_code(1, 0, "pe_program.csl", .{
    .memcpy_params = memcpy.get_params(1),
    .M = M,
    .N_per_PE = N / 2,
    .pe_id = 1,
    .send_color = send_color
  });

  // Right PE receives result of left PE
  @set_color_config(1, 0, send_color, .{.routes = .{ .rx = .{WEST}, .tx = .{RAMP} }});

  // export symbol names
  @export_name("A", [*]f32, true);
  @export_name("x", [*]f32, true);
  @export_name("y", [*]f32, true);
  @export_name("compute", fn()void);
}
```

We have two `@set_tile_code` calls, one for the left PE (0, 0),
and one for the right PE (1, 0).
Both PEs take a new parameter, `N_per_PE`, equal to `N / 2`.
This is the number of columns of `A` that each PE will receive
and operate on.
Both PEs also receive as a parameter a `pe_id`: the left PE
has `pe_id` 0, and the right PE has `pe_id` 1.
We’ll see in `pe_program.csl` how we use `pe_id` to parameterize
the behavior of the program.

We also have two `@set_color_config` calls, to set the configuration
of `send_color` on each PE:

```csl theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
@set_color_config(0, 0, send_color, .{.routes = .{ .rx = .{RAMP}, .tx = .{EAST} }});
...
@set_color_config(1, 0, send_color, .{.routes = .{ .rx = .{WEST}, .tx = .{RAMP} }});
```

The router of each PE has five directions: `RAMP`, `NORTH`, `SOUTH`, `EAST`, `WEST`.
The cardinal directions refer to the routers of neighboring PEs:
`NORTH` is the PE directly above our PE, and so on.
`RAMP` refers to the connection between our PE’s router and its compute element (CE).
When setting a route for a color on a given PE, the receive `rx` and transmit `tx` fields
are from the perspective of the router.
Thus, receiving form the `RAMP` means that our compute element is sending data up to the
fabric, where it can then be transmitted across the fabric.

For the left PE (0, 0), `send_color` will send up the PE’s `RAMP` to the fabric, and then
transmit data to the `EAST`.
For the right PE (1, 0), `send_color` will receive data from the `WEST` on the fabric
(i.e., from the left PE), and then transmit it down the `RAMP` to its compute element.

Now let’s take a look at our new `pe_program.csl`, included below.

```csl theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
param memcpy_params;

// Matrix dimensions
param M: i16;
param N_per_PE: i16;

// ID of PE (0 is left, 1 is right)
param pe_id: i16;

// Colors
param send_color: color; // Color used to send/recv data between PEs

// Queue IDs
const send_color_oq = @get_output_queue(2);
const send_color_iq = @get_input_queue(2);

// Task ID used by a local task to unblock cmd stream
const exit_task_id: local_task_id = @get_local_task_id(9);

// memcpy module provides infrastructure for copying data
// and launching functions from the host
const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);

// 48 kB of global memory contain A, x, y
var A: [M*N_per_PE]f32; // A is stored column major
var x: [N_per_PE]f32;
var y: [M]f32;

// DSDs for accessing A, b, y
// A_dsd accesses column of A
var A_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &A, .extent = M });
var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });

// ptrs to A, x, b, y will be advertised as symbols to host
var A_ptr: [*]f32 = &A;
var x_ptr: [*]f32 = &x;
var y_ptr: [*]f32 = &y;

// Compute gemv
fn gemv() void {
  // Loop over all columns of A
  for (@range(i16, N_per_PE)) |i| {
    // Calculate contribution to A*x from ith column of A, ith elem of x
    @fmacs(y_dsd, y_dsd, A_dsd, x[i]);
    // Move A_dsd to next column of A
    A_dsd = @increment_dsd_offset(A_dsd, M, f32);
  }
}

fn send_right() void {
  const out_dsd = @get_dsd(fabout_dsd, if (@is_arch("wse3")) .{
                    .extent = M, .output_queue = send_color_oq
                  } else .{
                    .fabric_color = send_color, .extent = M,
                    .output_queue = send_color_oq
                  });
  // After fmovs is done, activate exit_task to unblock cmd_stream
  @fmovs(out_dsd, y_dsd, .{ .async = true, .activate = exit_task_id });
}

fn recv_left() void {
  const in_dsd = @get_dsd(fabin_dsd, .{
                   .extent = M,
                   .input_queue = send_color_iq
                 });
  // After fadds is done, activate exit_task to unblock cmd stream
  @fadds(y_dsd, y_dsd, in_dsd, .{ .async = true, .activate = exit_task_id });
}

// Call gemv function and send/ receive partial result y
fn compute() void {
  gemv();
  if (pe_id == 0) {
    send_right();
  } else {
    recv_left();
  }
}

task exit_task() void {
  sys_mod.unblock_cmd_stream();
}

comptime {
  // When exit_task_id is activated, exit_task will execute
  @bind_local_task(exit_task, exit_task_id);

  // On WSE-3, we must explicitly bind the output queue to a color.
  // Input queues must be bound to a color on both architectures.
  @initialize_queue(send_color_oq, if (@is_arch("wse3")) .{ .color = send_color } else .{});
  @initialize_queue(send_color_iq, .{ .color = send_color });

  @export_symbol(A_ptr, "A");
  @export_symbol(x_ptr, "x");
  @export_symbol(y_ptr, "y");
  @export_symbol(compute);
}
```

In addition to our new parameters `N_per_PE`, `pe_id`, and `send_color`,
we also introduce `exit_task_id`, our first value of type `local_task_id`.
We’ll talk about its use a bit later.

The `A` array now has size `M*N_per_PE` instead of `M*N`, since each PE
only stores half the columns.
To make our data transfer easier, we also now store `A` column-major instead
of row-major.
Notice that `A_dsd` now accesses `M` contiguous elements, instead of `M`
elements strided by the row size, since we now store column-major.

Our `gemv` function operates almost identically to before, except we only loop
over `N_per_PE` columns instead of `N` columns.
Since `A` is now column-major, `@increment_dsd_offset` must increment
by the length of an entire column instead of by one element.
Note that on the left PE, `y` already contains the elements of `b` before
`gemv` executes.

### Fabric DSDs and Async Operations

The `compute` function, which is called from the host, first calls `gemv`
to compute the local contribution to `y` on each PE.
Then, the left PE calls `send_right`, while the right PE calls `recv_left`.

`send_right` defines a `fabout_dsd`, which is used to send wavelets to the
fabric along the color `send_color`.
Note that we give this `fabout_dsd` the extent `M`, since we intend to send
the `M` elements of `y` along the fabric.
On WSE-3, a fabric DSD is bound to a color indirectly via its queue, so
`.fabric_color` is omitted from the WSE-3 form; on WSE-2 the color is
specified directly on the DSD.
The `@fmovs` operation copies the `M` elements accessed by `y_dsd`
into `out_dsd`.
The `.async = true` field makes this operation asynchronous.
The `.activate` field specifies a `local_task_id` to activate when this
operation completes.
When this operation completes, `exit_task_id` will be activated.

`recv_left` defines a `fabin_dsd` to receive the wavelets sent along
`send_color`.
The `@fadds` operation here increments the right PE’s `y_dsd` by the
elements received in `in_dsd`.
Thus, after this operation, `y_dsd` contains our final GEMV result.
This builtin also executes asynchronously, and actives `exit_task_id`
when complete.

<Warning>
  **Warning** <br />
  Whenever using fabric DSDs in builtin operations, always make these operations
  execute asynchronously.
  Using fabric DSDs synchronously can result in poor performance or deadlocks.
</Warning>

### Tasks and Activatable Task IDs

Now, what does activating `exit_task_id` do?
In the comptime block, the `@bind_local_task` builtin binds `exit_task_id`
to the task `exit_task`.
When `exit_task_id` is activated, `exit_task`, which unblocks the `memcpy`
command stream, executes.
This task must execute on both PEs before control is returned to the host.

## Writing the Host Code

Our new host code must:

1. Copy `b` into the left PE’s `y` array
2. Copy the left halves of `A` and `x` to the left PE, and the right halves to the right PE
3. After the device kernel completes, copy `y` back from the right PE

We explain some features of our new `run.py` below.

```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
#!/usr/bin/env cs_python

import argparse
import json
import numpy as np

from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder # pylint: disable=no-name-in-module

# Read arguments
parser = argparse.ArgumentParser()
parser.add_argument('--name', help="the test compile output dir")
parser.add_argument('--cmaddr', help="IP:port for CS system")
args = parser.parse_args()

# Get matrix dimensions from compile metadata
with open(f"{args.name}/out.json", encoding='utf-8') as json_file:
  compile_data = json.load(json_file)

# Matrix dimensions
N = int(compile_data['params']['N'])
M = int(compile_data['params']['M'])

# Construct A, x, b
A = np.arange(M*N, dtype=np.float32).reshape(M,N)
x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

# Calculate expected y
y_expected = A@x + b

# Size of N dimension on each PE
N_per_PE = N // 2

# Construct a runner using SdkRuntime
runner = SdkRuntime(args.name, cmaddr=args.cmaddr)

# Get symbols for A, x, y on device
A_symbol = runner.get_id('A')
x_symbol = runner.get_id('x')
y_symbol = runner.get_id('y')

# Load and run the program
runner.load()
runner.run()

# Copy b into y of PE (0, 0)
runner.memcpy_h2d(y_symbol, b, 0, 0, 1, 1, M, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# Copy A in column major format
# PE (0, 0) gets first N/2 columns; PE (1, 0) gets last N/2 columns
runner.memcpy_h2d(A_symbol, A.transpose().ravel(), 0, 0, 2, 1, M*N_per_PE, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# PE (0, 0) gets first N/2 elements; PE (1, 0) gets last N/2 elements
runner.memcpy_h2d(x_symbol, x, 0, 0, 2, 1, N_per_PE, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# Launch the compute function on device
runner.launch('compute', nonblock=False)

# Copy y back from PE (1, 0)
y_result = np.zeros([M], dtype=np.float32)
runner.memcpy_d2h(y_result, y_symbol, 1, 0, 1, 1, M, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# Stop the program
runner.stop()

# Ensure that the result matches our expectation
np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
print("SUCCESS!")
```

### Copying `b` into `y` of left PE

We copy `b` into `y` of the left PE here:

```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
runner.memcpy_h2d(y_symbol, b, 0, 0, 1, 1, M, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)
```

Notice that the ROI is a single PE, located at (0, 0) in the program rectangle.
The right PE (1, 0) is omitted from this `memcpy` call.

### Copying `A` and `x`

We copy `A` and `x` to the device as follows:

```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
runner.memcpy_h2d(A_symbol, A.transpose().ravel(), 0, 0, 2, 1, M*N_per_PE, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

runner.memcpy_h2d(x_symbol, x, 0, 0, 2, 1, N_per_PE, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)
```

Notice that the ROI is now both PEs, so the `memcpy` calls copy data into
both the left and right PE.

Because we now store `A` column-major on the PEs, we transpose our `A`
matrix, and then flatten it to a 1D array with `ravel()`.
Each PE gets `M*N_per_PE` elements, so each PE gets `N_per_PE` columms
of `A`.

Similarly, each PE gets `N_per_PE` elements of `x`.

### Copying Back Result

We copy back `y` from the right PE as follows:

```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
y_result = np.zeros([M], dtype=np.float32)
runner.memcpy_d2h(y_result, y_symbol, 1, 0, 1, 1, M, streaming=False,
  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)
```

Notice that our ROI now begins at (1, 0), and contains a single PE.
Thus, this `memcpy` call copies back the `M` elements of `y`
only from the right PE.

Once this call is complete, we then, as in our previous tutorials,
check that the received result is correct.

## Compiling and Running the Program

Since this program only uses two PEs, we adjust our
simulated fabric dimensions accordingly:

```bash theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
$ cslc layout.csl --fabric-dims=9,3 --fabric-offsets=4,1 --params=M:4,N:6 --memcpy --channels=1 -o out
$ cs_python run.py --name out
```

We use the same command to run.
You should see a `SUCCESS!` message at the end of execution.

## Exercises

Instead of using two PEs along the same row to compute this GEMV,
try using two PEs along the same column.

## Next

Stay tuned for more tutorials!
