# SdkRuntime API Reference

> Use `SdkRuntime` to load and run kernels, copy data between host and device, and manage execution on the Cerebras Wafer-Scale Engine.

## sdkruntimepybind module

Python API for [`SdkRuntime`](#sdkruntime) functions.

### MemcpyDataType

<span id="MemcpyDataType" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">MemcpyDataType</span></>} type="Bases: Enum">
  Specifies the element width for transfers made with [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h) and [`SdkRuntime.memcpy_h2d()`](#memcpy_h2d) in copy mode.

  > **Values**:
  >
  > * MEMCPY\_16BIT
  > * MEMCPY\_32BIT
</ParamField>

### MemcpyOrder

<span id="MemcpyOrder" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">MemcpyOrder</span></>} type="Bases: Enum">
  Specifies how host data is ordered for transfers made with [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h) and [`SdkRuntime.memcpy_h2d()`](#memcpy_h2d) in copy mode.

  > **Values**:
  >
  > * ROW\_MAJOR
  > * COL\_MAJOR
</ParamField>

### SdkCompileArtifacts

<span id="SdkCompileArtifacts" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">SdkCompileArtifacts</span><span className="sig-param">(artifacts_path: str)</span></>} type="Bases: object">
  Specifies compile artifacts for execution.

  <Expandable title="Parameters">
    <ParamField body="artifacts_path (str)">
      Path to compile artifacts.
    </ParamField>
  </Expandable>
</ParamField>

### SdkExecutionPlatform

<span id="SdkExecutionPlatform" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">SdkExecutionPlatform</span></>} type="Bases: object">
  Specifies the simulator or system target and architecture for execution.

  <Expandable title="content">
    <span id="is_simulation" />

    <ParamField body={<><span className="sig-name">is_simulation</span><span className="sig-param">() → bool</span></>}>
      Queries if the execution platform is a simulator.

      > * **Returns**: `True` if the execution platform is a simulator, `False` otherwise.
      > * **Return type**: `bool`
    </ParamField>

    <span id="is_system" />

    <ParamField body={<><span className="sig-name">is_system</span><span className="sig-param">() → bool</span></>}>
      Queries if the execution platform is a real system.

      > * **Returns**: `True` if the execution platform is a real system, `False` otherwise.
      > * **Return type**: `bool`
    </ParamField>
  </Expandable>
</ParamField>

### SdkRuntime

<span id="SdkRuntime" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">SdkRuntime</span><span className="sig-param">(bindir: Union[pathlib.Path, str], **kwargs)</span></>} type="Bases: object">
  Manages the execution of SDK programs on the Cerebras Wafer-Scale Engine
  (WSE) or simfabric. The constructor analyzes the WSE ELFs in the `bindir`
  and prepares the WSE or simfabric for a run.
  WSE runs require the CM IP address and port.

  <Expandable title="content">
    <Expandable title="Parameters">
      <ParamField body="bindir (Union[pathlib.Path, str])">
        Path to the ELF files compiled by `cslc`. The runtime automatically collects the I/O and fabric parameters, including height, width, number of channels, and buffer width.
      </ParamField>
    </Expandable>

    <Expandable title="Keyword Arguments">
      <ParamField body="cmaddr (str)">
        `'IP_ADDRESS:PORT'` string of CM. Omit this `kwarg` to run on simfabric.
      </ParamField>

      <ParamField body="suppress_simfab_trace (bool)">
        If `True`, suppresses generation of `simfab_traces` when running. Default value is `False`, i.e., `simfab_traces` are produced.
        Note that producing `simfab_traces` can greatly slow down the wall clock
        time of a simulator run. If you are not using the SDK GUI with the output
        of your run, consider setting this value to `True`.
      </ParamField>

      <ParamField body="simfab_numthreads (int)">
        Number of threads to use if running on simfabric. Maximum value is `64`. Default value is `5`, i.e., the simulator uses 5 threads.
      </ParamField>

      <ParamField body="msg_level (str)">
        Message logging output level. Available output levels are `DEBUG`, `INFO`, `WARNING`, and `ERROR`. Default value is `WARNING`.
      </ParamField>

      <ParamField body="memcpy_required (bool)">
        Whether the program uses [`memcpy_h2d`](#memcpy_h2d) / [`memcpy_d2h`](#memcpy_d2h) for host-device data transfer. Default value is `True`. Set to `False` for programs that move data exclusively via [`SdkLayout`](/api-docs/sdklayout-api) streams.
      </ParamField>
    </Expandable>

    **Example**:

    In the following example, an [`SdkRuntime`](#sdkruntime) runner object is instantiated. If `args.cmaddr` is non-empty, then the kernel code will run on the WSE pointed to by that address; otherwise, the kernel code will run on simfabric. The compiled kernel code in the directory `args.name` has exported symbols `A` and `B` pointing to arrays on the device. After loading the code and starting the run with `load()` and `run()`, data on the host stored in `data` is copied to `A` on the device, and then `B` on the device is copied back into `data` on the host.

    ```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
    runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
    symbol_A = runner.get_id("A")
    symbol_B = runner.get_id("B")
    runner.load()
    runner.run()
    runner.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                      streaming=False, data_type=memcpy_dtype,
                      order=memcpy_order, nonblock=False)
    runner.memcpy_d2h(data, symbol_B, px, py, w, h, l,
                      streaming=False, data_type=memcpy_dtype,
                      order=memcpy_order, nonblock=False)
    ```
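
    The `cmaddr` rule above (pass it for WSE runs, omit it for simfabric) can be wrapped in a small helper. This is a minimal sketch, not part of the SDK: `runtime_kwargs` is a hypothetical name, and passing `suppress_simfab_trace` only for simulator runs is an assumption made for illustration.

    ```python
    def runtime_kwargs(cmaddr=None, suppress_trace=True, msg_level="WARNING"):
        """Build keyword arguments for the SdkRuntime constructor (hypothetical helper)."""
        kwargs = {"msg_level": msg_level}
        if cmaddr:
            # WSE run: pass the CM address as an 'IP_ADDRESS:PORT' string.
            kwargs["cmaddr"] = cmaddr
        else:
            # Simulator run: omit cmaddr entirely and optionally skip trace output.
            kwargs["suppress_simfab_trace"] = suppress_trace
        return kwargs
    ```

    A call such as `SdkRuntime(args.name, **runtime_kwargs(args.cmaddr))` would then select the target from the command-line argument.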

    <span id="__init__" />

    <ParamField body={<><span className="sig-name">__init__</span><span className="sig-param">(bindir: Union[pathlib.Path, str], platform: <a href="#sdkexecutionplatform">SdkExecutionPlatform</a>, **kwargs) → None</span></>}>
      Constructor variant that takes a path to compiled ELF files and an execution platform specification. Takes the same keyword arguments as the constructor above.
    </ParamField>

    <span id="__init__-1" />

    <ParamField body={<><span className="sig-name">__init__</span><span className="sig-param">(artifacts: <a href="#sdkcompileartifacts">SdkCompileArtifacts</a>, platform: <a href="#sdkexecutionplatform">SdkExecutionPlatform</a>, **kwargs) → None</span></>}>
      Constructor variant that takes a compile artifacts specification and an execution platform specification. Takes the same keyword arguments as the constructor above.
    </ParamField>

    <span id="coord_logical_to_physical" />

    <ParamField body={<><span className="sig-name">coord_logical_to_physical</span><span className="sig-param">(logical_coords: Tuple[int, int]) → Tuple[int, int]</span></>}>
      Convert a logical coordinate to a physical coordinate. For a program with fabric offsets (`offset_x`, `offset_y`), and program rectangle coordinate (`x`, `y`), this function returns (`offset_x + x`, `offset_y + y`).

      <Expandable title="Parameters">
        <ParamField body="logical_coords (Tuple[int, int])">
          Two-element tuple `(x, y)` of logical coordinates.
        </ParamField>
      </Expandable>

      > * **Returns**: Two-element tuple `(physical_x, physical_y)`.
      > * **Return type**: `Tuple[int, int]`
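
      The conversion described above is plain offset arithmetic. The sketch below reproduces it on the host without a runtime object; `logical_to_physical` and its explicit `offset` argument are illustrative names, since the real method reads the fabric offsets from the compiled program.

      ```python
      from typing import Tuple

      def logical_to_physical(logical: Tuple[int, int],
                              offset: Tuple[int, int]) -> Tuple[int, int]:
          """Add the program's fabric offsets to a logical (x, y) coordinate."""
          x, y = logical
          offset_x, offset_y = offset
          return (offset_x + x, offset_y + y)

      print(logical_to_physical((2, 3), (4, 1)))  # (6, 4)
      ```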
    </ParamField>

    <span id="dump_core" />

    <ParamField body={<><span className="sig-name">dump_core</span><span className="sig-param">(corefile: str)</span></>}>
      Dump the core of a simulator run for debugging with `csdb`. The corefile must be named "corefile.cs1" to be usable with `csdb`. This method can only be called after a blocking [`SdkRuntime`](#sdkruntime) API call or after calling [`SdkRuntime.stop()`](#stop).

      <Expandable title="Parameters">
        <ParamField body="corefile (str)">
          Name of corefile. Must be "corefile.cs1" to use with `csdb`.
        </ParamField>
      </Expandable>
    </ParamField>

    <span id="dump_elf_core" />

    <ParamField body={<><span className="sig-name">dump_elf_core</span><span className="sig-param">(corefile: str)</span></>}>
      Dump an ELF core of a simulator run, to be used for debugging.

      <Expandable title="Parameters">
        <ParamField body="corefile (str)">
          Name of ELF corefile.
        </ParamField>
      </Expandable>
    </ParamField>

    <span id="get_id" />

    <ParamField body={<><span className="sig-name">get_id</span><span className="sig-param">(symbol: str) → int</span></>}>
      Retrieve the integer representation of a symbol exported by the kernel. A symbol can be a data tensor or a host-callable function.

      <Expandable title="Parameters">
        <ParamField body="symbol (str)">
          The exported name of the symbol.
        </ParamField>
      </Expandable>

      > * **Returns**: Integer representation of exported symbol.
      > * **Return type**: `int`
    </ParamField>

    <span id="get_port_id" />

    <ParamField body={<><span className="sig-name">get_port_id</span><span className="sig-param">(port_name: str) → PortId</span></>}>
      Part of the [`SdkRuntime`](#sdkruntime) direct link API.

      Retrieve the integer representation of a program port for streaming data via [`SdkRuntime.send()`](#send) or [`SdkRuntime.receive()`](#receive).

      <Expandable title="Parameters">
        <ParamField body="port_name (str)">
          The name of the port.
        </ParamField>
      </Expandable>

      > * **Returns**: Integer representation of program data port.
      > * **Return type**: `PortId`
    </ParamField>

    <span id="is_task_done" />

    <ParamField body={<><span className="sig-name">is_task_done</span><span className="sig-param">(task_handle: <a href="#task">Task</a>) → bool</span></>}>
      Query if task `task_handle` is complete.

      <Expandable title="Parameters">
        <ParamField body={<>task_handle (<a href="#task">Task</a>)</>}>
          Handle to a task previously launched by [`SdkRuntime`](#sdkruntime).
        </ParamField>
      </Expandable>

      > * **Returns**: `True` if task is done, and `False` otherwise.
      > * **Return type**: `bool`
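
      One way to use `is_task_done` is to poll a nonblocking task with a timeout. The following is a minimal sketch: `wait_for_task`, `timeout_s`, and `poll_s` are hypothetical names, and `runner` is assumed to be any object exposing `is_task_done(task_handle)` as documented here.

      ```python
      import time

      def wait_for_task(runner, task_handle, timeout_s=60.0, poll_s=0.05):
          """Poll until the task is done; return False if the timeout expires first."""
          deadline = time.monotonic() + timeout_s
          while True:
              if runner.is_task_done(task_handle):
                  return True
              if time.monotonic() >= deadline:
                  return False
              # Sleep briefly so the host thread does not spin at full speed.
              time.sleep(poll_s)
      ```

      When no timeout is needed, [`task_wait`](#task_wait) blocks until the task completes.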
    </ParamField>

    <span id="call" />

    <ParamField body={<><span className="sig-name">call</span><span className="sig-param">(symbol: str, args: numpy.ndarray, **kwargs) → <a href="#task">Task</a></span></>}>
      Like [`launch`](#launch), but without type checking on the arguments. The caller is responsible for packing every argument into a single contiguous 1-D `numpy.ndarray` of `numpy.uint32`. Useful when the host already has arguments in a packed `u32` form, or to bypass the per-call type-checking overhead.

      <Expandable title="Parameters">
        <ParamField body="symbol (str)">
          The exported name of the symbol corresponding to a host-callable function.
        </ParamField>

        <ParamField body="args (numpy.ndarray)">
          1-D `uint32` array containing the packed arguments to pass to the function.
        </ParamField>
      </Expandable>

      <Expandable title="Keyword Arguments">
        <ParamField body="nonblock (bool)">
          Nonblocking if `True`, blocking otherwise.
        </ParamField>
      </Expandable>

      > * **Returns**: Handle to the task launched by [`SdkRuntime.call()`](#call).
      > * **Return type**: [`Task`](#task)
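
      Packing arguments for `call` amounts to reinterpreting each 32-bit value as a `uint32` word. The sketch below shows one way to do that with NumPy; `pack_args_u32` is a hypothetical helper, and the layout (one word per argument, in declaration order) is an assumption that should be checked against the kernel's function signature.

      ```python
      import numpy as np

      def pack_args_u32(*scalars):
          """Concatenate 32-bit NumPy scalars into a contiguous 1-D uint32 array."""
          words = []
          for s in scalars:
              a = np.atleast_1d(np.asarray(s))
              if a.dtype.itemsize != 4:
                  raise TypeError(f"expected a 32-bit value, got dtype {a.dtype}")
              # view() reinterprets the bits; it does not convert the value.
              words.append(a.view(np.uint32))
          return np.ascontiguousarray(np.concatenate(words))

      packed = pack_args_u32(np.int32(7), np.float32(1.0))
      ```

      The resulting array could then be passed as `runner.call("fn_foo", packed, nonblock=False)`, with `fn_foo` standing in for an exported host-callable function.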
    </ParamField>

    <span id="launch" />

    <ParamField body={<><span className="sig-name">launch</span><span className="sig-param">(symbol: str, *args, **kwargs) → <a href="#task">Task</a></span></>}>
      Trigger a host-callable function defined in the kernel, with type checking for arguments.

      <Expandable title="Parameters">
        <ParamField body="symbol (str)">
          The exported name of the symbol corresponding to a host-callable function.
        </ParamField>
      </Expandable>

      > * **Positional Arguments**: Matches the arguments of the host-callable function. [`SdkRuntime.launch()`](#launch) will perform type checking on the arguments.

      <Expandable title="Keyword Arguments">
        <ParamField body="nonblock (bool)">
          Nonblocking if `True`, blocking otherwise.
        </ParamField>
      </Expandable>

      > * **Returns**: Handle to the task launched by [`SdkRuntime.launch()`](#launch).
      > * **Return type**: [`Task`](#task)

      **Example**:

      Consider a kernel which defines a host-callable function `fn_foo` by:

      ```csl theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
      comptime {
        @export_symbol(fn_foo);
      }
      ```

      The host calls `fn_foo` by `runner.launch("fn_foo", nonblock=False)`.
    </ParamField>

    <span id="load" />

    <ParamField body={<><span className="sig-name">load</span><span className="sig-param">()</span></>}>
      Load the binaries to simfabric or WSE. It may take 80+ seconds to load the binaries onto the WSE.
    </ParamField>

    <span id="memcpy_d2h" />

    <ParamField body={<><span className="sig-name">memcpy_d2h</span><span className="sig-param">(dest: numpy.ndarray, src: int, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Copy data from the device into a host tensor. The data is read from the region of interest (ROI), a bounding box starting at coordinate (`px`, `py`) with width `w` and height `h`.

      <Expandable title="Parameters">
        <ParamField body="dest (numpy.ndarray)">
          A 3-D host tensor `A[h][w][elem_per_pe]`, wrapped in a 1-D array according to keyword argument `order`.
        </ParamField>

        <ParamField body="src (int)">
          A user-defined color if keyword argument `streaming=True`, symbol of a device tensor otherwise.
        </ParamField>

        <ParamField body="px (int)">
          x-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="py (int)">
          y-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="w (int)">
          Width of the ROI.
        </ParamField>

        <ParamField body="h (int)">
          Height of the ROI.
        </ParamField>

        <ParamField body="elem_per_pe (int)">
          Number of elements per PE. Each element must be 16-bit or 32-bit. If the tensor has `k` elements per PE, `elem_per_pe` is `k` even if the data type is 16-bit. For 16-bit data, the host tensor must be extended to 32 bits per element, with the upper 16 bits zero-filled.
        </ParamField>
      </Expandable>

      <Expandable title="Keyword Arguments">
        <ParamField body="streaming (bool)">
          Streaming mode if `True`, copy mode otherwise. In streaming mode, `src` is interpreted as a *color ID* and wavelets received on that color land in `dest` in arrival order. In copy mode, `src` is the symbol ID of a device tensor exported via `@export_symbol`.
        </ParamField>

        <ParamField body={<>data_type (<a href="#memcpydatatype">MemcpyDataType</a>)</>}>
          32-bit if `MemcpyDataType.MEMCPY_32BIT` or 16-bit if `MemcpyDataType.MEMCPY_16BIT`. Has no effect when `streaming=True`; in streaming mode the host receives raw 32-bit wavelets and the caller is responsible for any reinterpretation. The underlying numpy dtype of `dest` must be 32-bit-wide (e.g. `int32`/`uint32`/`float32`); 16-bit values must be packed into the low 16 bits of a 32-bit container.
        </ParamField>

        <ParamField body={<>order (<a href="#memcpyorder">MemcpyOrder</a>)</>}>
          Row-major if `MemcpyOrder.ROW_MAJOR` or column-major if `MemcpyOrder.COL_MAJOR`.
        </ParamField>

        <ParamField body="nonblock (bool)">
          Nonblocking if `True`, blocking otherwise.
        </ParamField>
      </Expandable>

      > * **Returns**: Handle to the task launched by [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h).
      > * **Return type**: [`Task`](#task)

      <Expandable title="Raises">
        <ParamField body="RuntimeError">
          Raised if any of the four mandatory kwargs (`streaming`, `data_type`, `order`, `nonblock`) is omitted, or if the underlying numpy dtype of `dest` is not 32-bit-wide.
        </ParamField>
      </Expandable>
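
      The buffer requirements above (a 1-D array of `h * w * elem_per_pe` elements with a 32-bit-wide dtype) can be checked on the host before the transfer. This is a minimal sketch with hypothetical helper names; it assumes that `MemcpyOrder.ROW_MAJOR` corresponds to NumPy's C order over `A[h][w][elem_per_pe]`.

      ```python
      import numpy as np

      def make_d2h_buffer(w, h, elem_per_pe, dtype=np.float32):
          """Allocate a flat destination buffer for a device-to-host copy."""
          dtype = np.dtype(dtype)
          if dtype.itemsize != 4:
              raise ValueError(f"dest dtype must be 32-bit wide, got {dtype}")
          return np.zeros(h * w * elem_per_pe, dtype=dtype)

      def as_roi_tensor(flat, w, h, elem_per_pe):
          """View the flat buffer as A[h][w][elem_per_pe] (row-major assumption)."""
          return flat.reshape(h, w, elem_per_pe)

      dest = make_d2h_buffer(w=3, h=2, elem_per_pe=4)
      # After a blocking memcpy_d2h(dest, symbol, px, py, 3, 2, 4, ...) with all four
      # mandatory kwargs, tensor[y, x] would hold the elements read from PE (px + x, py + y).
      tensor = as_roi_tensor(dest, w=3, h=2, elem_per_pe=4)
      ```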
    </ParamField>

    <span id="memcpy_h2d" />

    <ParamField body={<><span className="sig-name">memcpy_h2d</span><span className="sig-param">(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Send a host tensor to the device. The data is distributed across the region of interest (ROI), a bounding box starting at coordinate (`px`, `py`) with width `w` and height `h`.

      <Expandable title="Parameters">
        <ParamField body="dest (int)">
          A user-defined color if keyword argument `streaming=True`, symbol of a device tensor otherwise.
        </ParamField>

        <ParamField body="src (numpy.ndarray)">
          A 3-D host tensor `A[h][w][elem_per_pe]`, wrapped in a 1-D array according to keyword argument `order`.
        </ParamField>

        <ParamField body="px (int)">
          x-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="py (int)">
          y-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="w (int)">
          Width of the ROI.
        </ParamField>

        <ParamField body="h (int)">
          Height of the ROI.
        </ParamField>

        <ParamField body="elem_per_pe (int)">
          Number of elements per PE. Each element must be 16-bit or 32-bit. If the tensor has `k` elements per PE, `elem_per_pe` is `k` even if the data type is 16-bit. For 16-bit data, the host tensor must be extended to 32 bits per element, with the upper 16 bits zero-filled.
        </ParamField>
      </Expandable>

      > * **Keyword Arguments**: See [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h) keyword arguments.
      > * **Returns**: Handle to the task launched by [`SdkRuntime.memcpy_h2d()`](#memcpy_h2d).
      > * **Return type**: [`Task`](#task)
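
      For 16-bit data, the zero-extension described under `elem_per_pe` can be done with NumPy views. The helpers below are a sketch under that description only; `widen_16_to_32` and `narrow_32_to_16` are hypothetical names, not SDK functions.

      ```python
      import numpy as np

      def widen_16_to_32(values16):
          """Place each 16-bit element in the low half of a uint32, upper half zero."""
          a = np.ascontiguousarray(values16)
          if a.dtype.itemsize != 2:
              raise TypeError(f"expected a 16-bit dtype, got {a.dtype}")
          # Reinterpret the bits as uint16, then zero-extend to 32 bits.
          return a.view(np.uint16).astype(np.uint32)

      def narrow_32_to_16(words32, dtype16):
          """Recover 16-bit elements from the low half of 32-bit containers."""
          w = np.ascontiguousarray(words32).view(np.uint32)
          return (w & 0xFFFF).astype(np.uint16).view(dtype16)

      x = np.array([1.0, -2.5], dtype=np.float16)
      packed = widen_16_to_32(x)          # suitable as `src` with MEMCPY_16BIT
      restored = narrow_32_to_16(packed, np.float16)
      ```

      The same `narrow_32_to_16` step would apply to a buffer filled by [`memcpy_d2h`](#memcpy_d2h) with `MemcpyDataType.MEMCPY_16BIT`.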
    </ParamField>

    <span id="memcpy_h2d_colbcast" />

    <ParamField body={<><span className="sig-name">memcpy_h2d_colbcast</span><span className="sig-param">(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Broadcast a row of host data down columns of PEs. The data is distributed across the first row in the region of interest (ROI), which is a bounding box starting at coordinate (`px`, `py`) with width `w` and height `h`, and then broadcast down each column of the ROI.

      <Expandable title="Parameters">
        <ParamField body="dest (int)">
          A user-defined color if keyword argument `streaming=True`, symbol of a device tensor otherwise.
        </ParamField>

        <ParamField body="src (numpy.ndarray)">
          A 2-D host tensor `A[w][elem_per_pe]`, wrapped in a 1-D array according to keyword argument `order`.
        </ParamField>

        <ParamField body="px (int)">
          x-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="py (int)">
          y-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="w (int)">
          Width of the ROI.
        </ParamField>

        <ParamField body="h (int)">
          Height of the ROI.
        </ParamField>

        <ParamField body="elem_per_pe (int)">
          Number of elements per PE. Each element must be 16-bit or 32-bit. If the tensor has `k` elements per PE, `elem_per_pe` is `k` even if the data type is 16-bit. For 16-bit data, the host tensor must be extended to 32 bits per element, with the upper 16 bits zero-filled.
        </ParamField>
      </Expandable>

      > * **Keyword Arguments**: See [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h) keyword arguments.
      > * **Returns**: Handle to the task launched by [`SdkRuntime.memcpy_h2d_colbcast()`](#memcpy_h2d_colbcast).
      > * **Return type**: [`Task`](#task)
    </ParamField>

    <span id="memcpy_h2d_rowbcast" />

    <ParamField body={<><span className="sig-name">memcpy_h2d_rowbcast</span><span className="sig-param">(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Broadcast a column of host data across rows of PEs. The data is distributed across the first column in the region of interest (ROI), which is a bounding box starting at coordinate (`px`, `py`) with width `w` and height `h`, and then broadcast across each row of the ROI.

      <Expandable title="Parameters">
        <ParamField body="dest (int)">
          A user-defined color if keyword argument `streaming=True`, symbol of a device tensor otherwise.
        </ParamField>

        <ParamField body="src (numpy.ndarray)">
          A 2-D host tensor `A[h][elem_per_pe]`, wrapped in a 1-D array according to keyword argument `order`.
        </ParamField>

        <ParamField body="px (int)">
          x-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="py (int)">
          y-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="w (int)">
          Width of the ROI.
        </ParamField>

        <ParamField body="h (int)">
          Height of the ROI.
        </ParamField>

        <ParamField body="elem_per_pe (int)">
          Number of elements per PE. Each element must be 16-bit or 32-bit. If the tensor has `k` elements per PE, `elem_per_pe` is `k` even if the data type is 16-bit. For 16-bit data, the host tensor must be extended to 32 bits per element, with the upper 16 bits zero-filled.
        </ParamField>
      </Expandable>

      > * **Keyword Arguments**: See [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h) keyword arguments.
      > * **Returns**: Handle to the task launched by [`SdkRuntime.memcpy_h2d_rowbcast()`](#memcpy_h2d_rowbcast).
      > * **Return type**: [`Task`](#task)
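
      The two broadcast variants can be modeled on the host to see what each PE in the ROI ends up holding. This is an illustrative NumPy model only, not how the runtime moves data; the function names are hypothetical, and the flat `src` is assumed to be in row-major order.

      ```python
      import numpy as np

      def model_colbcast(src, w, h, elem_per_pe):
          """Model memcpy_h2d_colbcast: one row A[w][elem_per_pe] repeated down h rows."""
          row = np.asarray(src).reshape(w, elem_per_pe)
          return np.broadcast_to(row[np.newaxis, :, :], (h, w, elem_per_pe))

      def model_rowbcast(src, w, h, elem_per_pe):
          """Model memcpy_h2d_rowbcast: one column A[h][elem_per_pe] repeated across w columns."""
          col = np.asarray(src).reshape(h, elem_per_pe)
          return np.broadcast_to(col[:, np.newaxis, :], (h, w, elem_per_pe))

      # Index the result as [y, x] to get the elements held by the PE at ROI offset (x, y).
      down_cols = model_colbcast(np.arange(6), w=3, h=4, elem_per_pe=2)
      across_rows = model_rowbcast(np.arange(8), w=3, h=4, elem_per_pe=2)
      ```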
    </ParamField>

    <span id="memcpy_h2d_stride" />

    <ParamField body={<><span className="sig-name">memcpy_h2d_stride</span><span className="sig-param">(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, row_stride: int, col_stride: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Send a host tensor to the device with a stride pattern across receiving PEs. The data is distributed into the region of interest (ROI), which is a bounding box starting at coordinate (`px`, `py`) with width `w` and height `h`. Across a given row, `row_stride` determines the stride between receiving PEs within the ROI, and across a given column, `col_stride` determines the stride between receiving PEs.

      The first row and column to which data is sent is given by the PE (`px`, `py`) at the top-left of the ROI.

      We denote by `xi` and `eta` the number of columns and rows to which elements will be sent in the ROI, respectively. Since the ROI is `w` PEs wide and `h` PEs tall, `xi` and `eta` are given by `xi = 1 + floor((w - 1) / row_stride)` and `eta = 1 + floor((h - 1) / col_stride)`.

      As an example, consider an ROI starting at (0, 0) with width 6 and height 8, and row and column strides 3 and 2, respectively. Then PEs with x coordinate 0 or 3 and y coordinate 0, 2, 4, 6 will receive data from the host. In this case, `xi = 2` and `eta = 4`.

      <Expandable title="Parameters">
        <ParamField body="dest (int)">
          A user-defined color if keyword argument `streaming=True`, symbol of a device tensor otherwise.
        </ParamField>

        <ParamField body="src (numpy.ndarray)">
          A 3-D host tensor `A[xi][eta][elem_per_pe]`, wrapped in a 1-D array according to keyword argument `order`.
        </ParamField>

        <ParamField body="px (int)">
          x-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="py (int)">
          y-coordinate of start point of the ROI.
        </ParamField>

        <ParamField body="w (int)">
          Width of the ROI.
        </ParamField>

        <ParamField body="h (int)">
          Height of the ROI.
        </ParamField>

        <ParamField body="elem_per_pe (int)">
          Number of elements per PE. Each element must be 16-bit or 32-bit. If the tensor has `k` elements per PE, `elem_per_pe` is `k` even if the data type is 16-bit. For 16-bit data, the host tensor must be extended to 32 bits per element, with the upper 16 bits zero-filled.
        </ParamField>

        <ParamField body="row_stride (int)">
          Stride between PEs within a row in the ROI. Since the ROI is `w` PEs wide, the number of columns to which elements will be sent is `xi = 1 + floor((w - 1) / row_stride)`.
        </ParamField>

        <ParamField body="col_stride (int)">
          Stride between PEs within a column in the ROI. Since the ROI is `h` PEs tall, the number of rows to which elements will be sent is `eta = 1 + floor((h - 1) / col_stride)`.
        </ParamField>
      </Expandable>

      > * **Keyword Arguments**: See [`SdkRuntime.memcpy_d2h()`](#memcpy_d2h) keyword arguments.
      > * **Returns**: Handle to the task launched by [`SdkRuntime.memcpy_h2d_stride()`](#memcpy_h2d_stride).
      > * **Return type**: [`Task`](#task)
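
      The formulas for `xi` and `eta` can be checked on the host before sizing `src`. The sketch below computes both counts and the receiving coordinates; `stride_targets` is a hypothetical helper, not an SDK function.

      ```python
      def stride_targets(px, py, w, h, row_stride, col_stride):
          """Return (xi, eta, xs, ys) for a strided host-to-device copy."""
          xi = 1 + (w - 1) // row_stride    # number of receiving columns
          eta = 1 + (h - 1) // col_stride   # number of receiving rows
          xs = [px + i * row_stride for i in range(xi)]
          ys = [py + j * col_stride for j in range(eta)]
          return xi, eta, xs, ys

      # The worked example above: ROI at (0, 0), 6 wide, 8 tall, strides 3 and 2.
      print(stride_targets(0, 0, 6, 8, 3, 2))  # (2, 4, [0, 3], [0, 2, 4, 6])
      ```

      The host tensor then needs `xi * eta * elem_per_pe` elements.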
    </ParamField>

    <span id="read_symbol" />

    <ParamField body={<><span className="sig-name">read_symbol</span><span className="sig-param">(x: int, y: int, symbol_name: str, dtype: str = "uint8") → numpy.ndarray</span></>}>
      Read the value of a symbol on a specific PE. This method is only supported
      in the simulator, and requires that the [`SdkRuntime`](#sdkruntime) was
      configured with [`SimfabConfig(dump_core=True)`](#simfabconfig) and that
      [`stop()`](#stop) has already been called (the core dump is flushed at
      that point).

      <Expandable title="Parameters">
        <ParamField body="x (int)">
          x-coordinate of the PE.
        </ParamField>

        <ParamField body="y (int)">
          y-coordinate of the PE.
        </ParamField>

        <ParamField body="symbol_name (str)">
          Name of the symbol to read.
        </ParamField>

        <ParamField body="dtype (str)">
          Numpy dtype string for interpreting the returned data. Default is `"uint8"`.
        </ParamField>
      </Expandable>

      > * **Returns**: Numpy array containing the symbol's data, viewed as the specified dtype.
      > * **Return type**: `numpy.ndarray`
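
      The `dtype` argument only changes how the symbol's bytes are viewed. The sketch below illustrates that relationship with plain NumPy, assuming the default `"uint8"` result is the raw byte content of the symbol; `reinterpret` is a hypothetical helper.

      ```python
      import numpy as np

      def reinterpret(raw_bytes, dtype):
          """View a uint8 array as another dtype without copying the data."""
          raw = np.ascontiguousarray(raw_bytes, dtype=np.uint8)
          return raw.view(np.dtype(dtype))

      values = np.array([1.0, 2.0], dtype=np.float32)
      raw = values.view(np.uint8)              # 8 bytes, as a "uint8" read would show
      as_f32 = reinterpret(raw, "float32")     # 2 elements, as a "float32" read would show
      ```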
    </ParamField>

    <span id="receive" />

    <ParamField body={<><span className="sig-name">receive</span><span className="sig-param">(port: Union[PortId, str], dest: numpy.ndarray, n_wavelets: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Part of the [`SdkRuntime`](#sdkruntime) direct link API.

      Receive `n_wavelets` wavelets via the program port `port` into array `dest`.

      <Expandable title="Parameters">
        <ParamField body="port (Union[PortId, str])">
          Program port from which data will be received. Can be specified by a numerical port ID or by name.
        </ParamField>

        <ParamField body="dest (numpy.ndarray)">
          Destination array into which the data will be received.
        </ParamField>

        <ParamField body="n_wavelets (int)">
          Number of wavelets to receive.
        </ParamField>
      </Expandable>

      <Expandable title="Keyword Arguments">
        <ParamField body="nonblock (bool)">
          Nonblocking if `True`, blocking otherwise.
        </ParamField>
      </Expandable>

      > * **Returns**: Handle to the task launched by [`SdkRuntime.receive()`](#receive).
      > * **Return type**: [`Task`](#task)
    </ParamField>

    <span id="receive_tofile" />

    <ParamField body={<><span className="sig-name">receive_tofile</span><span className="sig-param">(port: Union[PortId, str], outfile: str, **kwargs) → <a href="#task">Task</a></span></>}>
      Part of the [`SdkRuntime`](#sdkruntime) direct link API.

      Receive data via the program port `port` and write to a file named `outfile`.

      <Expandable title="Parameters">
        <ParamField body="port (Union[PortId, str])">
          Program port from which data will be received. Can be specified by a numerical port ID or by name.
        </ParamField>

        <ParamField body="outfile (str)">
          Name of file to which received output is written.
        </ParamField>
      </Expandable>

      <Expandable title="Keyword Arguments">
        <ParamField body="nonblock (bool)">
          Nonblocking if `True`, blocking otherwise.
        </ParamField>
      </Expandable>

      > * **Returns**: Handle to the task launched by [`SdkRuntime.receive_tofile()`](#receive_tofile).
      > * **Return type**: [`Task`](#task)
    </ParamField>

    <span id="report_port_infos" />

    <ParamField body={<><span className="sig-name">report_port_infos</span><span className="sig-param">()</span></>}>
      Part of the [`SdkRuntime`](#sdkruntime) direct link API.

      Reports the port name, color and absolute coordinate of every program data port.
    </ParamField>

    <span id="run" />

    <ParamField body={<><span className="sig-name">run</span><span className="sig-param">()</span></>}>
      Start the simfabric or WSE run and wait for commands from the host runtime.
    </ParamField>

    <span id="send" />

    <ParamField body={<><span className="sig-name">send</span><span className="sig-param">(port: Union[PortId, str], src: numpy.ndarray, n_wavelets: int, **kwargs) → <a href="#task">Task</a></span></>}>
      Part of the [`SdkRuntime`](#sdkruntime) direct link API.

      Stream `n_wavelets` wavelets from `src` to the device via the port `port`.

      <Expandable title="Parameters">
        <ParamField body="port (Union[PortId, str])">
          Target program port to which data is streamed. Can be specified by a numerical port ID or by name.
        </ParamField>

        <ParamField body="src (numpy.ndarray)">
          Input source array whose contents will be streamed to the device.
        </ParamField>

        <ParamField body="n_wavelets (int)">
          Number of wavelets to send.
        </ParamField>
      </Expandable>

      <Expandable title="Keyword Arguments">
        <ParamField body="nonblock (bool)">
          Nonblocking if `True`, blocking otherwise.
        </ParamField>
      </Expandable>

      > * **Returns**: Handle to the task launched by [`SdkRuntime.send()`](#send).
      > * **Return type**: [`Task`](#task)
    </ParamField>

    <span id="send-1" />

    <ParamField body={<><span className="sig-name">send</span><span className="sig-param">(port: Union[PortId, str], src: numpy.ndarray, **kwargs) → <a href="#task">Task</a></span></>}>
      Part of the [`SdkRuntime`](#sdkruntime) direct link API.

      Variant of [`send`](#send) without the `n_wavelets` argument, available when `src.dtype` is exactly `np.int32`, `np.uint32` or `np.float32`. The runtime then infers `n_wavelets` from `len(src)`.
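
      The inference rule can be mirrored on the host to decide whether `n_wavelets` must be passed explicitly. This is a sketch of the documented rule only; `inferred_n_wavelets` is a hypothetical helper and does not call the runtime.

      ```python
      import numpy as np

      # The dtypes for which the runtime infers the wavelet count.
      _WORD_DTYPES = (np.dtype(np.int32), np.dtype(np.uint32), np.dtype(np.float32))

      def inferred_n_wavelets(src):
          """Return len(src) if src has an exact 32-bit word dtype, else raise."""
          if src.dtype not in _WORD_DTYPES:
              raise TypeError(f"n_wavelets must be given explicitly for dtype {src.dtype}")
          return len(src)
      ```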
    </ParamField>

    <span id="stop" />

    <ParamField body={<><span className="sig-name">stop</span><span className="sig-param">()</span></>}>
      Wait for all pending commands (data transfers and kernel function calls) to complete and then stop simfabric or WSE. After this call is complete, no new commands will be accepted for this [`SdkRuntime`](#sdkruntime) object.

      Nonblocking D2H destination buffers are fully populated before `stop()` returns. If the simulator was configured with `dump_core=True`, the core dump is flushed at this point, so calls to [`debug_util.get_symbol()`](#get_symbol) must occur after `stop()` rather than before.

      [`SdkRuntime.stop()`](#stop) must be called to end a program. Otherwise, the runtime will emit an error.
    </ParamField>

    <span id="task_wait" />

    <ParamField body={<><span className="sig-name">task_wait</span><span className="sig-param">(task_handle: <a href="#task">Task</a>)</span></>}>
      Wait for the task `task_handle` to complete.

      <Expandable title="Parameters">
        <ParamField body={<>task_handle (<a href="#task">Task</a>)</>}>
          Handle to a task previously launched by [`SdkRuntime`](#sdkruntime).
        </ParamField>
      </Expandable>
    </ParamField>
  </Expandable>
</ParamField>

### SdkTarget

<span id="SdkTarget" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">SdkTarget</span></>} type="Bases: Enum">
  Specifies a target compilation architecture.

  > **Values**:
  >
  > * WSE2
  > * WSE3
</ParamField>

### SimfabConfig

<span id="SimfabConfig" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">SimfabConfig</span><span className="sig-param">(num_threads: int = 16, suppress_trace: bool = False, dump_core: bool = False, core_path: Optional[Union[pathlib.Path, str]] = None)</span></>} type="Bases: object">
  Specifies simfab configuration for simulator runs.

  <Expandable title="Parameters">
    <ParamField body="num_threads (int)">
      Number of CPU threads used by the simulator. Default is 16; maximum is 64. Note that the legacy `SdkRuntime(bindir, simfab_numthreads=N)` constructor defaults to 5 threads instead.
    </ParamField>

    <ParamField body="suppress_trace (bool)">
      If `True`, suppresses generation of `simfab_traces` when running.
    </ParamField>

    <ParamField body="dump_core (bool)">
      If `True`, produces a coredump after execution ends.
    </ParamField>

    <ParamField body="core_path (Union[pathlib.Path, str, None])">
      File name of the produced coredump. If `None` (default), the name `out.core` is used.
    </ParamField>
  </Expandable>
</ParamField>

### Task

<span id="Task" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">Task</span></>}>
  Handle to a task launched by [`SdkRuntime`](#sdkruntime).
</ParamField>

### get\_platform

<span id="get_platform" />

<ParamField body={<>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">get_platform</span><span className="sig-param">(cmaddr: Optional[str] = None, config: <a href="#simfabconfig">SimfabConfig</a> = SimfabConfig(), target: <a href="#sdktarget">SdkTarget</a> = SdkTarget::WSE3) → <a href="#sdkexecutionplatform">SdkExecutionPlatform</a></span></>}>
  Constructs an [`SdkExecutionPlatform`](#sdkexecutionplatform) object configured by simulator or system settings and target architecture.

  <Expandable title="Parameters">
    <ParamField body="cmaddr (Union[str, None])">
      CM address in `"IP_ADDRESS:PORT"` format. `None` (default) or the empty string chooses the simulator.
    </ParamField>

    <ParamField body={<>config (<a href="#simfabconfig">SimfabConfig</a>)</>}>
      Simulator configuration object. Ignored when `cmaddr` is provided.
    </ParamField>

    <ParamField body={<>target (<a href="#sdktarget">SdkTarget</a>)</>}>
      Target architecture for the simulator or system.
    </ParamField>
  </Expandable>

  > * **Returns**: A configured execution platform object.
  > * **Return type**: [`SdkExecutionPlatform`](#sdkexecutionplatform)
</ParamField>

### get\_simulator

<span id="get_simulator" />

<ParamField body={<>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">get_simulator</span><span className="sig-param">(config: <a href="#simfabconfig">SimfabConfig</a> = SimfabConfig(), target: <a href="#sdktarget">SdkTarget</a> = SdkTarget::WSE3) → <a href="#sdkexecutionplatform">SdkExecutionPlatform</a></span></>}>
  Constructs an [`SdkExecutionPlatform`](#sdkexecutionplatform) object for the simulator.

  <Expandable title="Parameters">
    <ParamField body={<>config (<a href="#simfabconfig">SimfabConfig</a>)</>}>
      Simulator configuration object.
    </ParamField>

    <ParamField body={<>target (<a href="#sdktarget">SdkTarget</a>)</>}>
      Target architecture for the simulator.
    </ParamField>
  </Expandable>

  > * **Returns**: A configured execution platform object.
  > * **Return type**: [`SdkExecutionPlatform`](#sdkexecutionplatform)
</ParamField>

### get\_system

<span id="get_system" />

<ParamField body={<>cerebras.sdk.runtime.sdkruntimepybind.<span className="sig-name">get_system</span><span className="sig-param">(cmaddr: str) → <a href="#sdkexecutionplatform">SdkExecutionPlatform</a></span></>}>
  Constructs an [`SdkExecutionPlatform`](#sdkexecutionplatform) object for a real system.

  <Expandable title="Parameters">
    <ParamField body="cmaddr (str)">
      CM address in `"IP_ADDRESS:PORT"` format.
    </ParamField>
  </Expandable>

  > * **Returns**: A configured execution platform object.
  > * **Return type**: [`SdkExecutionPlatform`](#sdkexecutionplatform)
</ParamField>

## sdk\_utils module

Utility functions for common operations with [`SdkRuntime`](#sdkruntime).
Import from `cerebras.sdk.sdk_utils`.

### calculate\_cycles

<span id="calculate_cycles" />

<ParamField body={<>cerebras.sdk.sdk_utils.<span className="sig-name">calculate_cycles</span><span className="sig-param">(timestamp_buf: numpy.ndarray) → numpy.int64</span></>}>
  Converts values in `timestamp_buf` returned from device into a human-readable elapsed cycle count.

  <Expandable title="Parameters">
    <ParamField body="timestamp_buf (numpy.ndarray)">
      Array returned from device containing elapsed timestamp data.
    </ParamField>
  </Expandable>

  > * **Returns**: Elapsed cycle count.
  > * **Return type**: `numpy.int64`

  **Example**:

  Consider the following CSL snippet, which records a start and an end timestamp and packs them into a single array that is copied back to the host to compute an elapsed cycle count:

  ```csl theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
  // import time module and create timestamp buffers
  const timestamp = @import_module("<time>");
  var tsc_end_buf = @zeros([timestamp.tsc_size_words]u16);
  var tsc_start_buf = @zeros([timestamp.tsc_size_words]u16);

  // create elapsed timer buffer and advertise to host
  var timer_buf = @zeros([3]f32);
  var ptr_timer_buf: [*]f32 = &timer_buf;

  timestamp.enable_tsc();
  // record starting timestamp
  timestamp.get_timestamp(&tsc_start_buf);

  // perform some operation for which you want to calculate elapsed cycles

  // record ending timestamp
  timestamp.get_timestamp(&tsc_end_buf);
  timestamp.disable_tsc();

  var lo_: u16 = 0;
  var hi_: u16 = 0;
  var word: u32 = 0;

  lo_ = tsc_start_buf[0];
  hi_ = tsc_start_buf[1];
  timer_buf[0] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

  lo_ = tsc_start_buf[2];
  hi_ = tsc_end_buf[0];
  timer_buf[1] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

  lo_ = tsc_end_buf[1];
  hi_ = tsc_end_buf[2];
  timer_buf[2] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
  ```

  Then the elapsed cycles can be calculated on the host with:

  ```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
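  import numpy as np

  from cerebras.sdk import sdk_utils
  from cerebras.sdk.runtime.sdkruntimepybind import MemcpyDataType, MemcpyOrder

  # `runner` is an SdkRuntime object on which the kernel has already run
  # Get symbol for timer_buf on device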
  symbol_timer_buf = runner.get_id("timer_buf")

  # Copy back timer_buf from all width x height PEs
  data = np.zeros((width*height*3, 1), dtype=np.uint32)
  runner.memcpy_d2h(data, symbol_timer_buf, 0, 0, width, height, 3, streaming=False,
    data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
  elapsed_time_hwl = data.view(np.float32).reshape((height, width, 3))

  # Print elapsed cycles for each PE
  for pe_x in range(width):
    for pe_y in range(height):
      cycle_cnt = sdk_utils.calculate_cycles(elapsed_time_hwl[pe_y, pe_x, :])
      print("Elapsed cycles on PE ", pe_x, ", ", pe_y, ": ", cycle_cnt)
  ```
</ParamField>
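
To make the packing in the CSL snippet above concrete, the following plain-NumPy sketch rebuilds the two 48-bit timestamps from the three 32-bit words and subtracts them. It is not the implementation of `calculate_cycles()`, which may differ. The helper names `pack_timestamps` and `unpack_elapsed_cycles` are hypothetical, and the sketch assumes each timestamp is stored as three `u16` words with the least-significant word first.

```python
import numpy as np

def pack_timestamps(start_words, end_words):
    """Mimic the CSL packing: six u16 words -> three 32-bit words viewed as f32."""
    words16 = np.array(list(start_words) + list(end_words), dtype=np.uint16)
    lo = words16[0::2].astype(np.uint32)   # start[0], start[2], end[1]
    hi = words16[1::2].astype(np.uint32)   # start[1], end[0],   end[2]
    return ((hi << np.uint32(16)) | lo).view(np.float32)

def unpack_elapsed_cycles(timer_buf):
    """Recover end - start from a 3-element float32 buffer packed as above."""
    words32 = np.ascontiguousarray(timer_buf, dtype=np.float32).view(np.uint32)
    words16 = []
    for w in words32:
        w = int(w)
        words16.append(w & 0xFFFF)   # low half of the 32-bit word
        words16.append(w >> 16)      # high half of the 32-bit word

    def u48(words):
        # Assumption: least-significant 16-bit word comes first.
        return words[0] | (words[1] << 16) | (words[2] << 32)

    return u48(words16[3:6]) - u48(words16[0:3])

# Round-trip check with made-up timestamps that are 1000 cycles apart
def to_words(value):
    return [(value >> shift) & 0xFFFF for shift in (0, 16, 32)]

start = 0x000123456789
buf = pack_timestamps(to_words(start), to_words(start + 1000))
elapsed = unpack_elapsed_cycles(buf)
```

The buffer is only viewed as `float32`, never converted, so the bit patterns survive the round trip even when they do not correspond to meaningful floating-point values.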

### input\_array\_to\_u32

<span id="input_array_to_u32" />

<ParamField body={<>cerebras.sdk.sdk_utils.<span className="sig-name">input_array_to_u32</span><span className="sig-param">(arr: numpy.ndarray, sentinel: Optional[int], fast_dim_sz: int) → numpy.ndarray</span></>}>
  Converts a 16-bit tensor to a 32-bit tensor of type `u32` for use with `memcpy`. The parameter `sentinel` selects between two ways of extending the 16-bit data. If `sentinel` is `None`, the upper 16 bits are zero-padded. If `sentinel` is not `None`, the index of the innermost dimension of the array is packed into the upper 16 bits.

  <Expandable title="Parameters">
    <ParamField body="arr (numpy.ndarray)">
      A flat (1-D) numpy array with 2 or 4 bytes per element. If your data is multi-dimensional, flatten it first (for example, with `arr.ravel()`).
    </ParamField>

    <ParamField body="sentinel (Optional[int])">
      For 16-bit input data, if this parameter is not `None`, the index of the innermost dimension is packed into the high bits of the 32-bit wavelet. If `sentinel` is `None`, the high bits are zeros.
    </ParamField>

    <ParamField body="fast_dim_sz (int)">
      If `sentinel` is not `None`, specifies size of fastest-changing dimension for generating the index.
    </ParamField>
  </Expandable>

  > * **Returns**: Numpy array of 32-bit `u32` values holding the extended contents of `arr`.
  > * **Return type**: `numpy.ndarray`
</ParamField>
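
The two extension modes described for `input_array_to_u32()` can be illustrated with plain NumPy. This is a sketch under stated assumptions, not the library's code: `extend_16_to_32` is a hypothetical stand-in, it handles only the 2-byte case, and it assumes the packed index is the element's position within the fastest-changing dimension (`i % fast_dim_sz` for a flat array).

```python
import numpy as np

def extend_16_to_32(arr, sentinel, fast_dim_sz):
    """Illustrative stand-in for the two 16-to-32-bit extensions."""
    if arr.ndim != 1 or arr.dtype.itemsize != 2:
        raise ValueError("expected a flat array with 2 bytes per element")
    # Reinterpret the raw 16-bit payload and widen it into the low half-word.
    low = arr.view(np.uint16).astype(np.uint32)
    if sentinel is None:
        return low  # upper 16 bits stay zero
    # Assumption: the index is the position within the fastest-changing dimension.
    index = np.arange(arr.size, dtype=np.uint32) % np.uint32(fast_dim_sz)
    return (index << np.uint32(16)) | low

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float16)
padded = extend_16_to_32(x, None, 2)   # zero-padded upper halves
indexed = extend_16_to_32(x, 0, 2)     # upper halves carry indices 0, 1, 0, 1
```

In both modes the low 16 bits of each output word are the unmodified bits of the input element, so the original `float16` values can be recovered by masking with `0xFFFF`.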

### memcpy\_view

<span id="memcpy_view" />

<ParamField body={<>cerebras.sdk.sdk_utils.<span className="sig-name">memcpy_view</span><span className="sig-param">(arr: numpy.ndarray, datatype: numpy.dtype) → numpy.ndarray.view</span></>}>
  Returns a 32-, 16-, or 8-bit view of a 32-bit numpy array. The 16- and 8-bit views expose only the lower 16 or 8 bits of each 32-bit word.

  <Expandable title="Parameters">
    <ParamField body="arr (numpy.ndarray)">
      A numpy array with 4 bytes per element on which the numpy view will be created.
    </ParamField>

    <ParamField body="datatype (numpy.dtype)">
      The numpy data type which should be used in the output view. The itemsize must be 1, 2, or 4 bytes.
    </ParamField>
  </Expandable>

  > * **Returns**: Numpy view into `arr` with specified numpy data type.
  > * **Return type**: `numpy.ndarray.view`

  **Example**:

  [`memcpy_view()`](#memcpy_view) simplifies the use of various precision data types when copying between host and device. Consider the following Python host code which creates a `float16` view into a numpy array. Note that this array *must* be 32-bit. The user can fill the array with `float16` data, and copy it to an array on the device with CSL data type `f16`.

  ```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
  x_symbol = runner.get_id('x')
  # This container array must be 32-bit
  x_container = np.zeros(N, dtype=np.uint32)

  x = sdk_utils.memcpy_view(x_container, np.float16)
  x.fill(0.5)

  runner.memcpy_h2d(x_symbol, x_container, 0, 0, 1, 1, N,
              streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT,
              order=MemcpyOrder.ROW_MAJOR, nonblock=False)
  ```
</ParamField>
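
To show what a view of "only the lower bits of each 32-bit word" looks like, here is a plain-NumPy sketch that does not require the SDK. It is an illustration of the idea, not the implementation of `memcpy_view()`; the helper name `low_bits_view` is hypothetical.

```python
import sys
import numpy as np

def low_bits_view(container, dtype):
    """Illustrative stand-in: view the low bytes of each 32-bit word as `dtype`."""
    dtype = np.dtype(dtype)
    if container.dtype.itemsize != 4 or dtype.itemsize not in (1, 2, 4):
        raise ValueError("need a 4-byte container and a 1-, 2-, or 4-byte dtype")
    per_word = 4 // dtype.itemsize
    # On a little-endian host, the low bits of each word come first in memory.
    first = 0 if sys.byteorder == "little" else per_word - 1
    return container.view(dtype)[..., first::per_word]

container = np.zeros(4, dtype=np.uint32)
x = low_bits_view(container, np.float16)
x.fill(0.5)  # writes go through to the low 16 bits of each container word
```

Because `x` is a view rather than a copy, filling it modifies `container` in place. Each 32-bit word then holds the `float16` bit pattern of `0.5` (`0x3800`) in its low half and zeros in its high half.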

## debug\_util module

Utilities for parsing debug output and core files of a simulator run.
Import from `cerebras.sdk.debug.debug_util`.

### debug\_util

<span id="debug_util" />

<ParamField body={<><span className="sig-kw">class </span>cerebras.sdk.debug.debug_util.<span className="sig-name">debug_util</span><span className="sig-param">(bindir: Union[pathlib.Path, str])</span></>} type="Bases: object">
  Loads ELF files in `bindir` in order to dump symbols for debugging.

  The user does not need to export the symbols in the kernel. [`debug_util`](#debug_util) dumps the core and looks for the symbols in the ELFs. If the symbol at `Px.y` is not found in the corresponding ELF, [`debug_util`](#debug_util) emits an error.

  The most common errors are either: 1) a wrong coordinate passed in [`debug_util.get_symbol()`](#get_symbol), or 2) a correct coordinate, but the symbol has been removed due to compiler optimization. One can use `readelf` to check if the symbol exists or not. If not, the user can export the symbol in the kernel to keep the symbol in the ELF.

  The functionality of this class is only supported in the simulator.

  <Expandable title="content">
    <Expandable title="Parameters">
      <ParamField body="bindir (Union[pathlib.Path, str])">
        Path to ELF files.
      </ParamField>
    </Expandable>

    **Example**:

    ```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
    import numpy as np

    from cerebras.sdk.debug.debug_util import debug_util
    from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime

    # run the app
    # dirname is the path to ELFs
    simulator = SdkRuntime(dirname)
    simulator.load()
    simulator.run()
    ...
    simulator.stop()

    # retrieve symbols after the run
    debug_mod = debug_util(dirname)
    # assume the core rectangle starts at P4.1, the dimension is
    # width-by-height and we want to retrieve the symbol y for every PE
    core_offset_x = 4
    core_offset_y = 1
    for py in range(height):
      for px in range(width):
        t = debug_mod.get_symbol(core_offset_x+px, core_offset_y+py, 'y', np.float32)
        print(f"At (py, px) = {py, px}, symbol y = {t}")
    ```

    <span id="get_symbol" />

    <ParamField body={<><span className="sig-name">get_symbol</span><span className="sig-param">(col: int, row: int, symbol: str, dtype: numpy.dtype) → numpy.ndarray</span></>}>
      Read the value of `symbol` of given type at given PE coordinates. Note that each call to this function scans the whole fabric, so prefer [`debug_util.get_symbol_rect()`](#get_symbol_rect) over calling this in a loop.

      <Expandable title="Parameters">
        <ParamField body="col (int)">
          Column (x-coordinate) of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
        </ParamField>

        <ParamField body="row (int)">
          Row (y-coordinate) of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
        </ParamField>

        <ParamField body="symbol (str)">
          Name of the symbol to be read.
        </ParamField>

        <ParamField body="dtype (numpy.dtype)">
          Numpy data type of values contained by symbol.
        </ParamField>
      </Expandable>

      > * **Returns**: Numpy array of output values read at symbol.
      > * **Return type**: `numpy.ndarray`
    </ParamField>

    <span id="get_symbol_rect" />

    <ParamField body={<><span className="sig-name">get_symbol_rect</span><span className="sig-param">(rectangle: Tuple[Tuple[int, int], Tuple[int, int]], symbol: str, dtype: numpy.dtype) → numpy.ndarray</span></>}>
      Read the value of `symbol` of given type for a rectangle of PEs.

      <Expandable title="Parameters">
        <ParamField body="rectangle (Tuple[Tuple[int, int], Tuple[int, int]])">
          Rectangle specified as `((col, row), (width, height))`, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
        </ParamField>

        <ParamField body="symbol (str)">
          Name of the symbol to be read.
        </ParamField>

        <ParamField body="dtype (numpy.dtype)">
          Numpy data type of values contained by symbol.
        </ParamField>
      </Expandable>

      > * **Returns**: Numpy array of output values read at symbol. The first two dimensions of the returned array are PE coordinates `(column, row)` relative to the rectangle.
      > * **Return type**: `numpy.ndarray`
    </ParamField>

    <span id="read_trace" />

    <ParamField body={<><span className="sig-name">read_trace</span><span className="sig-param">(px: int, py: int, name: str) → list</span></>}>
      Parse a CSL trace buffer with name `name` at the given PE coordinates.

      <Expandable title="Parameters">
        <ParamField body="px (int)">
          x-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
        </ParamField>

        <ParamField body="py (int)">
          y-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
        </ParamField>

        <ParamField body="name (str)">
          Name of the trace buffer to be read.
        </ParamField>
      </Expandable>

      > * **Returns**: Heterogeneous list of trace values.
      > * **Return type**: `list`

      **Example**:

      Consider a device kernel which initializes a trace buffer with the CSL `debug` library and uses it to record values:

      ```csl theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
      const debug_mod = @import_module("<debug>", .{.key = "my_trace", .buffer_size = 100});

      fn foo() void {
        debug_mod.trace_timestamp();
        debug_mod.trace_string("Bar");
        debug_mod.trace_i16(1);
      }
      ```

      Then the trace can be read in the host code with:

      ```python theme={"languages":{"custom":["/languages/csl-tmlanguage.json"]}}
      trace_output = debug_mod.read_trace(4, 1, 'my_trace')
      print(trace_output)
      ```

      If `foo` was executed only once, then `trace_output` will be a heterogeneous list containing a timestamp, the string "Bar", and the number 1.
    </ParamField>
  </Expandable>
</ParamField>
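
The coordinate conventions of `get_symbol_rect()` are easy to mix up, so the sketch below walks through them with a stand-in array and no SDK calls. The helper `fabric_rect` is hypothetical, and the sketch assumes that any dimensions after the first two hold the symbol's own values.

```python
import numpy as np

def fabric_rect(core_offset_x, core_offset_y, width, height):
    """Hypothetical helper: build ((col, row), (width, height)) in fabric coordinates."""
    return ((core_offset_x, core_offset_y), (width, height))

width, height, n = 3, 2, 4
rect = fabric_rect(4, 1, width, height)  # core rectangle starting at P4.1

# Stand-in for debug_mod.get_symbol_rect(rect, 'y', np.float32): an array whose
# first two dimensions are (column, row) relative to the rectangle.
y_rect = np.arange(width * height * n, dtype=np.float32).reshape(width, height, n)

y_at_col2_row1 = y_rect[2, 1]             # PE at column 2, row 1 of the rectangle
y_row_major = y_rect.transpose(1, 0, 2)   # reorder to (row, column, n)
```

Transposing to `(row, column, n)` gives the same `(height, width, ...)` layout that the `memcpy_d2h` example for `calculate_cycles()` obtains after its reshape, which can make comparisons between the two easier.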
