
zymtrace AI Flamegraph: Why We Ditched TypeScript for Rust and WebAssembly

Timo von Hartz

May 16, 2025

The zymtrace AI flamegraph renderer was built from the ground up in Rust and WebAssembly (WASM), replacing 19k lines of TypeScript with 3k lines of Rust and achieving dramatically faster rendering, eliminating lag and stutter completely.

This post describes how we built the AI flamegraph, why we chose Rust and WASM, and the key techniques that made it fast, efficient, and scalable.

Background

When we launched zymtrace back in December 2024, we prioritized speed over perfection, taking shortcuts to ship quickly and gather feedback. That worked for a while, but as more customers started profiling diverse workloads, issues with our flamegraph and sandwich views became apparent. The tool meant to help diagnose performance bottlenecks was itself struggling. It was frustrating — we had built something useful, but it wasn’t scaling with large workloads.

The breaking point

Our original flamegraph renderer was built on a heavily modified fork of Speedscope, an open-source project. It served us well in the early days, but as our users scaled up their workloads, we ran into serious limitations:

  • Large graphs became painfully slow – the renderer would lag and stutter when handling real-world datasets.
  • Customization was a nightmare – every new feature required deep, painful refactoring.
  • TypeScript wasn't cutting it – traversing massive profile trees in JavaScript simply wasn't fast enough for the scale we needed.

Speedscope is a solid open-source project, and we think they’ve done incredible work. However, it’s a large and complex TypeScript codebase, and as a team of Rust engineers, we chose to play to our strengths and build a renderer in Rust from scratch. Our needs simply evolved in a different direction—one that required us to move faster, reduce technical debt, and fully own the performance path.

Implementation

Our stack consists of Rust, compiled to WASM, with rendering powered by egui and WebGL. This produces the core flamegraph rendering component, which is embedded into our React-based frontend.
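To give a rough idea of what that boundary looks like, here is a stripped-down sketch of exposing a renderer handle to JavaScript via wasm-bindgen. The `FlamegraphRenderer` name and its methods are hypothetical; our actual integration goes through egui's web backend and looks different.

use wasm_bindgen::prelude::*;

/// Hypothetical handle exposing the WASM renderer to the JavaScript side.
#[wasm_bindgen]
pub struct FlamegraphRenderer {
    /// Raw rkyv-encoded flamegraph bytes received from the backend.
    data: Vec<u8>,
}

#[wasm_bindgen]
impl FlamegraphRenderer {
    /// Constructs the renderer from bytes passed in as a `Uint8Array`.
    #[wasm_bindgen(constructor)]
    pub fn new(data: Vec<u8>) -> FlamegraphRenderer {
        FlamegraphRenderer { data }
    }

    /// Size of the flamegraph payload, callable from React.
    pub fn payload_len(&self) -> usize {
        self.data.len()
    }
}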

Our main goal was to reduce the work that has to be done, and to lay out the data in CPU- and memory-friendly ways.

Interning

Strings make up by far the largest share of the data we store per frame in a stack trace, as can be seen from the struct we introduced to represent frames:

#[derive(Debug, Eq, Hash, PartialEq)]
#[derive(Archive, Serialize)]
#[rkyv(attr(derive(Debug, PartialEq)))]
pub struct Frame {
    pub kind: Arc<String>,
    pub name: Arc<String>,
    pub file_name: Option<Arc<String>>,
    pub executable: Option<Arc<String>>,
    #[rkyv(with = rkyv::with::Niche)]
    pub line: Option<NonZero<u32>>,
    pub file_id: Option<[u8; 16]>,
    pub address: Option<u64>,
    pub inline_idx: u16,
    pub meta: Meta,
}

However, many of these strings are duplicated across the entire flamegraph. Consider, for example, the following flamegraph, obtained from profiling a CUDA application that launches the same kernel on four different streams:

Relevant code from our CUDA application

for (int i = 0; i < 100; i++) {
  matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_C, d_A, d_B);
  matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream2>>>(d_C, d_A, d_B);
  matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream3>>>(d_C, d_A, d_B);
  matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream4>>>(d_C, d_A, d_B);
}
  
zymtrace AI flamegraph for a CUDA app

Even in this trivial example, most of the file and executable names, function names, kernel names, and stall reasons (stall reasons tell you exactly why an instruction in a kernel was waiting - a feature of our GPU profiler) are relatively long strings that are duplicated across the graph.

As part of the switch to our custom Rust renderer, we also changed the backend to intern (i.e., deduplicate) all strings. Not only did this yield performance and memory usage improvements in the backend, it also plays very well with rkyv, the serialization library we use for communicating with the WASM renderer.

In addition to supporting zero-copy deserialization (so we never allocate temporary Nodes or strings on the heap for the purpose of iteration), rkyv supports storing deduplicated strings (or anything held by an Arc or Rc) only once. This significantly reduces the amount of data we need to send across the network, and the amount of data we need to process in the frontend.
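To illustrate the idea, here is a minimal sketch of string interning. The `StringInterner` type is hypothetical and the real backend implementation differs, but the principle is the same: identical strings share a single `Arc`, which rkyv then serializes only once.

use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical string interner: identical strings share one allocation.
#[derive(Default)]
struct StringInterner {
    strings: HashMap<String, Arc<String>>,
}

impl StringInterner {
    /// Returns the canonical `Arc<String>` for `s`, inserting it on first use.
    fn intern(&mut self, s: &str) -> Arc<String> {
        if let Some(existing) = self.strings.get(s) {
            return Arc::clone(existing);
        }
        let arc = Arc::new(s.to_owned());
        self.strings.insert(s.to_owned(), Arc::clone(&arc));
        arc
    }
}

fn main() {
    let mut interner = StringInterner::default();
    let a = interner.intern("matrixMul(float*, float*, float*)");
    let b = interner.intern("matrixMul(float*, float*, float*)");
    // Both handles point to the same allocation.
    assert!(Arc::ptr_eq(&a, &b));
}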

Even for the simple flamegraph shown above, this single optimization reduces the size by ~100x compared to JSON (the JSON is minified, not pretty-printed):

th0rex@void ~/d/wasm-blog-post-25-04-17-5cMY> du -sh raw.rkyv
44K	raw.rkyv
th0rex@void ~/d/wasm-blog-post-25-04-17-5cMY> du -sh flamewire.json
4.4M	flamewire.json

Many frames are themselves fully identical but appear in different places in a stack trace (see, e.g., the kernels in the flamegraph above), so we also intern and deduplicate individual frames. Sample counts are excluded, as they change frequently and wouldn't benefit from deduplication:

#[derive(Clone, Debug)]
#[derive(Archive, Serialize)]
#[rkyv(attr(derive(Debug, PartialEq)))]
pub struct NodeInfo {
    /// Frame for this node.
    pub frame: Arc<Frame>,

    /// Number of samples, including samples from children.
    pub num_samples: f32,

    /// Number of samples of this node, excluding samples from children.
    pub num_self_samples: f32,
}

(Note that we could have implemented a custom string and frame table for deduplication in JSON as well, but we get this for free with rkyv.)
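Frame interning can follow the same pattern, reusing the `Frame` struct from above. This is just a sketch with a hypothetical `FrameInterner`, relying on the `Eq` and `Hash` derives on `Frame`:

use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical frame interner built on the `Frame` struct shown earlier.
#[derive(Default)]
struct FrameInterner {
    frames: HashSet<Arc<Frame>>,
}

impl FrameInterner {
    /// Returns the canonical `Arc<Frame>`, inserting the frame on first use.
    fn intern(&mut self, frame: Frame) -> Arc<Frame> {
        if let Some(existing) = self.frames.get(&frame) {
            return Arc::clone(existing);
        }
        let arc = Arc::new(frame);
        self.frames.insert(Arc::clone(&arc));
        arc
    }
}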

In addition to interning, during flamegraph construction in the backend, we merge the common frames of all stack traces, so that

# First stack trace
_start
main
foo
# Second stack trace
_start
main
bar

gets combined to

_start
main        <- main is a node with two children in the flamegraph
{foo, bar}
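A minimal sketch of this merge step, using a simplified mutable tree keyed by frame name (the `MutableNode` type is hypothetical; the real builder works on interned `Frame`s and tracks self samples separately):

use std::collections::BTreeMap;

/// Hypothetical mutable tree node used while constructing a flamegraph.
#[derive(Default)]
struct MutableNode {
    /// Samples of this node, including children.
    num_samples: f32,
    /// Children, keyed by frame name so shared prefixes merge into one node.
    children: BTreeMap<String, MutableNode>,
}

impl MutableNode {
    /// Adds one stack trace (root-first) with the given sample count.
    fn add_trace(&mut self, frames: &[&str], samples: f32) {
        self.num_samples += samples;
        if let Some((first, rest)) = frames.split_first() {
            self.children
                .entry((*first).to_owned())
                .or_default()
                .add_trace(rest, samples);
        }
    }
}

fn main() {
    let mut root = MutableNode::default();
    root.add_trace(&["_start", "main", "foo"], 1.0);
    root.add_trace(&["_start", "main", "bar"], 1.0);
    // `main` is now a single node with the two children `foo` and `bar`.
    assert_eq!(root.children["_start"].children["main"].children.len(), 2);
}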

In-Memory Graph Layout

Another improvement we implemented was to optimize the graph layout for traversals. In the browser and WASM renderer we almost never construct new flamegraphs (and the few cases in which we do are still fast enough), so we have two representations of a flamegraph:

  • one is mutable, and used while constructing the flamegraph
  • one is immutable, and is sent over the network / used for rendering

The immutable representation lays out the whole graph as a single array of nodes:

#[derive(Debug)]
#[derive(Archive, Serialize)]
#[rkyv(attr(derive(Debug)))]
pub struct Node {
    /// Info assigned to this node.
    pub info: NodeInfo,

    /// Depth in the callstack.
    pub depth: u16,

    /// Number to add to current node index to skip this node's children.
    ///
    /// `None` if this node does not have any children.
    #[rkyv(with = rkyv::with::Niche)]
    pub sibling_offs: Option<NonZero<u32>>,
}

#[derive(Debug)]
#[derive(Archive, Serialize)]
pub struct Flamewire {
    pub flamewire: Vec<Node>,
}

Child nodes directly follow their parent node. For example, a flamegraph representing these two call stacks

# First call stack
_start
main
foo
bar

# Second call stack
_start
main
baz

is laid out like this in memory:

{ info: "_start", depth: 0, sibling_offs: Some(4) }
{ info: "main",   depth: 1, sibling_offs: Some(3) }
{ info: "foo",    depth: 2, sibling_offs: Some(1) }
{ info: "bar",    depth: 3, sibling_offs: None }
{ info: "baz",    depth: 2, sibling_offs: None }

The depth can be used to determine whether the current node is a child of the previous node, or whether a new sibling at the given depth needs to be created. The sibling_offs can be used to skip over all children of the current node if we determine we don't want to render them (more on that later).
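As a small illustration, the following sketch walks the flat layout from above and reconstructs every call stack from `depth` alone (using a simplified, hypothetical `FlatNode` in place of the real `Node`):

/// Simplified stand-in for `Node`, carrying only what this sketch needs.
struct FlatNode {
    name: &'static str,
    depth: u16,
}

fn print_stacks(nodes: &[FlatNode]) {
    let mut stack: Vec<&str> = Vec::new();
    for node in nodes {
        // A node at depth d replaces everything at depth >= d on the stack.
        stack.truncate(node.depth as usize);
        stack.push(node.name);
        println!("{}", stack.join(";"));
    }
}

fn main() {
    let nodes = [
        FlatNode { name: "_start", depth: 0 },
        FlatNode { name: "main", depth: 1 },
        FlatNode { name: "foo", depth: 2 },
        FlatNode { name: "bar", depth: 3 },
        FlatNode { name: "baz", depth: 2 },
    ];
    // Prints _start, _start;main, _start;main;foo, _start;main;foo;bar,
    // and _start;main;baz -- one line per node, deepest frame last.
    print_stacks(&nodes);
}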

This layout has several advantages:

  • It is contiguous in memory, allowing efficient iteration without chasing (too many) pointers and making good use of the CPU cache. Some pointers are still chased due to deduplication and interning, but we determined that tradeoff to be worth it for the overall size savings.
  • It adds very little overhead for encoding the graph structure itself (6 bytes per node: 2 for depth and 4 for sibling_offs).

While just iterating this structure is fairly easy, doing anything useful during iteration is a bit more complicated, so we built a few Iterators on top of the raw slice of nodes that make e.g. visually laying out the graph or diffing two flamegraphs easier.

Definition of the `Iterator` helping us with this, plus an example of why iteration of this format is hard

#[derive(PartialEq)]
pub struct DfsState<'a, T> {
    pub node: &'a ArchivedNodeInfo,
    /// The state carried along the node.
    pub context: T,
}

/// An iterator over full DFS states of a flamegraph.
///
/// The [`ReaderDfsIter`] currently only yields a single node, along with its depth.
/// For some use cases, this is enough. However, for others, we need to know parent (and/or child)
/// nodes as well.
///
/// This iterator solves this problem, without introducing additional fields in the wire format,
/// and with minimal memory allocations. We only keep a stack of currently pushed nodes, and
/// whenever one "chain" (stack trace essentially) of nodes is finished, we yield that state.
///
/// A state is finished if the next node has a depth that is the same or less than that of the
/// current node. While this case would be easy to handle, and wouldn't warrant all the complexity
/// of this, there are additional cases that need to be handled.
///
/// If we "unwind" to a node higher up the stack, we might have to yield *multiple* states, to keep
/// sample counts correct. Consider this example:
///
/// state:     [(A, 10), (B, 5), (C, 3)]
/// next_node: (depth=0, D, 20)
///
/// We first need to yield the current state, i.e.
///
/// [(A, 10), (B, 5), (C, 3)]
///
/// However, we also need to yield `[(A, 10), (B, 2)]` (where 2 is the self samples of `B`), and
/// `[(A, 5)]` (again with 5 being the self samples of `A`).
///
/// After that we can finally accept `(D, 20)` and continue processing that and any potential
/// children.
///
/// This iterator also allows you to carry state (`T`) along each node, to avoid having to
/// recalculate this potentially multiple times. This is currently used for filtering a graph, to
/// store whether a node matches a given filter.
///
/// To extract state from a node, the function `F` is used, which can be specified in
/// `new_with_fn`. If you don't care about this, just call `new`, which will not carry state along.
///
/// Please note that while this implements `Iterator`, we can't yield the actual item because the
/// lifetime would reference the struct itself, which we can't express. You can call `state` to
/// access the current state whenever this iterator yields `Some(_)`. Notably `state` will always
/// return `Some(_)` if this iterator returned `Some(_)` from `next`, so you can `unwrap` it.
pub struct DfsStateIter<'a, T, F> {
    done: bool,

    inner: ReaderDfsIter<'a>,
    stack: Vec<DfsState<'a, T>>,
    need_pop: bool,
    to_push: Option<(&'a ArchivedNodeInfo, u16)>,

    extract_fn: F,
}

Culling nodes (doing even less!)

A common theme so far in this implementation has been to do less (e.g. storing less data, or having the graph structure be implicit), and we can continue with this during rendering.

In a flamegraph, a node's width corresponds to how much time it accounts for. At some zoom levels, some nodes aren't wide enough to draw legible text, and they only become wider as one zooms into the graph.

We can use this to our advantage by not drawing nodes whose width would be too small to convey useful information, and by additionally skipping all their children. If the parent node is already too small to draw, all child nodes will be too small as well, because they can never be wider than the parent. This is what we use sibling_offs for in the graph: to skip to the next sibling without iterating over any children.

Additionally, any nodes that are fully to the left of, right of, or below the current viewport can be skipped too, as they and their children will never be visible, for the same reasons. We can't skip nodes above the current viewport, as they might have children that are visible in the viewport, so we still have to iterate over those.
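Putting both checks together, a culling pass over the flat node array can look roughly like the sketch below. The per-node pixel coordinates are hypothetical, vertical culling is left out for brevity, and, following the layout example above, `sibling_offs` counts a node's descendants, so adding `sibling_offs + 1` jumps past the whole subtree.

use std::num::NonZero;

/// Simplified node with precomputed pixel coordinates (hypothetical).
struct CullNode {
    /// Left edge of the node in pixels.
    x: f32,
    /// Width of the node in pixels.
    width: f32,
    /// Offset to this node's last descendant, as in the layout example above.
    sibling_offs: Option<NonZero<u32>>,
}

/// Draws only nodes that are wide enough and horizontally inside the viewport,
/// skipping entire subtrees otherwise.
fn cull_and_draw(
    nodes: &[CullNode],
    viewport_left: f32,
    viewport_right: f32,
    min_width: f32,
    mut draw: impl FnMut(usize),
) {
    let mut i = 0;
    while i < nodes.len() {
        let n = &nodes[i];
        let too_small = n.width < min_width;
        let outside = n.x + n.width < viewport_left || n.x > viewport_right;
        if too_small || outside {
            // Children are never wider than their parent and stay within its
            // horizontal extent, so the whole subtree can be skipped at once.
            i += n.sibling_offs.map_or(0, |o| o.get() as usize) + 1;
        } else {
            draw(i);
            i += 1;
        }
    }
}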

Combining these techniques means that we usually skip most of the nodes in a flamegraph:

  • either a user is not zoomed in, and many nodes are too small to render
  • or a user is zoomed in on a specific section, and we can skip nodes outside of the viewport

The debug mode of our renderer gives exact stats for how many nodes are skipped. Consider this view, which is the same as before, just with the debug menu toggled:

zymtrace AI flamegraph debug mode

The flamegraph has 3276 nodes in total, yet we only draw and process 166 of them.

Similarly, if one zooms in, new nodes become visible (as they're no longer too small), but zooming in also pushes other nodes outside of the viewport:

zooming in with CMD+scroll

In this case, only 49 nodes are processed, drawn, and visible. This, together with the other optimizations, is what allows us to smoothly draw flamegraphs that contain hundreds of thousands to millions of nodes.

The results

As a result of the rewrite, we:

  • massively reduced technical debt, replacing ~19k lines of TypeScript code in a dependency with ~3k lines of Rust
  • achieved significant performance gains: no more lagging or stuttering, even on massive flamegraphs

The performance gains were mainly achieved by:

  • working with less data (interning, binary format, zero-copy deserialization, implicit graph representation)
  • doing less (not rendering nodes outside of the viewport, efficiently skipping over nodes and children that are not rendered)
  • using a language that gives us more direct control over memory allocations and usage

Get started with zymtrace

Curious how zymtrace can optimize your general-purpose and accelerated workloads? We're more than just a continuous CPU profiling solution. zymtrace is a performance optimization platform for modern heterogeneous compute workloads - spanning CPUs, GPUs, and specialized AI accelerators. Book a demo.

We’re hiring! Passionate about Rust, WebAssembly, and high-performance computing? Join our team.
