The zymtrace AI flamegraph renderer was built from the ground up in Rust and WebAssembly (WASM), replacing 19k lines of TypeScript with 3k lines of Rust and achieving over 6× faster rendering, eliminating lag and stutter completely.
This blog post describes how we built the AI flamegraph, why we chose Rust and WASM, and the key techniques that made it fast, efficient, and scalable.
When we launched zymtrace back in December 2024, we prioritized speed over perfection, taking shortcuts to ship quickly and gather feedback. That worked for a while, but as more customers started profiling diverse workloads, issues with our flamegraph and sandwich views became apparent. The tool meant to help diagnose performance bottlenecks was itself struggling. It was frustrating — we had built something useful, but it wasn’t scaling with large workloads.
Our original flamegraph renderer was built on a heavily modified fork of Speedscope, an open-source project. It served us well in the early days, but as our users scaled up their workloads, we ran into serious limitations.
Speedscope is a solid open-source project, and we think they’ve done incredible work. However, it’s a large and complex TypeScript codebase, and as a team of Rust engineers, we chose to play to our strengths and build a renderer in Rust from scratch. Our needs simply evolved in a different direction—one that required us to move faster, reduce technical debt, and fully own the performance path.
Our stack consists of Rust, compiled to WASM, with rendering powered by egui and WebGL. This produces the core flamegraph rendering component, which is embedded into our React-based frontend.
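To sketch what that boundary could look like (the function name and signature below are illustrative, not our actual API), the React side hands the serialized flamegraph bytes to a wasm-bindgen export, and everything from parsing to drawing happens inside the WASM module:
use wasm_bindgen::prelude::*;

// Hypothetical entry point: the React frontend fetches the serialized
// flamegraph and hands the raw bytes to the WASM module, which owns
// parsing and rendering from there.
#[wasm_bindgen]
pub fn load_flamegraph(bytes: &[u8]) -> Result<(), JsValue> {
    if bytes.is_empty() {
        return Err(JsValue::from_str("empty flamegraph payload"));
    }
    // In the real renderer this is where the data would be handed to the
    // egui/WebGL rendering component.
    Ok(())
}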
Our main goal was to reduce the work that has to be done, and to lay out the data in CPU- and memory-friendly ways.
Strings represent by far the most data we store per frame in a stack trace, as can be seen from the struct we introduced for representing frames:
#[derive(Debug, Eq, Hash, PartialEq)]
#[derive(Archive, Serialize)]
#[rkyv(attr(derive(Debug, PartialEq)))]
pub struct Frame {
    pub kind: Arc<String>,
    pub name: Arc<String>,
    pub file_name: Option<Arc<String>>,
    pub executable: Option<Arc<String>>,
    #[rkyv(with = rkyv::with::Niche)]
    pub line: Option<NonZero<u32>>,
    pub file_id: Option<[u8; 16]>,
    pub address: Option<u64>,
    pub inline_idx: u16,
    pub meta: Meta,
}
However, many strings are duplicated across the entire flamegraph. Consider, for example, the following flamegraph, obtained from profiling a CUDA application that launches the same kernel across four different streams:
for (int i = 0; i < 100; i++) {
    matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream1>>>(d_C, d_A, d_B);
    matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream2>>>(d_C, d_A, d_B);
    matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream3>>>(d_C, d_A, d_B);
    matrixMul<<<blocksPerGrid, threadsPerBlock, 0, stream4>>>(d_C, d_A, d_B);
}
Even in this trivial example, most of the file and executable names, function names, kernel names, and stall reasons (stall reasons tell you exactly why an instruction in a kernel was waiting - a feature of our GPU profiler) are relatively long strings that are duplicated across the graph.
When we switched to our custom renderer implemented in Rust, we also changed the backend to intern (i.e., deduplicate) all strings. Not only did this yield performance and memory-usage improvements in the backend, it also plays very well with rkyv, the serialization library we use for communicating with the WASM renderer.
In addition to supporting zero-copy deserialization (so we never allocate temporary Nodes or strings on the heap for the purpose of iteration), rkyv supports storing deduplicated strings (or anything held by an Arc or Rc) only once. This significantly reduces the amount of data we need to send across the network, and the amount of data we need to process in the frontend.
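As an illustration, interning boils down to keeping a set of shared pointers and handing out a clone of the existing Arc whenever a value has been seen before. The sketch below is a simplified stand-in, not our actual backend code:
use std::collections::HashSet;
use std::hash::Hash;
use std::sync::Arc;

/// Simplified interner: stores each unique value exactly once and hands
/// out shared `Arc`s to it. Works for `String` as well as `Frame`, since
/// both are `Eq + Hash`.
pub struct Interner<T: Eq + Hash> {
    values: HashSet<Arc<T>>,
}

impl<T: Eq + Hash> Interner<T> {
    pub fn new() -> Self {
        Self { values: HashSet::new() }
    }

    pub fn intern(&mut self, value: T) -> Arc<T> {
        // `Arc<T>` borrows as `T`, so we can look up by the value itself.
        if let Some(existing) = self.values.get(&value) {
            return Arc::clone(existing);
        }
        let shared = Arc::new(value);
        self.values.insert(Arc::clone(&shared));
        shared
    }
}
Because every duplicate then refers to the same Arc, rkyv's shared-pointer support can serialize the underlying value once and reference it from every place it is used.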
Even for the simple flamegraph shown above, this single optimization reduces the size by ~100× compared to JSON (the JSON is minified, not pretty-printed):
th0rex@void ~/d/wasm-blog-post-25-04-17-5cMY> du -sh raw.rkyv
44K raw.rkyv
th0rex@void ~/d/wasm-blog-post-25-04-17-5cMY> du -sh flamewire.json
4.4M flamewire.json
As many frames are fully identical and duplicated across the flamegraph, just in different places of a stack trace (see, e.g., the kernels in the above flamegraph), we also intern and deduplicate the individual frames (except the samples, as those change frequently and wouldn't benefit from deduplication):
#[derive(Clone, Debug)]
#[derive(Archive, Serialize)]
#[rkyv(attr(derive(Debug, PartialEq)))]
pub struct NodeInfo {
    /// Frame for this node.
    pub frame: Arc<Frame>,
    /// Number of samples, including samples from children.
    pub num_samples: f32,
    /// Number of samples of this node, excluding samples from children.
    pub num_self_samples: f32,
}
(note that we could've implemented e.g. a custom string and frame table for deduplication in JSON as well, but we get this for free with rkyv).
In addition to that, during flamegraph construction in the backend, we merge the common frames of all stack traces, so that
# First stack trace
_start
main
foo
# Second stack trace
_start
main
bar
gets combined to
_start
main <- main is a node with two children in the flamegraph
{foo, bar}
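A minimal sketch of this merging step could look like the following; it uses plain strings as frames and a simple child vector, whereas the real backend works on interned Arc<Frame>s and tracks separate self/total sample counts:
/// A node in the merged flamegraph tree (illustrative only).
struct TreeNode {
    name: String,
    num_samples: u64,
    children: Vec<TreeNode>,
}

impl TreeNode {
    fn new(name: &str) -> Self {
        Self { name: name.to_owned(), num_samples: 0, children: Vec::new() }
    }

    /// Merge one stack trace (outermost frame first) into the tree,
    /// reusing existing children when the frames match.
    fn add_stack(&mut self, stack: &[&str], samples: u64) {
        self.num_samples += samples;
        let Some((first, rest)) = stack.split_first() else {
            return;
        };
        let child = match self.children.iter_mut().position(|c| c.name == *first) {
            Some(idx) => &mut self.children[idx],
            None => {
                self.children.push(TreeNode::new(first));
                self.children.last_mut().unwrap()
            }
        };
        child.add_stack(rest, samples);
    }
}

fn main() {
    let mut root = TreeNode::new("root");
    root.add_stack(&["_start", "main", "foo"], 1);
    root.add_stack(&["_start", "main", "bar"], 1);
    // `main` now has two children (`foo` and `bar`) instead of appearing twice.
}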
Another improvement we implemented was to optimize the graph layout for traversals. In the browser and WASM renderer we almost never construct new flamegraphs (and the few cases in which we do are still fast enough), so we keep two representations of a flamegraph. The immutable representation lays out the whole graph as a single array of nodes:
#[derive(Debug)]
#[derive(Archive, Serialize)]
#[rkyv(attr(derive(Debug)))]
pub struct Node {
    /// Info assigned to this node.
    pub info: NodeInfo,
    /// Depth in the callstack.
    pub depth: u16,
    /// Number to add to current node index to skip this node's children.
    ///
    /// `None` if this node does not have any children.
    #[rkyv(with = rkyv::with::Niche)]
    pub sibling_offs: Option<NonZero<u32>>,
}

#[derive(Debug)]
#[derive(Archive, Serialize)]
pub struct Flamewire {
    pub flamewire: Vec<Node>,
}
Child nodes directly follow their parent node, e.g., a flamegraph representing these two call stacks
# First call stack
_start
main
foo
bar
# Second call stack
_start
main
baz
is laid out like this in memory:
{ info: "_start", depth: 0, sibling_offs: Some(4) }
{ info: "main", depth: 1, sibling_offs: Some(3) }
{ info: "foo", depth: 2, sibling_offs: Some(1) }
{ info: "bar", depth: 3, sibling_offs: None }
{ info: "baz", depth: 2, sibling_offs: None }
The depth can be used to determine if the current node is a child of the previous node, or if a new sibling at the given depth needs to be created, and sibling_offs can be used to skip over any children of the current node if we determine we don't want to render them (more on that later).
This layout has several advantages: iteration is a linear scan over a contiguous, cache-friendly array, there are no pointers to chase, and entire subtrees can be skipped cheaply.
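To make the "just iterating" part concrete, here is a simplified sketch (with a plain stand-in for the archived node type, and offsets interpreted as in the example layout above) of a traversal that can skip whole subtrees:
/// Simplified stand-in for the archived `Node` type.
struct FlatNode {
    name: &'static str,
    depth: u16,
    /// `Some(n)` means: add `1 + n` to the current index to land on the
    /// next sibling, i.e. to skip all of this node's children.
    sibling_offs: Option<u32>,
}

/// Visit every node in DFS order, skipping the subtrees of nodes
/// rejected by `keep`.
fn traverse(nodes: &[FlatNode], mut keep: impl FnMut(&FlatNode) -> bool) {
    let mut i = 0;
    while i < nodes.len() {
        let node = &nodes[i];
        if keep(node) {
            println!("visit {} at depth {}", node.name, node.depth);
            // Children directly follow their parent, so just advance by one.
            i += 1;
        } else {
            // Jump straight past this node's entire subtree.
            i += 1 + node.sibling_offs.map_or(0, |offs| offs as usize);
        }
    }
}

fn main() {
    // The same layout as in the example above.
    let nodes = [
        FlatNode { name: "_start", depth: 0, sibling_offs: Some(4) },
        FlatNode { name: "main", depth: 1, sibling_offs: Some(3) },
        FlatNode { name: "foo", depth: 2, sibling_offs: Some(1) },
        FlatNode { name: "bar", depth: 3, sibling_offs: None },
        FlatNode { name: "baz", depth: 2, sibling_offs: None },
    ];
    // Skip the subtree rooted at `foo`: visits _start, main, baz.
    traverse(&nodes, |node| node.name != "foo");
}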
While just iterating this structure is fairly easy, doing anything useful during iteration is a bit more complicated, so we built a few Iterators on top of the raw slice of nodes that make e.g. visually laying out the graph or diffing two flamegraphs easier.
#[derive(PartialEq)]
pub struct DfsState<'a, T> {
    pub node: &'a ArchivedNodeInfo,
    /// The state carried along the node.
    pub context: T,
}
/// An iterator over full DFS states of a flamegraph.
///
/// The [`ReaderDfsIter`] currently only yields a single node, along with its depth.
/// For some use cases, this is enough. However, for others, we need to know parent (and/or child)
/// nodes as well.
///
/// This iterator solves this problem, without introducing additional fields in the wire format,
/// and with minimal memory allocations. We only keep a stack of currently pushed nodes, and
/// whenever one "chain" (stack trace essentially) of nodes is finished, we yield that state.
///
/// A state is finished if the next node has a depth that is the same or less than that of the
/// current node. While this case would be easy to handle, and wouldn't warrant all the complexity
/// of this, there are additional cases that need to be handled.
///
/// If we "unwind" to a node higher up the stack, we might have to yield *multiple* states, to keep
/// sample counts correct. Consider this example:
///
/// state: [(A, 10), (B, 5), (C, 3)]
/// next_node: (depth=0, D, 20)
///
/// We first need to yield the current state, i.e.
///
/// [(A, 10), (B, 5), (C, 3)]
///
/// However, we also need to yield `[(A, 10), (B, 2)]` (where 2 is the self samples of `B`), and
/// `[(A, 5)]` (again with 5 being the self samples of `A`).
///
/// After that we can finally accept `(D, 20)` and continue processing that and any potential
/// children.
///
/// This iterator also allows you to carry state (`T`) along each node, to avoid having to
/// recalculate this potentially multiple times. This is currently used for filtering a graph, to
/// store whether a node matches a given filter.
///
/// To extract state from a node, the function `F` is used, which can be specified in
/// `new_with_fn`. If you don't care about this, just call `new`, which will not carry state along.
///
/// Please note that while this implements `Iterator`, we can't yield the actual item because the
/// lifetime would reference the struct itself, which we can't express. You can call `state` to
/// access the current state whenever this iterator yields `Some(_)`. Notably `state` will always
/// return `Some(_)` if this iterator returned `Some(_)` from `next`, so you can `unwrap` it.
pub struct DfsStateIter<'a, T, F> {
    done: bool,
    inner: ReaderDfsIter<'a>,
    stack: Vec<DfsState<'a, T>>,
    need_pop: bool,
    to_push: Option<(&'a ArchivedNodeInfo, u16)>,
    extract_fn: F,
}
A common theme so far in this implementation has been to do less (e.g. storing less data, or having the graph structure be implicit), and we can continue with this during rendering.
In a flamegraph, a node's width corresponds to how much time it accounts for. Some nodes don't have enough width to draw legible text at a given zoom level, and only become wider as one zooms into the graph. We can use this to our advantage by not drawing nodes whose width is too small to convey useful information, and additionally skipping all their children: if the parent node is already too small to draw, then all child nodes will be too small as well, because they will never be wider than the parent. This is what we use the sibling_offs for in the graph: to skip to the next sibling without iterating over any children.
Additionally, any nodes that are fully to the left, right, or bottom of the current viewport can be skipped too, as neither they nor their children will ever be visible. We can't skip nodes that are above the current viewport, as those might have children that would be visible in the viewport, so we have to iterate over them.
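Putting both ideas together, the culling check inside the traversal loop could look roughly like the sketch below (with made-up Rect and threshold names for illustration; this is not our renderer code):
// Illustrative only: a node's screen-space rectangle.
struct Rect {
    min_x: f32,
    min_y: f32,
    max_x: f32,
    max_y: f32,
}

/// Minimum on-screen width (in pixels) below which a node and all of its
/// children are skipped entirely (hypothetical threshold).
const MIN_NODE_WIDTH: f32 = 1.0;

/// Decide whether a node's whole subtree can be skipped during rendering.
fn skip_subtree(node: &Rect, viewport: &Rect) -> bool {
    // Too narrow to convey anything useful; children can only be narrower.
    let too_small = (node.max_x - node.min_x) < MIN_NODE_WIDTH;
    // Fully left or right of the viewport; children share the parent's x-range.
    let outside_x = node.max_x < viewport.min_x || node.min_x > viewport.max_x;
    // Fully below the viewport; children are drawn even further down.
    let below = node.min_y > viewport.max_y;
    // Nodes *above* the viewport are not skipped, since their children
    // may still be visible.
    too_small || outside_x || below
}
Whenever this returns true, the traversal can use sibling_offs as described above to jump straight past the node's entire subtree.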
Combining these techniques means that we usually skip most of the nodes in a flamegraph:
The debug mode of our renderer gives exact stats for how many nodes are skipped. Consider this view, which is the same as before, just with the debug menu toggled:
The flamegraph has 3276 nodes in total, yet we only draw and process 166 of them.
Similarly, if one zooms in, new nodes become visible (as they're now not too small anymore), but that zooming in pushes other nodes outside of the viewport:
In this case, only 49 nodes are processed, drawn, and visible. This, combined with the other optimizations, is what allows us to smoothly draw flamegraphs that contain many hundreds of thousands to millions of nodes.
As a result of the rewrite, we replaced roughly 19k lines of TypeScript with about 3k lines of Rust, and rendering became more than 6× faster, eliminating the lag and stutter our users had been seeing. The performance gains were mainly achieved by interning strings and frames, shipping them to the renderer in rkyv's compact zero-copy format, merging common stack frames during flamegraph construction, laying the graph out as a flat, traversal-friendly array, and skipping nodes that are too small or outside the viewport while rendering.
Curious how zymtrace can optimize your general-purpose and accelerated workloads? We're more than just a continuous CPU profiling solution. zymtrace is a performance optimization platform for modern heterogeneous compute workloads - spanning CPUs, GPUs, and specialized AI accelerators. Book a demo.
We’re hiring! Passionate about Rust, WebAssembly, and high-performance computing? Join our team.