As described in the issue ThreadScope#29, the ThreadScope is really slow
when loading a relatively huge eventlog file, and consumed an unacceptable volume
of memory. High memory usage caused by using incorrect data structure is
fairly common in many Haskell applications, thus at first glance, I focused
on the internal data structure that ThreadScope uses when browsing the source code.
I found that replacing
Vector could bring a significant improvement.
How ThreadScope Processing Events
At first, we need to know how ThreadScope processing event logs. After the user
selects an eventlog file in the dialog window, ThreadScope starts reading the
event log using the
readEventLogFromFile from the ghc-events package to load
the whole file and parse it into a sequence of
Event data. ThreadScope will
sort those events by their
cap field and partition events by their
then build views for UI using
The processing on events is a serials of
operations. For a list that contains millions of events, the operation cannot
be every memory efficient and the performance won’t be very good as well.
Therefore more efficient arrays should be used.
When the user drags the timeline bar, the state of timeline will be updated, and corresponding parts of event information will be shown in the current view. All events will always be kept in memory not matter what time range the user have selected on timeline bar. This is the root cause that ThreadScope cannot handle large event logs within limit memory.
Why List is Bad
) is the most commonly used container to represent sequential data
in Haskell. The
List is implemented using the Linked List data structure. Its
implementation has been heavily optimized and the performance could even beat
Linked List implemented using C! But the overhead is also obvious that many
there are more extra field and more indirections in
List, compared to
especially when meets a large amount of elements.
From the last section, we know that ThreadScope needs to do many mapping, filtering,
grouping by, and even sorting operations on the event logs. It is data intensive.
Under such settings, the
List is not a good choice. A compact data structure
could save a lot of memory and bump up the speed. Then I tried with the efficient
ThreadScope uses the
ghc-events package to read events from eventlog file and parse
the serialized bytes to structured data.
ghc-events returns events as
thus, the first task is returning
Vector Event, rather than
[Event], in ghc-events.
The work is straight forward and can be found in 2067776d64:
- Update the type signatures to use
- Replacing operations on
Listwith its counterpart in the
- Resolve all compiler errors (the powerful type system of Haskell makes the refactor job so easy)
The main event processing logic is in ThreadScope, in
to the power type checking of GHC again, now the task is to eliminate all type mismatch
errors after upgrade to the
vector version of
ghc-events. By replacing the
List with the same thing from the
vector package, we
made a ThreadScope that backed by the more efficient array
Vector without many difficulties.
The work can be found in 1ce3cde310.
Comparison and Conclusion
From the experiment on the prototype with commit 2067776d64 and 1ce3cde310,
for an eventlog file of size about 50MB, containing about 2,000,000 event records,
Vector can reduce the memory usage from about 1.3 GB to 800 MB, that is a
The implementation is very preliminary and can be improved, indicates that there is still a chance to optimize ThreadScope by using a more suitable data structure. However, just doing that is not enough, since for eventlog file with GBs size, keeping all event records all the time is a bad idea and the correct fix for it must be partial loading, processing and rendering, and that will be the key of this GSoC project.