Using vector in ThreadScope to Reduce Memory Usage
As described in issue ThreadScope#29, ThreadScope is really slow when loading a relatively large eventlog file, and it consumes an unacceptable amount of memory. High memory usage caused by an unsuitable data structure is fairly common in Haskell applications, so at first glance I focused on the internal data structures that ThreadScope uses while browsing the source code. I found that replacing `List` with `Vector` could bring a significant improvement.
How ThreadScope Processes Events
First, we need to know how ThreadScope processes event logs. After the user selects an eventlog file in the dialog window, ThreadScope reads the event log using `readEventLogFromFile` from the ghc-events package, which loads the whole file and parses it into a sequence of `Event` data. ThreadScope then sorts the events by their `cap` field, partitions them by `cap`, and builds views for the UI using `mkDurationTree`, `mkEventTree`, and `mkSparkTree`.
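As a minimal sketch of that first step, assuming the list-based ghc-events API of the time (the eventlog file name here is just an example):

```haskell
import GHC.RTS.Events (Data (..), EventLog (..), readEventLogFromFile)

-- Load and parse a whole eventlog file, then count the parsed events.
main :: IO ()
main = do
  result <- readEventLogFromFile "program.eventlog"
  case result of
    Left err             -> putStrLn ("parse error: " ++ err)
    Right (EventLog _ d) -> print (length (events d))
```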
The processing of events is a series of `map`, `filter`, `groupBy`, and `sort` operations. For a list containing millions of events, these operations cannot be very memory efficient, and the performance will not be good either. Therefore, more efficient arrays should be used.
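To make the shape of that work concrete, here is a simplified list-based sketch of the per-cap grouping, with a hypothetical `Ev` record standing in for the real `Event` type:

```haskell
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)
import Data.Word (Word64)

-- Hypothetical stand-in for ghc-events' Event type.
data Ev = Ev { evCap :: Maybe Int, evTime :: Word64 }

-- Sort events by capability, then split them into one list per cap.
-- With millions of events, every intermediate list here is another
-- chain of cons cells kept alive on the heap.
groupByCap :: [Ev] -> [[Ev]]
groupByCap = groupBy ((==) `on` evCap) . sortBy (comparing evCap)
```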
When the user drags the timeline bar, the state of the timeline is updated, and the corresponding parts of the event information are shown in the current view. All events are always kept in memory, no matter what time range the user has selected on the timeline bar. This is the root cause of why ThreadScope cannot handle large event logs within limited memory.
Why List is Bad
`List` (or `[]`) is the most commonly used container for sequential data in Haskell. `List` is implemented as a linked list. Its implementation has been heavily optimized, and its performance can even beat a linked list implemented in C! But the overhead is also obvious: compared to `Vector`, a `List` carries extra fields and more indirections, which hurts especially with a large number of elements.
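A small sketch of the difference, assuming the standard `vector` package (in both its boxed and unboxed flavors):

```haskell
import qualified Data.Vector         as V
import qualified Data.Vector.Unboxed as VU

-- A list of Ints is a chain of (:) cells: every element costs a cell
-- holding a pointer to a boxed Int plus a pointer to the tail.
xs :: [Int]
xs = [1 .. 10]

-- A boxed Vector stores the element pointers in one contiguous array,
-- removing the per-element tail pointer.
vs :: V.Vector Int
vs = V.fromList xs

-- An unboxed Vector stores the raw Ints themselves, with no
-- per-element pointers at all.
us :: VU.Vector Int
us = VU.fromList xs

main :: IO ()
main = print (length xs, V.length vs, VU.length us)
```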
From the last section, we know that ThreadScope needs to do many mapping, filtering, grouping, and even sorting operations on the event logs. It is data intensive. Under such settings, `List` is not a good choice; a compact data structure could save a lot of memory and speed things up. So I tried the efficient array package `vector`.
ThreadScope uses the `ghc-events` package to read events from the eventlog file and parse the serialized bytes into structured data. `ghc-events` returns events as `[Event]`, so the first task was to make `ghc-events` return `Vector Event` rather than `[Event]`.
The work is straightforward and can be found in 2067776d64:

- Update the type signatures to use `Vector`
- Replace operations on `List` with their counterparts from the `vector` package (see the sketch after this list)
- Resolve all compiler errors (the powerful type system of Haskell makes the refactoring easy)
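As an illustration of what the second step looks like (the pipeline itself is made up for this example; the pattern is what matters):

```haskell
import qualified Data.Vector as V

-- List version of a small pipeline.
doubledEvens :: [Int] -> [Int]
doubledEvens = map (* 2) . filter even

-- The same pipeline on Vector: each list function has a direct
-- counterpart in Data.Vector, so the shape of the code is unchanged.
doubledEvensV :: V.Vector Int -> V.Vector Int
doubledEvensV = V.map (* 2) . V.filter even
```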
The main event processing logic in ThreadScope lives in Events/ReadEvents.hs. Thanks to the powerful type checking of GHC again, the task now is to eliminate all type mismatch errors after upgrading to the `vector` version of `ghc-events`. By replacing the `map`/`reduce`/`groupBy`/`sort` on `List` with the same operations from the `vector` package, we get a ThreadScope backed by the more efficient `Vector` array without much difficulty.
The work can be found in 1ce3cde310.
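One detail worth noting: `Data.Vector` has no pure sort of its own, so sorting is one place where the replacement is not one-to-one. A common approach (and what this sketch assumes) is the vector-algorithms package, run over a mutable copy via `V.modify`:

```haskell
import Data.Ord (comparing)
import qualified Data.Vector                  as V
import qualified Data.Vector.Algorithms.Intro as Intro

-- Sort a vector by a key, e.g. ordering events by their cap field.
-- V.modify copies the vector once, sorts the copy in place, and
-- freezes it back into an immutable Vector.
sortOnKey :: Ord k => (a -> k) -> V.Vector a -> V.Vector a
sortOnKey key = V.modify (Intro.sortBy (comparing key))
```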
Comparison and Conclusion
From the experiment on the prototype with commits 2067776d64 and 1ce3cde310, for an eventlog file of about 50 MB containing about 2,000,000 event records, using `Vector` reduces the memory usage from about 1.3 GB to 800 MB. That is a great win!
The implementation is still preliminary and can be improved, which indicates that there is further room to optimize ThreadScope with a more suitable data structure. However, that alone is not enough: for eventlog files gigabytes in size, keeping all event records in memory all the time is a bad idea, and the correct fix must be partial loading, processing, and rendering. That will be the key of this GSoC project.