done at userlevel since the measurements use the performance counters of the CPU. These counters require access to MSRs which, in turn, requires privileges.

Each modern processor provides its own set of performance counters. On some architectures a subset of the counters is provided by all processor implementations while the others differ from version to version. This makes it hard to give general advice about the use of oprofile. There is not (yet) a higher-level abstraction for the counters which could hide these details. The processor version also controls how many events can be traced at any one time, and in which combination. This adds yet more complexity to the picture.

If the user knows the necessary details about the performance counters, the opcontrol program can be used to select the events which should be counted. For each event it is necessary to specify the overrun number (the number of events which must occur before the CPU is interrupted to record a sample), whether the event should be counted for userlevel and/or kernel code, and finally a unit mask (which selects sub-functions of the performance counter). To count the CPU cycles on x86 and x86-64 processors, one issues the following command:

   opcontrol --event CPU_CLK_UNHALTED:30000:0:1:1

The number 30000 is the overrun number. Choosing a reasonable value is important for the behavior of the system and the quality of the collected data. It is a bad idea to ask to receive data about every single occurrence of the event. For many events, this would bring the machine to a standstill since all it would do is work on the data collection for the event overrun; this is why oprofile enforces a minimum value. The minimum values differ for each event since different events have a different probability of being triggered in normal code. Choosing a very high number reduces the resolution of the profile.
At each overrun oprofile records the address of the instruction which is executed at that moment; for x86 and PowerPC it can, under some circumstances, record the backtrace as well (backtrace support will hopefully be available for all architectures at some point). With a coarse resolution, the hot spots might not get a representative number of hits; it is all about probabilities, which is why oprofile is called a probabilistic profiler. The lower the overrun number, the higher the impact on the system in terms of slowdown, but the higher the resolution.

If a specific program is to be profiled, and the system is not used for production, it is often most useful to use the lowest possible overrun value. The exact minimum for each event can be queried using

   opcontrol --list-events

Using a low overrun value might be problematic if the profiled program interacts with another process, and the slowdown causes problems in the interaction. Trouble can also result if a process has some realtime requirements which cannot be met when it is interrupted often. In this case a middle ground has to be found. The same is true if the entire system is to be profiled for extended periods of time; a low overrun number would mean massive slowdowns. In any case, oprofile, like any other profiling mechanism, introduces uncertainty and inaccuracy.

The profiling has to be started with opcontrol --start and can be stopped with opcontrol --stop. While oprofile is active it collects data; this data is first collected in the kernel and then sent to a userlevel daemon in batches, where it is decoded and written to a filesystem. With opcontrol --dump it is possible to request that all information buffered in the kernel be released to userlevel. The collected data can contain events from different performance counters. The numbers are all kept in parallel unless the user chooses to wipe the stored data between separate oprofile runs. It is possible to accumulate data for the same event from different occasions.
If an event is encountered during different profiling runs, the numbers are added if this is what the user selects. The userlevel part of the data collection process demultiplexes the data. Data for each file is stored separately. It is even possible to differentiate the DSOs used by individual executables and, even, data for individual threads. The data thus produced can be archived using oparchive. The file produced by this command can be transported to another machine and the analysis can be performed there.

What Every Programmer Should Know About Memory, Version 1.0

With the opreport program one can generate reports from the profiling results. Using opannotate it is possible to see where the various events happened: at which instruction and, if the data is available, in which source line. This makes it easy to find hot spots. Counting CPU cycles will point out where the most time is spent (this includes cache misses) while counting retired instructions allows finding where most of the executed instructions are; there is a big difference between the two.

A single hit at an address usually has no meaning. A side effect of statistical profiling is that instructions which are only executed a few times, or even only once, might be attributed with a hit. In such a case it is necessary to verify the results through repetition.

B.2 How It Looks

An oprofile session can look as simple as this:

   $ opcontrol -i cachebench
   $ opcontrol -e INST_RETIRED:6000:0:0:1 --start
   $ ./cachebench
   ...
   $ opcontrol -h

Note that these commands, including the actual program, are run as root. Running the program as root is done here only for simplicity; the program can be executed by any user and oprofile would pick up on it. The next step is analyzing the data.
With opreport we see:

   CPU: Core 2, speed 1596 MHz (estimated)
   Counted INST_RETIRED.ANY_P events (number of instructions retired)
     with a unit mask of 0x00 (No unit mask) count 6000
   INST_RETIRED:6000|
     samples|      %|
   ------------------
      116452 100.000 cachebench

This means we collected a bunch of events; opannotate can now be used to look at the data in more detail. We can see where in the program the most events were recorded. Part of the opannotate --source output looks like this:

                  :static void
                  :inc (struct l *l, unsigned n)
                  :{
                  :  while (n-- > 0)    /* inc total: 13980 11.7926 */
                  :    {
        5  0.0042 :      ++l->pad[0].l;
    13974 11.7875 :      l = l->n;
        1 8.4e-04 :      asm volatile ("" :: "r" (l));
                  :    }
                  :}

That is the inner function of the test, where a large portion of the time is spent. We see the samples spread out over all three lines of the loop. The main reason for this is that the sampling is not always 100% accurate with respect to the recorded instruction pointer.