done at userlevel since the measurements use the performance counters of the CPU. These counters require access to MSRs which, in turn, requires privileges.

Each modern processor provides its own set of performance counters. On some architectures a subset of the counters is provided by all processor implementations while the others differ from version to version. This makes it hard to give general advice about the use of oprofile. There is not (yet) a higher-level abstraction for the counters which could hide these details. The processor version also controls how many events can be traced at any one time, and in which combination. This adds yet more complexity to the picture.

If the user knows the necessary details about the performance counters, the opcontrol program can be used to select the events which should be counted. For each event it is necessary to specify the overrun number (the number of events which must occur before the CPU is interrupted to record a sample), whether the event should be counted for userlevel and/or kernel code, and finally a unit mask (which selects sub-functions of the performance counter). To count the CPU cycles on x86 and x86-64 processors, one issues the following command:

   opcontrol --event CPU_CLK_UNHALTED:30000:0:1:1

The number 30000 is the overrun number. Choosing a reasonable value is important for the behavior of the system and the quality of the collected data. It is a bad idea to ask to receive data about every single occurrence of the event. For many events, this would bring the machine to a standstill since all it would do is work on the data collection for the event overrun; this is why oprofile enforces a minimum value. The minimum values differ for each event since different events have a different probability of being triggered in normal code. Choosing a very high number reduces the resolution of the profile.
At each overrun oprofile records the address of the instruction which is executed at that moment; for x86 and PowerPC it can, under some circumstances, record the backtrace as well (backtrace support will hopefully be available for all architectures at some point). With a coarse resolution, the hot spots might not get a representative number of hits; it is all about probabilities, which is why oprofile is called a probabilistic profiler. The lower the overrun number, the higher the impact on the system in terms of slowdown, but the higher the resolution.

If a specific program is to be profiled, and the system is not used for production, it is often most useful to use the lowest possible overrun value. The exact minimum for each event can be queried using

   opcontrol --list-events

Using a low overrun value might be problematic if the profiled program interacts with another process, and the slowdown causes problems in the interaction. Trouble can also result if a process has some realtime requirements which cannot be met when it is interrupted often. In this case a middle ground has to be found. The same is true if the entire system is to be profiled for extended periods of time; a low overrun number would mean massive slowdowns. In any case, oprofile, like any other profiling mechanism, introduces uncertainty and inaccuracy.

The profiling has to be started with opcontrol --start and can be stopped with opcontrol --stop. While oprofile is active it collects data; this data is first collected in the kernel and then sent to a userlevel daemon in batches, where it is decoded and written to a filesystem. With opcontrol --dump it is possible to request that all information buffered in the kernel be released to userlevel. The collected data can contain events from different performance counters. The numbers are all kept in parallel unless the user chooses to wipe the stored data between separate oprofile runs. It is possible to accumulate data for the same event from different occasions.
If an event is encountered during different profiling runs, the numbers are added if this is what the user selects. The userlevel part of the data collection process demultiplexes the data. Data for each file is stored separately. It is even possible to differentiate the DSOs used by individual executables and, even, data for individual threads. The data thus produced can be archived using oparchive. The file produced by this command can be transported to another machine and the analysis can be performed there.

What Every Programmer Should Know About Memory, Version 1.0

With the opreport program one can generate reports from the profiling results. Using opannotate it is possible to see where the various events happened: at which instruction and, if the data is available, in which source line. This makes it easy to find hot spots. Counting CPU cycles will point out where the most time is spent (this includes cache misses) while counting retired instructions allows finding where most of the executed instructions are; there is a big difference between the two.

A single hit at an address usually has no meaning. A side effect of statistical profiling is that instructions which are only executed a few times, or even only once, might be attributed with a hit. In such a case it is necessary to verify the results through repetition.

B.2 How It Looks

An oprofile session can look as simple as this:

   $ opcontrol -i cachebench
   $ opcontrol -e INST_RETIRED:6000:0:0:1 --start
   $ ./cachebench
   ...
   $ opcontrol -h

Note that these commands, including the actual program, are run as root. Running the program as root is done here only for simplicity; the program can be executed by any user and oprofile would pick up on it. The next step is analyzing the data.
With opreport we see:

   CPU: Core 2, speed 1596 MHz (estimated)
   Counted INST_RETIRED.ANY_P events (number of instructions retired)
     with a unit mask of 0x00 (No unit mask) count 6000
   INST_RETIRED:6000|
     samples|      %|
   ------------------
      116452 100.000 cachebench

This means we collected a bunch of events; opannotate can now be used to look at the data in more detail. We can see where in the program the most events were recorded. Part of the opannotate --source output looks like this:

                  :static void
                  :inc (struct l *l, unsigned n)
                  :{
                  :  while (n-- > 0)    /* inc total: 13980 11.7926 */
                  :    {
        5  0.0042 :      ++l->pad[0].l;
    13974 11.7875 :      l = l->n;
        1 8.4e-04 :      asm volatile ("" :: "r" (l));
                  :    }
                  :}

That is the inner function of the test, where a large portion of the time is spent. We see the samples spread out over all three lines of the loop. The main reason for this is that the sampling is not always 100% accurate with respect to the recorded instruction pointer.