Description
I created a sample to study TLB access and miss statistics. The sample writes 1 to every 4096th element of an array of 10'000 * 4096 bytes, so I expect to see only 10'000 TLB stores per pass. However, the generated assembly reloads the beginning of the array on every iteration, which results in 10'000 TLB loads in addition to the stores. The code is compiled with -O3.
When I looked at the assembly, I noticed that the for-loop looks like this:
1. load the beginning of the array into a register
2. store 1 at an offset from the beginning of the array
3. increase the index
4. jump back to step 1
Question: Why is step 1 executed on every single iteration? The beginning of the array does not change, so I expected it to be loaded once and the jump to go to step 2.
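To make the expectation concrete, here is a source-level sketch of the access pattern I had in mind: the global pointer is read once into a local, and every store goes through that local copy. The function name `touch_pages_local`, the constant `kPageSize`, and the `pages` parameter are made up for this illustration and are not part of my sample. It only shows the expected loop shape and is not a claim about what GCC generates for it.

```cpp
#include <cstddef>

constexpr std::size_t kPageSize = 4096;  // same value as PAGESIZE in the sample

char* data = nullptr;  // global pointer, standing in for the sample's `data`

// Hypothetical variant of test_function: read the global pointer once,
// then write one byte per page through the local copy.
void touch_pages_local(std::size_t pages) {
    char* const base = data;  // the single load I expected to see (step 1)
    for (std::size_t i = 0; i < pages * kPageSize; i += kPageSize) {
        base[i] = 1;          // one store per page (step 2)
    }
}
```

Calling `touch_pages_local(10000)` after pointing `data` at a buffer of at least 10'000 * 4096 bytes would touch the same bytes as the sample.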
C code
(main just calls this test_function 10K times):
    #define PAGESIZE 4096
    #define PAGES 10000

    char *data = (char *) malloc(PAGES * PAGESIZE);

    inline void test_function()
    {
        for (int i = 0; (i < PAGES * PAGESIZE); i += (PAGESIZE))
        {
            data[i] = 1;
        }
    }
Generated assembly with gcc and -O3
    1070: mov rdx,QWORD PTR [rip+0x2fa1] # 4018 <data>
    1077: mov BYTE PTR [rdx+rax*1],0x1
    107b: add rax,0x1000
    1081: cmp rax,0x2710000
    1087: jne 1070 <main+0x10>
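As a sanity check on the listing, the immediates in the `add` and `cmp` instructions can be compared with the macros from the C code. This is a minimal sketch: the constant names and `count_iterations` are invented for illustration, and it assumes `rax` is zeroed before the loop, since that instruction is not part of the listing.

```cpp
#include <cassert>
#include <cstdint>

// Values of the macros in the C code.
constexpr std::int64_t kPageSize = 4096;   // PAGESIZE
constexpr std::int64_t kPages    = 10000;  // PAGES

// Immediates read off the disassembly.
constexpr std::int64_t kAsmStep  = 0x1000;     // add rax,0x1000
constexpr std::int64_t kAsmBound = 0x2710000;  // cmp rax,0x2710000

static_assert(kAsmStep == kPageSize, "index advances by one page per iteration");
static_assert(kAsmBound == kPages * kPageSize, "loop ends at the array size");

// Replays the index arithmetic of the loop and counts trips through the body.
std::int64_t count_iterations(std::int64_t step, std::int64_t bound) {
    // Otherwise the `jne` exit condition would never be met.
    assert(step > 0 && bound > 0 && bound % step == 0);
    std::int64_t iterations = 0;
    std::int64_t rax = 0;    // assumption: index starts at 0
    do {
        ++iterations;        // pointer load + byte store happen here
        rax += step;         // add rax,step
    } while (rax != bound);  // cmp rax,bound / jne
    return iterations;
}
```

With these values, `count_iterations(kAsmStep, kAsmBound)` gives one pointer load and one store for each of the 10'000 pages.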
perf stats for 100'000 repetitions
Per function call we can see:
- 10K L1 cache loads, 10K L1 cache stores
- 10K TLB loads, 10K TLB stores
- ~0 TLB load misses, 10K TLB store misses
So the load of the array's beginning always hits in the TLB, but it is still performed. Why?
    1000184312      L1-dcache-load:u                                                (66.60%)
    1001155723      L1-dcache-stores:u                                              (66.63%)
    1010296235      dTLB-loads:u                                                    (66.61%)
    1000451484      dTLB-stores:u                                                   (66.69%)
         42124      dTLB-loads-misses:u    # 0.00% of all dTLB cache accesses      (66.79%)
     998312626      dTLB-stores-misses:u                                            (66.68%)
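The per-call figures come from dividing the raw perf totals by the 100'000 repetitions. A minimal sketch of that arithmetic, with constant names and the `per_call` helper made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Raw totals copied from the perf output.
constexpr std::uint64_t kL1Loads        = 1000184312ULL;
constexpr std::uint64_t kL1Stores       = 1001155723ULL;
constexpr std::uint64_t kTlbLoads       = 1010296235ULL;
constexpr std::uint64_t kTlbStores      = 1000451484ULL;
constexpr std::uint64_t kTlbLoadMisses  = 42124ULL;
constexpr std::uint64_t kTlbStoreMisses = 998312626ULL;

// Number of test_function calls in the measured run.
constexpr std::uint64_t kRepetitions = 100000;

// Average number of events per call of test_function.
double per_call(std::uint64_t total, std::uint64_t repetitions) {
    assert(repetitions > 0);
    return static_cast<double>(total) / static_cast<double>(repetitions);
}
```

For example, `per_call(kTlbLoads, kRepetitions)` comes out at roughly 10.1K, while `per_call(kTlbLoadMisses, kRepetitions)` stays below 1.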
Platform
Intel(R) Core(TM) i7-10610U CPU
Ubuntu 22.04.3 LTS
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Compilation line: g++ ./tlb.cpp -O3 -g -o gcc.out