Cache Line Alignment in C++ — How It Makes Your Program Faster
As discussed in Memory and Data Alignment in C++, memory alignment is commonly used for optimisation. One such method is cache line alignment.
Cache Lines
As a bit of background: data is moved between the CPU cache and main memory in fixed-size blocks, known as cache lines. The typical size of a cache line is 64 bytes!
C++17 provides a portable way of acquiring the cache line size through std::hardware_destructive_interference_size, found in the <new> header.
Note: I did not use the aforementioned constant in this demo, as the GCC version I was using does not have it implemented yet (it arrived in GCC 12.1).
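For reference, once you are on a standard library that does ship the constant, the struct from the demo below could be aligned like this. A minimal sketch, assuming GCC 12.1 or newer:

#include <new>

// Align to the implementation-reported interference size
// instead of hard-coding 64.
struct alignas(std::hardware_destructive_interference_size) A
{
    int mInt = 0;
};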
Sharing Cache Lines
If two pieces of data are close to each other in memory, they are very likely to end up in the same cache line. This can seriously hurt performance when multiple cores need to access that data, because the cores will bounce the cache line between their local caches!
When a core modifies data that happens to live in the same cache line as data being used by another core, the cache coherence protocol invalidates the other core's copy of the line, and that core has to fetch the line again before it can continue. This is commonly known as false sharing.
Do not confuse this with synchronising access, like guarding shared data/memory with a mutex. The cores are accessing completely separate data; the data just happens to live in the same cache line.
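To make this concrete, here is a minimal sketch, assuming 64-byte cache lines, of how you could check whether two addresses fall into the same line (the helper name sameCacheLine is just for illustration):

#include <cstdint>

// Two addresses share a cache line when they fall into the
// same 64-byte block (assuming 64-byte lines).
bool sameCacheLine(const void* a, const void* b)
{
    constexpr std::uintptr_t kLineSize = 64;
    return reinterpret_cast<std::uintptr_t>(a) / kLineSize
        == reinterpret_cast<std::uintptr_t>(b) / kLineSize;
}

For two ints sitting next to each other in an array, this will almost always return true.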
Cache Line Alignment
A way to combat this is to make the data cache line aligned. If the cache line size is 64 bytes, this means allocating the data on a 64-byte boundary (I highly recommend reading my story on memory alignment if this is not very clear to you!). This ensures one piece of data will not live in the same cache line as another.
The performance benefit of this can be demonstrated with a simple application.
#include <thread>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

struct A
{
    int mInt = 0;
};

int main()
{
    // Initialise array
    A* arr = new A[2];

    // Seed rand
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    // Increment the variable repeatedly
    auto process = [](int* num) {
        for (int i = 0; i < 100000000; i++)
            *num = *num + std::rand();
    };

    // Starting time
    auto startTime = std::chrono::high_resolution_clock::now();

    // Spawn and wait for threads to finish
    std::thread t1(process, &arr[0].mInt);
    std::thread t2(process, &arr[1].mInt);
    t1.join();
    t2.join();

    // Finish time
    auto endTime = std::chrono::high_resolution_clock::now();

    // Print results
    std::cout << "Duration: "
              << std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count() / 1000.f
              << " ms" << std::endl;

    // Deallocate
    delete[] arr;
    return 0;
}
Here, we are incrementing two integers in two separate threads, and each thread has its own integer to increment. (If you are building on Linux rather than Cygwin, you will likely need to add -pthread to the g++ command.)
For all the nerds out there, I am using the following compiler :)
$ g++ --version
g++ (GCC) 11.4.0
$ g++ -dumpmachine
x86_64-pc-cygwin
Without cache line alignment, the code runs in 518.364 ms.
$ g++ cachelinealignment.cpp -o cachelinealignment
$ ./cachelinealignment
Duration: 518.364 ms
In order to align to the cache line, I will make the following changes:
// Align to 64-byte boundary
struct alignas(64) A
{
    int mInt = 0;
};

// Just for completeness, assert correct alignment
static_assert(alignof(A) == 64);
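A detail worth spelling out: alignas(64) also pads the struct out to a full 64 bytes, so consecutive elements of the array each start on their own cache line. You can assert that too:

// sizeof(A) is padded up to the alignment, so arr[0] and
// arr[1] can never share a cache line.
static_assert(sizeof(A) == 64);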
With cache line alignment, we see a clear improvement in performance: 265.304 ms.
$ g++ cachelinealignment.cpp -o cachelinealignment
$ ./cachelinealignment
Duration: 265.304 ms
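If you want to see the separation for yourself, a quick standalone check, again assuming 64-byte lines, is to print which line each array element lands on:

#include <cstdint>
#include <iostream>

struct alignas(64) A
{
    int mInt = 0;
};

int main()
{
    A* arr = new A[2];
    const auto line0 = reinterpret_cast<std::uintptr_t>(&arr[0]) / 64;
    const auto line1 = reinterpret_cast<std::uintptr_t>(&arr[1]) / 64;
    // Different quotients mean different cache lines
    std::cout << "arr[0] on line " << line0 << ", arr[1] on line " << line1 << std::endl;
    delete[] arr;
    return 0;
}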
For completeness, here is the final source.
#include <thread>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

struct alignas(64) A
{
    int mInt = 0;
};

int main()
{
    static_assert(alignof(A) == 64);

    // Initialise array
    A* arr = new A[2];

    // Seed rand
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    // Increment the variable repeatedly
    auto process = [](int* num) {
        for (int i = 0; i < 100000000; i++)
            *num = *num + std::rand();
    };

    // Starting time
    auto startTime = std::chrono::high_resolution_clock::now();

    // Spawn and wait for threads to finish
    std::thread t1(process, &arr[0].mInt);
    std::thread t2(process, &arr[1].mInt);
    t1.join();
    t2.join();

    // Finish time
    auto endTime = std::chrono::high_resolution_clock::now();

    // Print results
    std::cout << "Duration: "
              << std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count() / 1000.f
              << " ms" << std::endl;

    // Deallocate
    delete[] arr;
    return 0;
}
Hopefully that made cache line alignment easier to understand! The performance gains from this optimisation only grow as the number of cores accessing data in the same cache line increases.
Feel free to leave a comment if there are any doubts, or something you would like to add!