#include <cstdint>
#include <random>
class SkipList::RandomGenerator{
public:
bool operator()(){
return distr_(gen_);
}
private:
std::mt19937 gen_{ (uint32_t) time(nullptr) };
std::bernoulli_distribution distr_{ 0.5 };
};
SkipList::RandomGenerator SkipList::rand_;
//...
auto SkipList::getRandomHeight_() -> int{
int h = 1;
while( h < height_ && rand_() )
h++;
return h;
}
The problem here is that rand_ is called in the loop. I did measurements and it turn out that this can be much faster if I do it with single rand_() call.
In order to do that I need to find "least significant bit set" or to "count leading zeroes".
Both functions are very similar. For example let suppose we have value of 100:
Binary: 0 1 1 0 0 1 0 0
Pos: 7 6 5 4 3 2 1 0
In this case, because third bit is set, the function should return "2".
Naive implementation
#include <cstdint>
template<typename T>
int lsb(T const x){
for(uint8_t i = 0; i < sizeof(T) * 8; ++i)
if (x & (1 << i) )
return i;
return 0;
}
It is naive, but note this is portable and works for all unsigned types.
However performance of this function is O(n). We sure can do better. Can we?
Pre-computed table
We could do pre-computed lookup table and check from there.For 32bit integers table must be 4GB in size. Not very practical. Because of that we can do it byte by byte:
#include <cstdint>
struct Lookup8{
constexpr Lookup8(){
for(uint8_t i = 0; i < 0xFF; ++i)
data[i] = calc(i);
}
constexpr uint8_t operator[](uint8_t const x) const{
return data[x];
}
private:
constexpr static uint8_t calc(uint8_t const x){
for(uint8_t i = 0; i < 8; ++i)
if (x & (1 << i) )
return i;
return 0;
}
private:
uint8_t data[0xFF] = {};
};
constexpr Lookup8 l8;
template<typename T>
int lsb(T const x){
int j = 0;
for(uint8_t i = 0; i < sizeof(T); ++i){
uint8_t const byte = ( x >> j ) & 0xFF;
if (byte)
return l8[byte] + j;
j += 8;
}
return 0;
}
This is great, but there should be better solution? ...and after some google-ing it is this...
Log2 + bitmask
#include <cstdint>
int log2(uint64_t const x);
constexpr uint64_t lsb_(uint64_t const x){
return x & ( ~x + 1 );
}
constexpr int lsb(uint64_t const x){
return log2(lsb_(x));
}
Wow! That's really short.
...But is log2() O(1) ?
It turns out it is a way to do it in O(1). Algorithm uses something called deBruijn sequence.
De Bruijn sequence
#include <cstdint>
constexpr int log2(uint64_t n){
constexpr uint8_t tab64[64] = {
0, 58, 1, 59, 47, 53, 2, 60,
39, 48, 27, 54, 33, 42, 3, 61,
51, 37, 40, 49, 18, 28, 20, 55,
30, 34, 11, 43, 14, 22, 4, 62,
57, 46, 52, 38, 26, 32, 41, 50,
36, 17, 19, 29, 10, 13, 21, 56,
45, 25, 31, 35, 16, 9, 12, 44,
24, 15, 8, 23, 7, 6, 5, 63
};
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n |= n >> 32;
auto const pos = (n * 0x03F6EAF2CD271461) >> 58;
return tab64[pos];
}
constexpr int log2(uint32_t n){
constexpr uint8_t tab32[32] = {
0, 9, 1, 10, 13, 21, 2, 29,
11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7,
19, 27, 23, 6, 26, 5, 4, 31
};
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
auto const pos = (n * 0x07C4ACDD ) >> 27;
return tab32[pos];
}
template<typename T>
constexpr T lsb_(T const x){
return x & ( ~x + 1 );
}
template<typename T>
constexpr int lsb(T const x){
return log2(lsb_(x));
}
Wait? The code complexity is not O(1) !If you look at "n |= n >> 1", this is nothing else but unrolled loop.
Complexity is at best O(log(n)).
Time for testing
For testing I used something like this code:#include <cstdint>
int main(){
int x = 0;
for(uint64_t i = 0; i <= 100000000; ++i)
x += lsb(i);
std::cout << x << '\n';
}
Then I start it with time. Here are the results: Naive | 0.452 sec
Lookup | 0.550 sec
De Bruijn | 2.394 sec
De Bruijn solution is slowest.Naive solution is fastest.
Bonus
- GCC have non standard function ffs() defined in "string.h" / "cstring".
- Linux have similar function defined somewhere too.