#include <cstdint> #include <random> class SkipList::RandomGenerator{ public: bool operator()(){ return distr_(gen_); } private: std::mt19937 gen_{ (uint32_t) time(nullptr) }; std::bernoulli_distribution distr_{ 0.5 }; }; SkipList::RandomGenerator SkipList::rand_; //... auto SkipList::getRandomHeight_() -> int{ int h = 1; while( h < height_ && rand_() ) h++; return h; }
The problem here is that rand_ is called in the loop. I did measurements and it turn out that this can be much faster if I do it with single rand_() call.
In order to do that I need to find "least significant bit set" or to "count leading zeroes".
Both functions are very similar. For example let suppose we have value of 100:
Binary: 0 1 1 0 0 1 0 0 Pos: 7 6 5 4 3 2 1 0
In this case, because third bit is set, the function should return "2".
Naive implementation
#include <cstdint> template<typename T> int lsb(T const x){ for(uint8_t i = 0; i < sizeof(T) * 8; ++i) if (x & (1 << i) ) return i; return 0; }
It is naive, but note this is portable and works for all unsigned types.
However performance of this function is O(n). We sure can do better. Can we?
Pre-computed table
We could do pre-computed lookup table and check from there.For 32bit integers table must be 4GB in size. Not very practical. Because of that we can do it byte by byte:
#include <cstdint> struct Lookup8{ constexpr Lookup8(){ for(uint8_t i = 0; i < 0xFF; ++i) data[i] = calc(i); } constexpr uint8_t operator[](uint8_t const x) const{ return data[x]; } private: constexpr static uint8_t calc(uint8_t const x){ for(uint8_t i = 0; i < 8; ++i) if (x & (1 << i) ) return i; return 0; } private: uint8_t data[0xFF] = {}; }; constexpr Lookup8 l8; template<typename T> int lsb(T const x){ int j = 0; for(uint8_t i = 0; i < sizeof(T); ++i){ uint8_t const byte = ( x >> j ) & 0xFF; if (byte) return l8[byte] + j; j += 8; } return 0; }
This is great, but there should be better solution? ...and after some google-ing it is this...
Log2 + bitmask
#include <cstdint> int log2(uint64_t const x); constexpr uint64_t lsb_(uint64_t const x){ return x & ( ~x + 1 ); } constexpr int lsb(uint64_t const x){ return log2(lsb_(x)); }
Wow! That's really short.
...But is log2() O(1) ?
It turns out it is a way to do it in O(1). Algorithm uses something called deBruijn sequence.
De Bruijn sequence
#include <cstdint> constexpr int log2(uint64_t n){ constexpr uint8_t tab64[64] = { 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3, 61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4, 62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21, 56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5, 63 }; n |= n >> 1; n |= n >> 2; n |= n >> 4; n |= n >> 8; n |= n >> 16; n |= n >> 32; auto const pos = (n * 0x03F6EAF2CD271461) >> 58; return tab64[pos]; } constexpr int log2(uint32_t n){ constexpr uint8_t tab32[32] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; n |= n >> 1; n |= n >> 2; n |= n >> 4; n |= n >> 8; n |= n >> 16; auto const pos = (n * 0x07C4ACDD ) >> 27; return tab32[pos]; } template<typename T> constexpr T lsb_(T const x){ return x & ( ~x + 1 ); } template<typename T> constexpr int lsb(T const x){ return log2(lsb_(x)); }Wait? The code complexity is not O(1) !
If you look at "n |= n >> 1", this is nothing else but unrolled loop.
Complexity is at best O(log(n)).
Time for testing
For testing I used something like this code:#include <cstdint> int main(){ int x = 0; for(uint64_t i = 0; i <= 100000000; ++i) x += lsb(i); std::cout << x << '\n'; }Then I start it with time. Here are the results:
Naive | 0.452 sec Lookup | 0.550 sec De Bruijn | 2.394 secDe Bruijn solution is slowest.
Naive solution is fastest.
Bonus
- GCC have non standard function ffs() defined in "string.h" / "cstring".
- Linux have similar function defined somewhere too.
No comments:
Post a Comment