Reservoir Sampling
Reservoir sampling is a family of randomized algorithms for sampling k items from a stream of unknown size n. The reservoir maintains a uniform random sample as new items arrive.
The algorithm:
- Reservoir (sample) initialized with first k elements
- For item i from k to n:
- Randomly replace an element in reservoir with probability k/i
Java example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
| int[] reservoirSample(Iterator stream, int k) {
int[] reservoir = new int[k];
// Initialize reservoir
for (int i = 0; i < k; i++) {
reservoir[i] = stream.next();
}
int i = k;
while (stream.hasNext()) {
int next = stream.next();
int j = rand.nextInt(i+1);
if (j < k) {
reservoir[j] = next;
}
i++;
}
return reservoir;
}
|
C++ example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| vector<int> reservoirSample(vector<int>::iterator iter, int k) {
vector<int> reservoir(k);
// Initialize reservoir
for (int i = k; i < n; i++) {
int j = rand() % (i+1);
if (j < k) {
reservoir[j] = *(iter++);
}
}
return reservoir;
}
|
Python example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
| import random
def reservoir_sample(stream, k):
reservoir = []
for i in range(k):
reservoir.append(next(stream))
i = k
while True:
next_elem = next(stream, None)
if next_elem is None:
break
j = random.randint(0, i)
if j < k:
reservoir[j] = next_elem
i += 1
return reservoir
|
Reservoir sampling provides an unbiased random sample from a stream. Useful for statistics and machine learning.
Reservoir Sampling is an algorithmic technique used for randomly selecting a sample of k
items from a list containing n
items, where n
is either a large or unknown number. The algorithm processes each item only once and uses constant memory, making it highly efficient. The key takeaway is that Reservoir Sampling allows you to maintain a representative sample without knowing the total size of the data stream upfront.
Java Code for Reservoir Sampling
In Java, you can use the java.util.Random
class to generate random numbers for the sampling.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
| import java.util.Random;
public class ReservoirSampling {
public static int[] sample(int[] stream, int k) {
int[] reservoir = new int[k];
Random rand = new Random();
for (int i = 0; i < k; i++) {
reservoir[i] = stream[i];
}
for (int i = k; i < stream.length; i++) {
int j = rand.nextInt(i + 1);
if (j < k) {
reservoir[j] = stream[i];
}
}
return reservoir;
}
}
|
sample(int[] stream, int k)
returns an array of k
randomly selected items from stream
.
C++ Code for Reservoir Sampling
In C++, you can use the <random>
library to generate random numbers.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
| #include <vector>
#include <random>
std::vector<int> sample(const std::vector<int>& stream, int k) {
std::vector<int> reservoir(k);
std::random_device rd;
std::mt19937 gen(rd());
for (int i = 0; i < k; ++i) {
reservoir[i] = stream[i];
}
for (int i = k; i < stream.size(); ++i) {
std::uniform_int_distribution<> dis(0, i);
int j = dis(gen);
if (j < k) {
reservoir[j] = stream[i];
}
}
return reservoir;
}
|
sample(const std::vector<int>& stream, int k)
returns a vector containing k
random samples from stream
.
Python Code for Reservoir Sampling
Python’s random
library can be used for generating random numbers.
1
2
3
4
5
6
7
8
9
10
11
12
13
| import random
def sample(stream, k):
reservoir = []
for i in range(k):
reservoir.append(stream[i])
for i in range(k, len(stream)):
j = random.randint(0, i)
if j < k:
reservoir[j] = stream[i]
return reservoir
|
sample(stream, k)
returns a list of k
randomly selected items from stream
.
All these implementations use the same underlying algorithm. They maintain a reservoir
array that initially stores the first k
elements. For each subsequent element, they generate a random index and replace the corresponding reservoir
value if the index is within k
. The result is a representative random sample of k
items from the original list or stream.