Welcome~~~


Another blog:
http://fun-st.blogspot.com/

It is easier to write an incorrect program than understand a correct one.

Thursday, February 10, 2011

# Select random k elements from a linked list

# Select random k elements from a linked list, when the size of linked list can not be determined
Question;
You have a stream of infinite queries (ie Google searches). Describe how you would find K uniformly distributed samples and write code for it.

----------------Abr-------------------------------
So it will be like this,

In order to randomly select the k element from the linked list, without knowing the length of the linked list first
let's first assume that:

-1- select the first k element into a array (you would like to keep them in an array(kArray), so as for O(1) index purpose)
-2- for the next coming node in the linked lise,
suppose that till know you have seen k+i nodes,
generate a random number     r0<k  ---> so as to randomly chose one from the k-lenght array
generate a random number     r1<k+i----> compare r1/(k+i) with k/(k+i) --> or say, compara r1 with k,
if r1<k, replay the ith new node( actually  the k+i node in the original liked list) with the node in the
array at kArray[r0]

repeat this

------------------------------------------------
From: http://code-slim-jim.blogspot.com/2010/06/reservoir-sampling.html
Reservoir Sampling is an algorithm designed to take K samples from a stream of samples of unknown size in linear time.

How it works.
The idea is simple. Firstly note that until the samples appear you don't know the total number of samples. To avoid this problem you fill up a reservoir of samples and then adjust it as each new sample appears.

First record samples until you have filled the reservoir(K samples). Once you have the first K samples the probability of each sample being in the final result was K/K or 100%.

Now when the K+i sample appears the probability of the new sample appearing in one _PARTICULAR_ output slot becomes 1/K+i. Since there are K potential output slots there is K/(K+i) chance of it being in the output and i/(K+i) chance that it wont be.

So choosing whether or not to add the new sample is easy we toss the dice. If we decide that the new sample is in the output then one of the existing items needs to be removed. Since all the items that where added to output and have equal chance of being present they also have equal chance of being removed. Hence we can simply choose to remove one of the existing samples with a 1/K probability.

This make sense logically be how do we prove it.
  • Assuming the prior stage worked correctly each element in the output should have K/(K+i-1) chance of being present.
  • There is i/(K+i) chance that prior output will be unchanged this round. Hence each element has Ki/(K+i)(K+i-1) chance of being output if the unchanging choice is made for this stage.
  • There is K/(K+i) chance that this something in the output will change
    • And for each element there is (K-1)/K chance it will not be replaced thus a total of (K-1)/K * K/(K+i) * K/(K+i-1) chance it will not be removed.
    • So conversely there is a 1/K * K/(K+i) * K/(K+i-1) chance that it will be replaced.

Hence the net chance of the existing elements making it to the next round is the sum of the chance that nothing changed with the chance that it was not the one that changed.
= Ki/(K+i)(K+i-1) + (K-1)K/(K+i-1)(K+i)
=> (Ki + (K-1)K)/(K+i-1)(K+i)
=> K(i+K-1)/(K+i-1)(K+i)
=> K/(K+i)
Q.E.D.

And of course the chance of the new element making to the output was K/(K+i). As you can see it has nicely balanced out for the possible output items for this stage.

BUT HERE IS THE CATCH.. Note that the mathematics/algorithm considered only the choice between output and non-output state of the samples. IT MADE NO STATEMENT AS TO THE OUTPUTS ORDERING. Hence the output samples need to be randomly shuffled in the final stages of the algorithm or there will be an issue with the ordering of the samples.


This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.

;
//compile using g++
#include 
using namespace std;
 
int reservoirSample(int sample, int* samples, int size, int count)
{
  if(count < size)
    samples[count] = sample;
  else
    if((rand()%count) < size)
      samples[rand()%size] = sample;
   
  return ++count;
}
 
#define SIZE 10
// #define STREAM_AVERAGE 300
#define STREAM_AVERAGE 10
int main()
{
  int count = 0;
  int samples[SIZE];
  int sample;
  int i = 0;
 
  srand(time(NULL));
 
  cout << "Sample Stream: " << endl;
  while(
        (count < SIZE) ||
        (rand()%STREAM_AVERAGE > 0)
        )
    {
      sample = rand()%1000;
      cout << sample << " ";
 
      count = reservoirSample(sample, samples, SIZE, count);
    }
  cout << endl;
  cout << "Total samples: " << count << endl;
 
  cout << "Output samples: " << endl;
  for(i = 0;i < SIZE;i++)
    cout << samples[i] << " ";
  cout << endl;
}

♥ ¸¸.•*¨*•♫♪♪♫•*¨*•.¸¸♥
http://stackoverflow.com/questions/54059/efficiently-selecting-a-set-of-random-elements-from-a-linked-list http://code-slim-jim.blogspot.com/2010/06/reservoir-sampling.html

No comments:

Post a Comment