Welcome~~~


Another blog:
http://fun-st.blogspot.com/

It is easier to write an incorrect program than understand a correct one.

Tuesday, February 15, 2011

'Bit arrays' and 'Bit Vectors'

Some explanations for "Bit Arrays" / "Bit vector" in C/C++ programming:

Basically Bit Array can be treated as a spetical data structure which store bits ( instead of int or char -- using >= 1 byte). That is, each element in Bit Array/Vector is 1 bit with value 0 or 1, or say ON/OFF( or flag ), SET/unSet.
It ofen used as a mapping from the position to some bit value. You can compare it with Bool, which typically use 1 to 4 bytes long.

A very useful function that using Bit Array / Vector is to use the position as index to represent an integer. Give an example as below:
Given millons of number in a file, to find the missing ones.
Suppose we know that there are at most only 1 million numbers, then we use an array of 1 million bit slot ( a bit array with length of 1 million), each slot acts like a position index representing one number.
-1- We first initiate this long array with each slot as ZERO
-2- We start reading the numbers in the file, and use the number as the position index, and change the bit in that position to be one (if it's already one, do not change it)
-3- After we finish reading, the position of the slots which are still ZERO will correspond to numbers that do not exist in the file --- so we have already found those numbers.


for example, if we had only numbers between 0 - 4 the array would look like this in the beginning: 0 0 0 0 0. if we read the numbers : 3, 2, 2 the array would look like this: read 3 --> 0 0 0 1 0. read 3 (again) --> 0 0 0 1 0. read 2 --> 0 0 1 1 0. check the indices of the zeroes: 0,1,4 - those are the missing numbers
BTW, of course you can use integers instead of bits but it may take (depends on the system) 32 times memory

8 Bits = 1 Byte
1 KB  = 1024 Bytes  = 2^10 Byte
1 MB =  1024 KB = 1048576 bytes = 2^20 Byte
1 GB = 1024 MB = 1048576 kb = 1073741824 bytes = 2^30 Byte
1 TB = 1024 Giga Byte (GB) = 2 ^40 Byte
(different with k = 10^3, m = 10^ 6, billion = 10^9)

-----
Check some related articles / discussions
-----
-A> [Problem I]
An interesting interview question from Amazon, some discussions posted at careercup
http://www.careercup.com/question?id=7438661
We have a web ser ver that consists of a log file, which will be updated with the Visitor ID, who ever visits the website. It is a very large file, well formatted and delimited.
We should find all the visitor IDs who visited the website on that day. The constraints are we have 100 GB HD and 100 MB RAM. Virtual memory is not allowed. The time complexity should be less than O(n^2). You can use the whole of 100 MB RAM for any Data Structure.
Note: Consider the Log file is really big compared to 100 MB.
1-) Create a bitmap vector where each bit corresponds to an ID. Assuming number of IDs is less than 2^29~(number of bits in 64 MB)
2-) Read backwards as below suggested
3-) Set the bit corresponding to the current ID
4-) Repeat until previous day
5-) Go through the bit map print the ID's whose bit is set
Complexity O(n) where n is the number of entries.

The key here is that its a well formatted file which means that each line can a be structure like
struct logentry{
char visitorid[20];
char delimiter;
char visitorname[25];
char delimiter;
char date[12]; /* format of date like month dd yyyy */
};
So every time a new user logs in, the structure is instantiated and that value is appended to the log file. So the latest information of current day which we are interested in is always closer to the end of the file. The date is generated using ctime(time(NULL)) and extracting only the second, third and fourth substring which we get after splitting the string output using the delimiter " ".
We can't read the full content of the file since log file size is larger than main memory and since the virtual memory concept can't be assumed.
So we can open the file using fp = fopen("logfile","r") so that we get the file descriptor and make it point to end of file using fseek(fp, 0, SEEK_END). Now since we know the size of the structure we can always fseek(fp, -(sizeof(logentry)+1), SEEK_CUR) to jump to the starting location of the line. The "+1" is for taking into account the newline character after each line. Then use getline(&buffer, &nbytes, fp) to get the current line where nbytes is set as sizeof(logentry) and buffer is the char pointer which points to a dynamically allocated memory to fit the line fetched from the log file.
We split the line specifying the delimiter say ":" or "|". strtok() function can used for the same. Then we look for the third string and if the third string matches the current date. If yes then the visitorid is pushed into the hashmap <key, value> = <visitorid, visitorname>
The termination condition is when we get a line which has a date not equal to current date and output will be all the keys stored in hashmap.

Compare to Hash-map: 
Notes that here using the bit vector rather than hash-map is because that using hash-map still need a lot of space. And actually, we know the limit of possible ID -- there is a range. 
Compare to External Merge sort:

External Merge sort will work. If 1000 sorted files have been created, merge them using a heap (priority queue) of size 1000 and a buffer of size = available memory. Keep flushing the buffer to the merged results file until the duplicate is found.
Bit vector idea works as well - stop as soon you are setting a bit that's already set. However, the number of passes through the 4 billion integer file can be thousands, thereby slowing down things.

-B> [Problem II]
http://www.careercup.com/question?id=3058
Given 1 GB memory, input a file which contains 4 billion integers, output one integer that is not in the file. What if you have only 10 MB memory?
Allocate 32 signed long variables for taking a count of 0 and 1. Total memory taken so far is 8*32 = 256 bytes. Now input the numbers one by one and scan the bit pattern of the number. Highest MSB corresponds to the integer 32 and LSB corresponds to integer 1. Now if the MSB is 1 increment the value of integer 32. If it is 0 decrement it by 1. Do it for all the bits. After you finish scanning all the numbers in the file, go through integers 32 to 1 and that will give you the bit pattern of the number missing in the file. This solution is based on the assumption that the total number of numbers is a power of 2 and the fact that when we write bit pattern for all the numbers all the column in the bit pattern will have the same number of 0s and 1s.

-C> [Problem III]-- A similar one using the bit-vectorGiven a tree in which each node is an integer and an array with a set of integers, determine if all the elements of the array are present in the tree by visiting each node in the tree at most once.
http://www.careercup.com/question?id=208906
For BST, here is my solution:
1. Sort the array (nlogn)
2. Create a bit vector if the size 'n'
3. Now given a root of a tree, find if its value is present in the array(log n)
If yes, update the bit vector with 1(means present), call this recursively for
a. left subtree of root and subarray(start to mid)
b. right subtree of root and subarray(mid, end)
4. For the bit vector, just check if bit vector contains any 0, (few fast solutions are present) for e.g.
if (bit_vector_value > 0 && (bit_vector_value & (bit_vector_value+1) == 0)) ,
then all positions in bit vector are '1'

Some Refs:
http://stackoverflow.com/questions/4604130/c-c-bit-array-or-bit-vector

No comments:

Post a Comment