Alg-Code: # Retrieving the max top 100 numbers from one hundred million of numbers

Wednesday, February 9, 2011

# Retrieving the max top 100 numbers from one hundred million of numbers

A Python Answer: O(n*log(m)),

Run them all through a min-heap of size 100: for each input number k, replace the current min m with max(k, m). Afterwards the heap holds the 100 largest inputs.

A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.

Edit: I fail the interview -- I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest():

import heapq

def funnel(n, numbers):
    if n == 0: return []
    heap = numbers[:n]
    heapq.heapify(heap)
    for k in numbers[n:]:
        if heap[0] < k:
            heapq.heapreplace(heap, k)
    return heap

>>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8])
[5, 8, 6, 9]

Notes:
>>> help(heapq)
Help on module heapq:
NAME
    heapq - Heap queue algorithm (a.k.a. priority queue).
FILE
    c:\python26\lib\heapq.py
DESCRIPTION
    Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for all k, counting elements from 0. For the sake of comparison, non-existing elements are considered to be infinite. The interesting property of a heap is that a[0] is always its smallest element.

    Usage:

    heap = []            # creates an empty heap
    heappush(heap, item) # pushes a new item on the heap
    item = heappop(heap) # pops the smallest item from the heap
    item = heap[0]       # smallest item on the heap without popping it
    heapify(x)           # transforms list into a heap, in-place, in linear time
    item = heapreplace(heap, item) # pops and returns smallest item, and adds
                                   # new item; the heap size is unchanged

This one differs from textbook heap algorithms as follows:
     - use 0-based indexing. This makes the relationship between the index for a node and the indexes for its children slightly less obvious, but is more suitable since Python uses 0-based indexing.
     - Our heappop() method returns the smallest item, not the largest.

These two make it possible to view the heap as a regular Python list without surprises: heap[0] is the smallest item, and heap.sort() maintains the heap invariant!

For all the function in heapq:
>>> heapq.__all__
['heappush', 'heappop', 'heapify', 'heapreplace', 'merge', 'nlargest', 'nsmallest', 'heappushpop']

    nlargest(n, iterable, key=None) #        Find the n largest elements in a dataset.

Equivalent to: sorted(iterable, key=key, reverse=True)[:n]

nsmallest(n, iterable, key=None) # Find the n smallest elements in a dataset.

# Equivalent to: sorted(iterable, key=key)[:n]

Some useful links:
http://stackoverflow.com/questions/2550624/retrieving-the-top-100-numbers-from-one-hundred-million-of-numbers
http://stackoverflow.com/questions/2486093/millions-of-3d-points-how-to-find-the-10-of-them-closest-to-a-given-point/2486341#2486341

♥ ¸¸.•*¨*•♫♪♪♫•*¨*•.¸¸♥

Alg-Code

Pages

Welcome~~~

Wednesday, February 9, 2011

# Retrieving the max top 100 numbers from one hundred million of numbers

No comments:

Post a Comment