Four kinds of method to "select the kth largest or smallest element"
==> goal: make it O(n)
A simpler formulation of a worst-case O(n) algorithm is as follows:
- simply use the linear general selection algorithm (the median-of-medians algorithm described in the sections above) to find the kth element in worst-case O(n) time
- use the partition operation from quicksort (which is O(n)) to split the list into the elements less than and greater than the kth element
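The two steps above can be sketched in Python as follows. The function and helper names (`partition`, `median_of_five`, `pick_pivot`, `mom_select`, `smallest_k`) are mine, not from the source; as one simplification, the median-of-medians pivot is folded directly into the selection loop, so the final partition around the kth element happens as a side effect of the select itself.

```python
def partition(a, lo, hi, pivot_index):
    """Lomuto partition of a[lo..hi] around a[pivot_index]; returns the pivot's final index."""
    a[pivot_index], a[hi] = a[hi], a[pivot_index]
    pivot, store = a[hi], lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    return store

def median_of_five(a, lo, hi):
    """Index of the median of the (at most five) elements a[lo..hi]."""
    idx = sorted(range(lo, hi + 1), key=lambda i: a[i])
    return idx[(hi - lo) // 2]

def pick_pivot(a, lo, hi):
    """Median-of-medians pivot: guarantees the pivot's rank is within a constant fraction of the middle."""
    if hi - lo < 5:
        return median_of_five(a, lo, hi)
    # Move the median of each group of five to the front of the range...
    for i in range(lo, hi + 1, 5):
        m = median_of_five(a, i, min(i + 4, hi))
        dest = lo + (i - lo) // 5
        a[m], a[dest] = a[dest], a[m]
    # ...then recursively select the median of those medians.
    n_medians = (hi - lo) // 5 + 1
    return mom_select(a, lo, lo + n_medians - 1, lo + (n_medians - 1) // 2)

def mom_select(a, lo, hi, k):
    """Place the element of (0-based) rank k at index k; worst-case O(n)."""
    while lo < hi:
        p = partition(a, lo, hi, pick_pivot(a, lo, hi))
        if k == p:
            return k
        lo, hi = (p + 1, hi) if k > p else (lo, p - 1)
    return lo

def smallest_k(a, k):
    """Return the k smallest elements of a, in no particular order."""
    b = list(a)
    if k >= len(b):
        return b
    if k > 0:
        mom_select(b, 0, len(b) - 1, k - 1)  # leaves the k smallest in b[:k]
    return b[:k]
```

Because each partition pins the pivot in its final sorted position and the loop keeps rank k inside the active range, the array ends up split around the kth element exactly as the two-step description requires.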
ONE: The simplest is the quicksort variation: there is no need to recursively sort partitions that contain only elements that would fall after the kth place in the final sorted order. Thus, if the pivot falls in position k or later, we recurse only on the left partition:
 function quicksortFirstK(list, left, right, k)
     if right > left
         select pivotIndex between left and right
         pivotNewIndex := partition(list, left, right, pivotIndex)
         quicksortFirstK(list, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             quicksortFirstK(list, pivotNewIndex+1, right, k)
The resulting algorithm requires an expected time of only O(n + k log k), and is quite efficient in practice, especially if we substitute selection sort when k becomes small relative to n. However, the worst-case time complexity is still O(n²) under consistently bad pivot selection. Pivot selection along the lines of the worst-case linear-time selection algorithm could be used to obtain better worst-case performance.
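The pseudocode above translates directly to Python. This is a sketch with names of my choosing (`quicksort_first_k`, a Lomuto-style `partition`) and a random pivot, so the O(n + k log k) bound is expected-time only:

```python
import random

def partition(a, left, right, pivot_index):
    """Lomuto partition of a[left..right]; returns the pivot's final position."""
    a[pivot_index], a[right] = a[right], a[pivot_index]
    pivot, store = a[right], left
    for i in range(left, right):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[right] = a[right], a[store]
    return store

def quicksort_first_k(a, left, right, k):
    """Sort a[left..right] only as far as needed so that a[:k] ends up sorted."""
    if right > left:
        pivot_index = random.randrange(left, right + 1)
        pivot_new = partition(a, left, right, pivot_index)
        quicksort_first_k(a, left, pivot_new - 1, k)
        if pivot_new < k:  # the right partition matters only if it can hold indices below k
            quicksort_first_k(a, pivot_new + 1, right, k)
```

After `quicksort_first_k(a, 0, len(a) - 1, k)`, the slice `a[:k]` holds the k smallest elements in sorted order, while the rest of the array is only partially ordered.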
TWO: Better still, we may not require those k items to be themselves sorted. Dropping that requirement means we can ignore all partitions that fall entirely before or after the kth place, and recurse only into the partition that actually contains the kth element itself.
 function quickfindFirstK(list, left, right, k)
     if right > left
         select pivotIndex between left and right
         pivotNewIndex := partition(list, left, right, pivotIndex)
         if pivotNewIndex > k  // new condition
             quickfindFirstK(list, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             quickfindFirstK(list, pivotNewIndex+1, right, k)
The resulting algorithm requires an expected time of only O(n), which is the best any such algorithm can hope for, since every element must be examined at least once.
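A Python sketch of this variant (again with illustrative names and a random pivot, reusing the same Lomuto partition; expected O(n), not worst-case):

```python
import random

def partition(a, left, right, pivot_index):
    """Lomuto partition of a[left..right]; returns the pivot's final position."""
    a[pivot_index], a[right] = a[right], a[pivot_index]
    pivot, store = a[right], left
    for i in range(left, right):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[right] = a[right], a[store]
    return store

def quickfind_first_k(a, left, right, k):
    """Rearrange a so that a[:k] holds the k smallest elements, in no particular order."""
    if right > left:
        pivot_index = random.randrange(left, right + 1)
        pivot_new = partition(a, left, right, pivot_index)
        if pivot_new > k:  # the kth place lies entirely in the left partition
            quickfind_first_k(a, left, pivot_new - 1, k)
        elif pivot_new < k:  # the kth place lies in the right partition
            quickfind_first_k(a, pivot_new + 1, right, k)
```

Unlike `quicksort_first_k`, at most one recursive call is made per level, which is what brings the expected cost down from O(n + k log k) to O(n).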
A: Another simple method is to add each element of the list into an ordered data structure holding at most k elements, such as a heap or self-balancing binary search tree. Whenever the data structure has more than k elements, we remove the largest element, which can be done in O(log k) time. Each insertion also takes O(log k) time, resulting in O(n log k) time overall.
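With Python's `heapq` (a min-heap), the bounded structure can be simulated as a max-heap of the k smallest elements seen so far by negating values. The function name `k_smallest` is mine:

```python
import heapq

def k_smallest(iterable, k):
    """Return the k smallest elements in sorted order, in O(n log k) time."""
    heap = []  # negated values: a max-heap of the k smallest seen so far
    for x in iterable:
        heapq.heappush(heap, -x)       # O(log k) insertion
        if len(heap) > k:
            heapq.heappop(heap)        # O(log k) removal of the current largest
    return sorted(-v for v in heap)
```

This streams through the input once and never stores more than k + 1 elements, which also makes it suitable when n is too large to hold in memory.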
B: It is possible to transform the list into a heap in Θ(n) time, and then traverse the heap using a modified breadth-first search that places the elements in a priority queue (instead of the ordinary queue normally used in a BFS), terminating the scan after traversing exactly k elements. As the queue size remains O(k) throughout the traversal, this takes O(k log k) time to complete, leading to a time bound of O(n + k log k) for this algorithm.
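The traversal can be sketched with `heapq` by walking the implicit binary tree of the heap array best-first; each popped node adds at most two children to the frontier, keeping it O(k). The function name `k_smallest_via_heap` is mine:

```python
import heapq

def k_smallest_via_heap(a, k):
    """Heapify in Theta(n), then pop exactly k nodes best-first: O(n + k log k)."""
    h = list(a)
    heapq.heapify(h)                   # Theta(n) heap construction
    out = []
    frontier = [(h[0], 0)] if h else []  # (value, index in heap array)
    while len(out) < k and frontier:
        val, i = heapq.heappop(frontier)
        out.append(val)
        for child in (2 * i + 1, 2 * i + 2):  # children of node i in the array
            if child < len(h):
                heapq.heappush(frontier, (h[child], child))
    return out
```

Because a parent in a min-heap never exceeds its children, the best-first pops come out in nondecreasing order, so `out` is the k smallest elements, already sorted.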
C: We can maintain a running order statistic in O(log n) time per update using skip lists. Skip lists are sorted data structures that allow insertion, deletion and indexed retrieval in O(log n) time. Thus, for any given percentile, we can insert a new element into (and possibly delete an old element from) the list in O(log n) time, calculate the corresponding index, and access the percentile value in O(log n) time. This is useful, for example, for calculating a running median.
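The Python standard library has no skip list, so the sketch below stands in for one with a plain sorted list plus `bisect`: the interface (insert, delete, indexed retrieval) is the same, but `list` insertion and deletion are O(n) rather than the O(log n) an indexable skip list would give. The class name `RunningPercentile` and its parameters are mine:

```python
import bisect

class RunningPercentile:
    """Running percentile over a sliding window.

    A real indexable skip list would make each add() O(log n); the sorted
    list + bisect used here only illustrates the three operations involved.
    """
    def __init__(self, window, percentile=0.5):
        self.window = window
        self.percentile = percentile
        self.history = []   # arrival order, used for eviction
        self.sorted = []    # kept sorted for indexed retrieval

    def add(self, x):
        self.history.append(x)
        bisect.insort(self.sorted, x)                              # insert step
        if len(self.history) > self.window:
            old = self.history.pop(0)
            del self.sorted[bisect.bisect_left(self.sorted, old)]  # delete step
        i = int(self.percentile * (len(self.sorted) - 1))          # indexed retrieval
        return self.sorted[i]
```

With an odd window size and `percentile=0.5`, each `add` returns the median of the most recent `window` values.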