Medians and Order Statistics

Zhao Cong
  • The order statistic of a set of elements is the smallest element. For example, the minimum of a set of elements is the first order statistic , and the maximum is the order statistic .
  • A median, informally, is the “halfway point” of the set., medians occur at (the lower median) and (the upper median).
  • the selection problem as follows:
    • Input: A set A of (distinct) numbers and an integer , with .
    • Output: The element that is larger than exactly other elements of . # Minimum and maximum
  • An upper bound of comparisons
    1
    2
    3
    4
    5
    6
    MINIMUM(A) 
    1 min = A[1]
    2 for i = 2 to A.length
    3 if min > A[i]
    4 min = A[i]
    5 return min
  • the algorithm MINIMUM is optimal with respect to the number of comparisons performed
  • Simultaneous minimum and maximum:
    • We can find both the minimum and the maximum using at most comparisons
    • We compare pairs of elements from the input first with each other, and then we compare the smaller with the current minimum and the larger to the current maximum, at a cost of 3 comparisons for every 2 elements
    • If n is odd, we set both the minimum and maximum to the value of the first element, and then we process the rest of the elements in pairs.
    • If n is even, we perform 1 comparison on the first 2 elements to determine the initial values of the minimum and maximum,

Selection in expected linear time

  • We present a divide-and-conquer algorithm for the selection problem
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    RANDOMIZED-SELECT(A,p,r,i) 
    1 if p==r
    2 return A[p]
    3 q = RANDOMIZED-PARTITION(A,P,r)
    4 k=q-p+1
    5 if i==k // the pivot value is the answer
    6 return A[q]
    7 elseif i < k
    8 return RANDOMIZED-SELECT(A, p,p-1,i)
    9 else
    10 return RANDOMIZED-SELECT(A,q+1,r,i-k)
  • RANDOMIZED-SELECT works on only one side of the partition
  • its behavior is determined in part by the output of a random-number generator. ### Analysis has elements (all less than or equal to the pivot) with probability . For , we define indicator random variables where and so, assuming that the elements are distinct, we have When the two subarrays on which we might recurse have sizes and . Hence, we have the recurrence Taking expected values, we have Let us consider the expression . We have If is even, each term from to appears exactly twice in the summation, and if is odd, all these terms appear twice and appears once. Thus, we have: Assume that for some constant that satisfies the initial conditions of the recurrence. We assume that for less than some constant. term is bounded from above by this last expression is at most Thus, if we assume that for , then # Selection in worst-case linear time

The SELECT algorithm determines the smallest of an input array of n > 1 distinct elements by executing the following steps

  • Divide the elements of the input array into groups of 5 elements each and at most one group made up of the remaining mod 5 elements.

  • Find the median of each of the groups by first insertion-sorting the elements of each group (of which there are at most 5) and then picking the median from the sorted list of group elements.

  • Use SELECT recursively to find the median of the medians found in step 2. (If there are an even number of medians, then by our convention, is the lower median.)

  • Partition the input array around the median-of-medians using the modified version of PARTITION. Let be one more than the number of elements on the low side of the partition, so that is the smallest element and there are elements on the high side of the partition.

  • If , then return . Otherwise, use SELECT recursively to find the smallest element on the low side if , or the smallest element on the high side if .

Analysis

Thus, at least half of the groups contribute at least 3 elements that are greater than , except for the one group that has fewer than 5 elements if 5 does not divide n exactly, and the one group containing itself. Discounting these two groups, it follows that the number of elements greater than is at least Similarly, at least elements are less than . Thus, in the worst case, step 5 calls SELECT recursively on at most elements.

As previous assumption: which is at most if so - They are not subject to the lower bound because they manage to solve the selection problem without sorting.

On this page
Medians and Order Statistics