## Find Median from Data Stream

Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle value.

Examples:

[2,3,4] , the median is 3

[2,3], the median is (2 + 3) / 2 = 2.5

Design a data structure that supports the following two operations:

void addNum(int num) - Add a integer number from the data stream to the data structure.

double findMedian() - Return the median of all elements so far.

For example:

add(1)

add(2)

findMedian() -> 1.5

add(3)

findMedian() -> 2

## Analysis:

The problem give us a data stream instead of a regular data array, and ask to get the median of current data. An obvious but not efficient way of getting median is by either sorting the data ( O( n log n ) time ) or inserting the data in ordered data and ( O( n ) time ). In such scenarios, it's quite slow when findMedian is called very often. Since we have already all the relations between previous data, it would be much more efficient if we could utilize that when adding new data.

Here please refer my previous post (see heap sort) for the basic concept in heap, and it is this very useful data structure that can solve this problem very efficiently ( O( log n ) time to add data, O( 1 ) time to find median).

Specifically in this problem, we still have to figure out how to utilze heap to find the median.

The definition of median can be roughly viewed as the middle one or two values of an **ordered** list. In other words, the ordered list can be divided into two parts, one part contains all the values less than the median, while values in the other part are all greater than the median. No matter the size of list is even or not, **the size of two sub-lists divided by the median must be the same**.

OK, now let's see what a heap can help us. We know that a **min heap** is saving the smallest vaule at root, all the other values are greater than the root. A **max heap** is saving the largest value at root, with all the child nodes smaller than the root.

Thus, a max heap and a min heap now seems perfect fit for the two sub lists divided by the median. Even, for the median, we don't have to know the full order of the sub lists. We only care about how many elements are greater or smaller than the median.

In order to keep the two heap balanced, once we met a new number, intuitively we should add to each heap alternately one by one. However, we don't want to see any numbers in the max heap is greater than any number in min heap. So we have to adjust the heap after each insertion. Sepecifically,

**If current size is even, we try to add the new element to the min heap**- If the new number to be added is greater than the root in min heap. OK, we push the number into min heap.
- However, if the new number to be added is smaller than the largest number in the other heap (root node in max heap), this number should be in the max heap but not the min heap we intend to add to. So we add the new number to max heap instead of min heap. But, we still want to keep both heaps balanced in size. So we pop the top number in max heap (which we added the new element into), and push it into the min heap (which we intended to add one into in this round).

**If current size is odd, we try to add the new element to the max heap**- If the new number to be added is smaller than the root in max heap. OK, we push the number into max heap.
- Similarly, if the new number to be added is greater than the smallest number in the other heap. We stil have to the same procedure according to the above case.

- To get the median, if current data size is even, we choose the root of both heap and take average. If current data size is odd, we just return the root of min heap.

For the implementation, in C++, we can use priority queue, which is implemented using heap, it has methods such as push, pop, and top, and we don't have to worry about the node downshift or deletion in detail. The constructor of priority queue provides arguments to set the std::geater

In python, for simplicity, I use headq module which could apply heap operations on list, e.g.,headq.heappush(h, num) is to push num into list h using heap operation. For max heap, I just take a negative of value and insert into a min heap to simulate a max heap.

## No comments:

## Post a Comment