Saturday, September 17, 2011

No. 05 - The Least k Numbers


Question: Please find out the least k numbers out of n numbers. For example, if given the 8 numbers 4, 5, 1, 6, 2, 7, 3 and 8, please return the least 4 numbers 1, 2, 3 and 4.

Analysis: The naïve solution is sort the n input numbers increasingly, and the least k numbers should be the first k numbers. Since it needs to sort, its time complexity is O(nlogn). Interviewers will ask us explore more efficient solutions.

Solution 1: O(nlogk) time efficiency, be suitable for data with huge size

A data container with capacity k is firstly created to store the least k numbers, and then a number is read out of the n input numbers at each time.   If the container has less than k numbers, the number read at current round (denoted as num) is inserted into container directly. If it contains k numbers already, num cannot be inserted directly any more. However, it may replace an existing number in the container.  We get the maximum number of the k numbers in the container, and compare it with num. If num is less than the maximum number, we replace the maximum number with num. Otherwise we discard num, since we already have k numbers in the container which are all less than num and it cannot be one of the least k numbers.

Three steps may be required when a number is read and the container is full: The first step is to find the maximum number, secondly we may delete the maximum number, and at last we may insert a new number. The second and third steps are optional, which depend on whether the number read at current round is greater than the maximum number in container or not. If we implement the data container as a binary tree, it costs O(logk) time for these three steps. Therefore, the overall time complexity is O(nlogk) for n input numbers.

We have different choices for the data container. Since we need to get the maximum number out of k numbers, it intuitively might a maximum heap. In a maximum heap, its root is always greater than its children, so it costs O(1) time to get the maximum number. However, it takes O(logk) time to insert and delete a number.

We have to write a lot of code for a maximum heap, and it is too difficult in the dozens-of-minute interview. We can also implement it as a red-black tree. A red-black tree classifies its nodes into red and black categories, and assure that it is somewhat balanced based on a set of rules. Therefore, it costs O(logk) time to find, insert and delete a number. The classes set and multiset in STL are all based on red-black trees. We may use data containers in STL directly if our interviewers are not against it. The following sample code is based on the multiset in STL:

typedef multiset<int, greater<int> >            intSet;
typedef multiset<int, greater<int> >::iterator  setIterator;

void GetLeastNumbers(const vector<int>& data, intSet& leastNumbers, int k)
{
    leastNumbers.clear();

    if(k < 1 || data.size() < k)
        return;

    vector<int>::const_iterator iter = data.begin();
    for(; iter != data.end(); ++ iter)
    {
        if((leastNumbers.size()) < k)
            leastNumbers.insert(*iter);

        else
        {
            setIterator iterGreatest = leastNumbers.begin();

            if(*iter < *(leastNumbers.begin()))
            {
                leastNumbers.erase(iterGreatest);
                leastNumbers.insert(*iter);
            }
        }
    }
}

Solution 2: O(n) time efficiency, be suitable only when we can reorder the input

We can also utilize the function Partition in quick sort to solve this problem with a hypothesis. It assumes that n input numbers are contained in an array. If it takes the k-th number as a pilot to partition the input array, all of numbers less than the k-th number should be at the left side and other greater ones should be at the right side. The k numbers at the left side are the least k numbers after the partition. We can develop the following code according to this solution:

void GetLeastNumbers(int* input, int n, int* output, int k)
{
    if(input == NULL || output == NULL || k > n || n <= 0 || k <= 0)
        return;

    int start = 0;
    int end = n - 1;
    int index = Partition(input, n, start, end);
    while(index != k - 1)
    {
        if(index > k - 1)
        {
            end = index - 1;
            index = Partition(input, n, start, end);
        }
        else
        {
            start = index + 1;
            index = Partition(input, n, start, end);
        }
    }

    for(int i = 0; i < k; ++i)
        output[i] = input[i];
}

Comparison between two solutions

The second solution based on the function Partition costs only O(n) time, so it is more efficient than the first one. However, it has two obvious limitations: One limitation is that it needs to load all input numbers into an array, and the other is that we have to reorder the input numbers.
Even though the first takes more time, the second solution does have the two limitations as the first one. It is not required to reorder the input numbers (data in the code above). We read a number from data at each round, and all write operations are taken in the container leastNumbers. It does not require loading all input number into memory at one time, so it is suitable for huge-size data. Supposing our interview asks us get the least k numbers from a huge-size input. Obviously we cannot load all data with huge size into limited memory at one time. We can read a number from auxiliary space (such as disk) at each round with the first solution, and determine whether we need to insert it into the container leastNumbers. It works once memory can accommodate leastNumbers, so it is especially works when n is huge and k is small.

The characters of these two solutions can be summarized in Table 1:


First Solution
Second Solution
Time complexity
O(n*logk)
O(n)
Reorder input numbers?
No
Yes
Suitable for huge-size data?
Yes
No
Table 1: Pros and cons of two solutions

Since each solution has its own pros and cons, candidates had better to ask interviews for more requirements and details to choose the most suitable solution, including the input data size and whether it is allowed to reorder the input numbers.

The discussion about this problem is included in my book <Coding Interviews: Questions, Analysis & Solutions>, with some revisions. You may find the details of this book on Amazon.com, or Apress.

The author Harry He owns all the rights of this post. If you are going to use part of or the whole of this ariticle in your blog or webpages,  please add a reference to http://codercareer.blogspot.com/. If you are going to use it in your books, please contact me (zhedahht@gmail.com) . Thanks. 

14 comments:

  1. the second solution is not O(n), that is O(n*n).

    ReplyDelete
  2. The second solution is based on the Partition function, which is also the base of quick sort. The complexity of Partition is O(n), whose details is shown in http://en.wikipedia.org/wiki/Quicksort.

    ReplyDelete
  3. for the second approach, O(n) is the expected time complexity and for the worst case complexity, it's O(n^2)

    ReplyDelete
  4. only if you use median of medians you can get O(n) in worst case

    http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

    ReplyDelete
  5. if k << n
    you can use variation of bubble sort with outer loop running k times and the inner loop runs for n times.

    ReplyDelete
  6. private static int getMax(int [] array, int left, int right){
    if (left=rightmax? leftmax:rightmax;
    }
    return array[left];
    }

    public static void findtheleast(int [] array, int number){
    int max=getMax(array,0,array.length-1);
    int [] tmp=new int [max+1];
    for (int i=0;i<array.length;++i){
    tmp[array[i]]++;
    }

    int i=0;

    while (i<=number){
    if (tmp[i]!=0){
    System.out.print(i+",");
    }
    i++;
    }
    System.out.println();
    }

    ReplyDelete
  7. What a wonderful post. I would like to say that this post is really very informational post. I really like it. data analysis service

    ReplyDelete
  8. Web design is a broad term covering many different skills and disciplines that are used in the production and maintenance of websites. You have share such useful algorithm and binary code in this article. The different areas of web design include; graphic design, interface design, authoring; including standardized code and proprietary software, user experience design. Thanks for share this blog. helpful link

    ReplyDelete
  9. You can easily apply it to your furnishings investors and potential investors with information about the organization will allow. Times the cost of sales promotions that you can gain a better investor increases.About apa format reference generator helpful link

    ReplyDelete
  10. If you want to be a professional programmer then the mathematics is the branch which has a great role in making you a good programmer. You can find your own way with your concepts from different books. With the help of dissertation statistics help you can find the help for the qualitative and quantitative analysis.

    ReplyDelete
  11. Data analysis known as analysis of data or data analytics, is a process of inspecting, cleansing, transforming, and modeling data analysis in qualitative research with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Here you provided good research about the latest k numbers data analysis. Thanks

    ReplyDelete