Finding Percentiles

March 19, 2008

Suppose you are given a multiset of real numbers A=\{a_1,a_2,\cdots,a_n\}. Given a number b, one can ask the question : Find the fraction of points in A that are less than or equal to b. Here is a simple randomized algorithm to solve this.

1. Choose K elements of A uniformly at random. Sort the elements. Let B be the sorted multiset.
2. Find out the fraction g of points in the new multiset B that are less than or equal b.
3. Output g

How good is this algorithm? Here is an attempt at an analysis. For a number x denote by F(x,A) the fraction of elements in A less than or equal to x. Suppose F(b,A) = f. Suppose F(b,B) = g. Then g|B| = K of the elements were chosen from f|A| elements and (1-g)|B| = (|B| - K) of the elements were chosen from (1-f)|A|. Hence the probability is \binom{|B|}{K} \cdot K^f \cdot (|B|-K)^(1-f). Therefore K has a binomial distribution B(|B|,f). The expected value of K is f|B| and the standard deviation is \sqrt{|B|f(1-f)}. By the The probability that we become wrong by \delta is Pr[|g - f| > \delta f]. By using a modified form of the Hoeffding’s bound the probability is upper bounded by 2e^{-\frac{\delta^2 |B| f}{2} }. If f > 0 then how large should |B| be so that we are at most off by 1% of f with probability 0.99? A simple calculation shows that if |B| > 1060/f then we are good. If The nice part about use of this bound is that it does not depend on the size of the underlying multiset A. However there is a dependance on f that is unknown in the first place! However if we assume f \geq 0.1 then we can get away with sampling a few thousand elements. Are there any better known algorithms? In randomized algorithms that produce 0-1 answers it is possible to increase the probability of success by repeating algorithm multiple times and then taking the majority of the output as the true answer. Is there a similar way of improving the overall accuracy of algorithms that produce one of many possible values such as the percentile algorithm?

Superconcentrators From Expanders

March 7, 2008

Today I will be blogging about the construction of superconcentrators from expanders. This is directly from the expanders survey. I am just reviewing it here for my own learning. First some definitions.
Definition 1. A d- regular bipartitie graph (L,R,E), with |L|=n and |R| =m is called a \emph{magical graph} if it saisfies the properties below. For a given set S of vertices, we denote the set of vertices to which some vertex in S is connected as \Gamma(S).

1. For every S with |S| \leq \frac{n}{10}, |\Gamma(S)| \geq \frac{5d}{8} |S|
2. For every S with \frac{n}{10} < |S| \leq \frac{n}{2}, |\Gamma(S)| \geq |S|.

Definition 2. A superconcentrator is a graph G=(V,E) with two given subsets I, O \subseteq S with |I|=|O|=n, such that for every S \subseteq I and T \subseteq O with |S|=|T|=k, the number of disjoint paths from S to T is at least k.

Superconcentrators with O(n) edges are interesting for various reasons which we do not go into here.

But we do give a construction of a superconcentrator with O(n) edges from magical graphs above. A simple probabilistic argument can show the following result.

Theorem There exists a constant n_0 such that for every d \geq 32, such that n \geq m,n_0 \geq \frac{3n}{4}, there is a magical graph with |L|=n,|R|=m.

Here is the construction of a superconcentrator from magical graphs. Assume that we can construct a superconcentrator with O(m) edges for every m \leq (n-1). The construction is recursive. First take two copies of the magical graphs (L_1,R_1,E_1), (L_2,R_2,E_2) with |L_1|=|L_2|=n and |R_1|=|R_2|=\frac{3n}{4}. Connect every vertex of L_1 to the corresponding vertex of L_2 and add edges between R_1=I and R_2=O so that the graph becomes a superconcentrator with the size of the input vertex set as \frac{3n}{4}. We claim that the resulting graph is a superconcentrator with input vertex set of size n.

Identify the input vertex set as L_1 and the output vertex set as L_2. For every | S \subseteq L_1 | = k \leq \frac{n}{2} it is true that | \Gamma(S) \cap R_1 | \geq |S|. Therefore by Halls Marriage Theorem there exists a perfect matching between vertices of S and \Gamma(S) \cap R_1. Similarly there exists a perfect matching between vertices of T \subseteq L_2 and \Gamma(T) \cap R_2. Together with the edges of the superconcentrator between input set R_1 and output set R_2 there are at least k disjoint paths between S and T. It remains to handle the case of k > \frac{n}{2}. For |S|=|T|=k > \frac{n}{2}, there is a subset U \subseteq S of vertices with |U| \geq (k - \frac{n}{2}) such that the vertices corresponding to U in L_2 are in T. Edges between such vertices contribute to (k- \frac{n}{2}) disjoint paths between S and T. The remaining \frac{n}{2} disjoint paths exist as proved earlier. Hence we have a superconcentrator with input, output vertex sets of size n. How many edges are there in this graph? For the base case of this recursion, we let a superconcentrator with input output sets of size n \leq n_0 be the complete bipartite graph with n_0 ^2 edges. The following recursion counts the edges. Let e(k) denote the number of edges in the superconcentrator as per this construction with input, output sets of size k. Then

e(k) = k^2 for k \leq n_0 and e(k) = k + 2dk + e(\frac{3k}{4}) for k > n_0. It can be easily seen that e(n) \leq Cn for C \geq max(n_0,4(2d+1)).

Computing Minimum Spanning Trees With Uncertainty

February 28, 2008

In this post we will partially discuss this paper by Erlebach et. al. The classical MST problem is studied for graphs with edge and vertex uncertainties. In general a problem with uncertainties can be specified as a triple (C,A,\phi) where C, the \emph{configuration} for the problem consists of a set of data points {c_1,c_2,\cdots,c_n} and A={A_1,A_2,\cdots,A_n} is a set of \emph{areas} where c_i \in A_i. The function \phi defines the solution for a particular configuration C. The input of the problem is just the set of areas A. An algorithm can \emph{update} any of the A_i and this determines the value of the data point c_i. The objective is to minimize the number of such updates and still be able to compute \phi(C). An algorithm is k update competitive if it can compute \phi(C) with at most k\cdot OPT+c updates where OPT is the minimum number of updates required by any algorithm.

Example: Given a weighted graph G=(V,E,w) the areas A_e, e \in E can be open intervals such that the edge weight w(e) \in A_e. The function \phi(C) can then be a minimum spanning tree of G.

The \emph{Witness} algorithm in this paper gives a general strategy to solve problems with uncertainties. The algorithm is k update competitive where k is an upper bound on the size of a witness set W. A witness set W is a set of areas such that in order to verify the solution \phi(C) at least one element of W needs to be updated. The witness algorithm is a kind of template for problems with uncertainties as it does not specify how a witness set is to be computed.

Question : In general is it always possible to compute a witness set in polynomial time?

The problem discussed in this paper is an extension of the concept of MST to graphs with uncertain edge weights. A MST of such a graph is said to be a MST in all possible realizations of edge weights, that satisfy the basic condition of the edge weight belonging to its assigned area. The paper provides an algorithm that is 2 update competitive when the areas are open intervals or singletons. Furthermore it is proved that, there is no (2-\epsilon) update competitive algorithm in this case. When there are no restrictions on the areas at all, it is proved that there is no constant update algorithm for the MST problem. The paper also considers graphs with vertex position uncertainties [where vertices are points in an Euclidean space].

The algorithm looks a lot like Kruskals MST algorithm. The main crux it to find witness sets of size 2 as edges are considered in their special sorted order. For details please refer to the paper.

Some comments and questions:

1) How about approximation algorithms for problems with uncertainties? The goal of an approximation algorithm would be to compute an approximate solution to problems with uncertainties with minimum updates.

2) In this model, there seems to be no restriction on the running time of the algorithm but the only importance seems to be given to the number of updates. Is there a model which combines the two meaningfully?

3) The notion of a MST that applies to all possible realizations of a graph with edge uncertainties seems too restrictive.