## Finding Percentiles

Suppose you are given a multiset of real numbers $A=\{a_1,a_2,\cdots,a_n\}$. Given a number $b$, one can ask the question : Find the fraction of points in $A$ that are less than or equal to $b$. Here is a simple randomized algorithm to solve this.

 1. Choose $K$ elements of $A$ uniformly at random. Sort the elements. Let $B$ be the sorted multiset. 2. Find out the fraction $g$ of points in the new multiset $B$ that are less than or equal $b$. 3. Output $g$

How good is this algorithm? Here is an attempt at an analysis. For a number $x$ denote by $F(x,A)$ the fraction of elements in $A$ less than or equal to $x$. Suppose $F(b,A) = f$. Suppose $F(b,B) = g$. Then $g|B| = K$ of the elements were chosen from $f|A|$ elements and $(1-g)|B| = (|B| - K)$ of the elements were chosen from $(1-f)|A|$. Hence the probability is $\binom{|B|}{K} \cdot K^f \cdot (|B|-K)^(1-f)$. Therefore $K$ has a binomial distribution $B(|B|,f)$. The expected value of $K$ is $f|B|$ and the standard deviation is $\sqrt{|B|f(1-f)}$. By the The probability that we become wrong by $\delta$ is $Pr[|g - f| > \delta f]$. By using a modified form of the Hoeffding’s bound the probability is upper bounded by $2e^{-\frac{\delta^2 |B| f}{2} }$. If $f > 0$ then how large should $|B|$ be so that we are at most off by 1% of $f$ with probability $0.99$? A simple calculation shows that if $|B| > 1060/f$ then we are good. If The nice part about use of this bound is that it does not depend on the size of the underlying multiset $A$. However there is a dependance on $f$ that is unknown in the first place! However if we assume $f \geq 0.1$ then we can get away with sampling a few thousand elements. Are there any better known algorithms? In randomized algorithms that produce $0-1$ answers it is possible to increase the probability of success by repeating algorithm multiple times and then taking the majority of the output as the true answer. Is there a similar way of improving the overall accuracy of algorithms that produce one of many possible values such as the percentile algorithm?