Uniform convergence (combinatorics)
For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_1, x_2, \dots, x_m)$, where $x_i \in X$, the empirical frequency of $h \in H$ on $x$ is $\widehat{Q}_x(h) = \frac{1}{m}|\{i : 1 \le i \le m, h(x_i) = 1\}|$. The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to a distribution $P$, then with high probability the empirical frequency of every $h \in H$ will be close to its expectation, where the expectation is given by $Q_P(h) = P\{y \in X : h(y) = 1\}$. Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
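As a concrete illustration (our example, not part of the original statement), the following minimal Python sketch simulates this behavior for a hypothetical class of threshold predicates $h_t(x) = \mathbf{1}[x \le t]$ on $X = [0, 1]$, a class of VC Dimension 1, under the uniform distribution; the worst-case gap between empirical and true frequency over the whole class is small even for a moderate sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: H = {h_t(x) = 1[x <= t]} on X = [0, 1] (VC dimension 1),
# with P the uniform distribution on [0, 1], so Q_P(h_t) = t.
thresholds = np.linspace(0.0, 1.0, 101)  # a finite grid standing in for H
m = 1000                                 # sample size
x = rng.uniform(0.0, 1.0, size=m)        # m i.i.d. draws from P

# Empirical frequency of each h_t on the sample versus its true frequency.
empirical = np.array([(x <= t).mean() for t in thresholds])
true_freq = thresholds

# Uniform convergence: the deviation is small simultaneously for every h in H.
print("max_h |Q_P(h) - Q^_x(h)| =", np.abs(true_freq - empirical).max())
```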
Uniform convergence theorem statement[1]
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon > 0$ and $m$ a positive integer, we have
- $P^m\{|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\} \le 4\Pi_H(2m)e^{-\varepsilon^2 m/8},$
where, for any $x \in X^m$,
- $Q_P(h) = P\{y \in X : h(y) = 1\}$,
- $\widehat{Q}_x(h) = \frac{1}{m}|\{i : 1 \le i \le m, h(x_i) = 1\}|$ and $|x| = m$.
$P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
$\Pi_H$ is defined as: for any $\{0,1\}$-valued functions $H$ over $X$ and $D \subseteq X$,
- $\Pi_H(D) = \{h \cap D : h \in H\}$,
where each $h$ is identified with the subset of $X$ on which it takes the value $1$. And for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as:
- $\Pi_H(m) = \max\{|\Pi_H(D)| : D \subseteq X, |D| = m\}$.
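As an illustration (an assumption-laden sketch, not part of the statement), the shattering number of a small finite class can be computed by brute force; here $H$ is taken to be the intervals over a small finite domain:

```python
from itertools import combinations

# Hypothetical brute-force computation of the shattering number Pi_H(m),
# with H = indicator sets of intervals [a, b] over a small finite domain X.
X = list(range(8))
H = [set(range(a, b + 1)) for a in X for b in X if a <= b] + [set()]

def shattering_number(H, X, m):
    """Pi_H(m) = max over D with |D| = m of the number of distinct sets h ∩ D."""
    return max(len({frozenset(h & set(D)) for h in H})
               for D in combinations(X, m))

for m in range(1, 6):
    print(m, shattering_number(H, X, m))
# Intervals have VC dimension 2, so Pi_H(m) grows quadratically, not as 2^m.
```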
From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem, we state Sauer's lemma, which we will need in the proof.
Sauer–Shelah lemma
The Sauer–Shelah lemma[2] relates the shattering number to the VC Dimension.
Lemma: $\Pi_H(m) \le \sum_{i=0}^{d} \binom{m}{i}$, where $d$ is the VC Dimension of the concept class $H$.
Corollary: $\Pi_H(m) \le m^d + 1$, so the shattering number grows only polynomially in the sample size when $d$ is finite.
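As a quick numeric sanity check (our example, not from the source): for intervals on the line, the exact shattering number on $m$ points is $\frac{m(m+1)}{2} + 1$, and it matches the Sauer–Shelah sum for $d = 2$ exactly:

```python
from math import comb

# Hypothetical check of the Sauer–Shelah bound for d = 2 (intervals on a line):
# the exact shattering number on m points is m(m+1)/2 + 1.
d = 2
for m in [2, 4, 8, 16]:
    exact = m * (m + 1) // 2 + 1
    sauer = sum(comb(m, i) for i in range(d + 1))
    print(m, exact, sauer, m**d + 1)
# exact == sauer <= m^d + 1 at every m; all grow polynomially in m.
```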
Proof of uniform convergence theorem[1]
Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview of the proof.
- Symmetrization: We transform the problem of analyzing $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$ for some $h \in H$ into the problem of analyzing $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$.
- Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them will not change the probability distribution on $r$ and $s$. So, we will try to bound the probability of $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x = r||s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ in some subset of $\{1, 2, \dots, m\}$. The symbol $r||s$ means the concatenation of $r$ and $s$.
- Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC Dimension, the problem reduces to one involving a finite function class.
We present the technical details of the proof.
Symmetrization
Lemma: Let $V = \{x \in X^m : |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\}$ and
- $R = \{(r, s) \in X^m \times X^m : |\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \frac{\varepsilon}{2} \text{ for some } h \in H\}$.
Then for $m \ge \frac{2}{\varepsilon^2}$,
- $P^m(V) \le 2 P^{2m}(R)$.
Proof:
By the triangle inequality, if $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$ and $|Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}$ then $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \frac{\varepsilon}{2}$.
Therefore,
- $P^{2m}(R) \ge P^{2m}\{(r, s) : \exists h \in H,\ |Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}\}$
- $\ge \int_V P^m\{s : \exists h \in H,\ |Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}\}\, dP^m(r)$
[since $r$ and $s$ are independent].
Now, for $r \in V$, fix an $h \in H$ such that $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$. For this $h$, we shall show that
- $P^m\{s : |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}\} \ge \frac{1}{2}$.
Thus for any $r \in V$ the integrand above is at least $\frac{1}{2}$, and hence $P^{2m}(R) \ge \frac{1}{2} P^m(V)$. This establishes the first step of the high-level plan.
Notice that $m \cdot \widehat{Q}_s(h)$ is a binomial random variable with expectation $m \cdot Q_P(h)$ and variance $m \cdot Q_P(h)(1 - Q_P(h))$. By Chebyshev's inequality we get
- $P^m\{|Q_P(h) - \widehat{Q}_s(h)| > \frac{\varepsilon}{2}\} \le \frac{m \cdot Q_P(h)(1 - Q_P(h))}{(\varepsilon m / 2)^2} \le \frac{1}{\varepsilon^2 m} \le \frac{1}{2}$
for the mentioned bound on $m$. Here we use the fact that $x(1 - x) \le \frac{1}{4}$ for $x \in [0, 1]$.
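To see this Chebyshev step numerically, here is a small hedged simulation (parameters chosen purely for illustration): for a single fixed $h$ with $q = Q_P(h)$, $m \cdot \widehat{Q}_s(h)$ is Binomial$(m, q)$, and at $m = 2/\varepsilon^2$ the deviation probability sits well below the bound of $\frac{1}{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical check of the Chebyshev step for one fixed h with q = Q_P(h):
# m * Q^_s(h) ~ Binomial(m, q), so P{|q - Q^_s(h)| > eps/2} <= 1/(eps^2 * m).
q, eps = 0.3, 0.1
m = int(np.ceil(2 / eps**2))      # the bound on m assumed in the lemma
emp = rng.binomial(m, q, size=100_000) / m
print("empirical tail:  ", np.mean(np.abs(q - emp) > eps / 2))
print("Chebyshev bound: ", 1 / (eps**2 * m))  # equals 1/2 at m = 2/eps^2
```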
Permutations
Let $\Gamma_m$ be the set of all permutations of $\{1, 2, \dots, 2m\}$ that swap $i$ and $m + i$ for $i$ in some subset of $\{1, 2, \dots, m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,
- $P^{2m}(R) = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}} (\Pr[\sigma(x) \in R])$,
where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof:
For any $\sigma \in \Gamma_m$,
- $P^{2m}(R) = P^{2m}\{x : \sigma(x) \in R\}$
[since coordinate permutations preserve the product distribution $P^{2m}$]. Therefore,
- $P^{2m}(R) = \int_{X^{2m}} 1_R(x)\, dP^{2m}(x)$
- $= \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} \int_{X^{2m}} 1_R(\sigma(x))\, dP^{2m}(x)$ [because $\Gamma_m$ is finite]
- $= \int_{X^{2m}} \Pr[\sigma(x) \in R]\, dP^{2m}(x)$ [the expectation over $x$]
- $\le \max_{x \in X^{2m}} (\Pr[\sigma(x) \in R])$.
The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
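A uniformly random element of $\Gamma_m$ is easy to sample: independently flip a fair coin for each index $i \le m$ and swap coordinates $i$ and $m + i$ on heads. A minimal sketch (our illustration, with a hypothetical function name):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_swap_permutation(x, rng):
    """Apply a uniformly random element of Gamma_m to the joint sample x = r||s."""
    m = len(x) // 2
    y = x.copy()
    flips = rng.random(m) < 0.5      # the random subset of {1, ..., m} to swap
    y[:m][flips] = x[m:][flips]      # coordinate i gets the old x_{m+i}
    y[m:][flips] = x[:m][flips]      # coordinate m+i gets the old x_i
    return y

# Because the 2m coordinates of x are i.i.d., sigma(x) has the same distribution
# as x, which is exactly why P^{2m}(R) = E[Pr[sigma(x) in R]] in the lemma.
x = rng.normal(size=10)              # a joint sample r||s with m = 5
print(random_swap_permutation(x, rng))
```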
Reduction to a finite class
Lemma: Building on the previous lemma,
- $\max_{x \in X^{2m}} (\Pr[\sigma(x) \in R]) \le 4 \Pi_H(2m) e^{-\varepsilon^2 m / 8}$.
Proof:
Let us define $x = (x_1, x_2, \dots, x_{2m})$ and let $t = |\Pi_H(\{x_1, \dots, x_{2m}\})|$ be the number of distinct restrictions of $H$ to the points of $x$, which is at most $\Pi_H(2m)$. This means there are functions $h_1, h_2, \dots, h_t \in H$ such that for any $h \in H$ there is an $i$ between $1$ and $t$ with $h_i(x_k) = h(x_k)$ for all $1 \le k \le 2m$.
We see that $\sigma(x) \in R$ iff some $h$ in $H$ satisfies
- $\left| \frac{1}{m} |\{1 \le i \le m : h(x_{\sigma(i)}) = 1\}| - \frac{1}{m} |\{m + 1 \le i \le 2m : h(x_{\sigma(i)}) = 1\}| \right| \ge \frac{\varepsilon}{2}$.
Hence define $w_i^j = 1$ if $h_j(x_i) = 1$ and $w_i^j = 0$ otherwise, for $1 \le i \le 2m$ and $1 \le j \le t$. Then $\sigma(x) \in R$ iff some $j$ in $\{1, \dots, t\}$ satisfies
- $\left| \frac{1}{m} \left( \sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=1}^{m} w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2}$.
By the union bound we get
- $\Pr[\sigma(x) \in R] \le t \cdot \max_{1 \le j \le t} \left( \Pr\left[ \left| \frac{1}{m} \left( \sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=1}^{m} w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2} \right] \right)$
- $\le \Pi_H(2m) \cdot \max_{1 \le j \le t} \left( \Pr\left[ \left| \frac{1}{m} \left( \sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=1}^{m} w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2} \right] \right)$.
Since the distribution over the permutations $\sigma$ is uniform, for each $i$ the difference $w_{\sigma(i)}^j - w_{\sigma(m+i)}^j$ equals $\pm |w_i^j - w_{m+i}^j|$, with both signs equally likely.
Thus,
- $\Pr\left[ \left| \frac{1}{m} \sum_{i=1}^{m} \left( w_{\sigma(i)}^j - w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2} \right] = \Pr\left[ \left| \frac{1}{m} \sum_{i=1}^{m} |w_i^j - w_{m+i}^j|\, \beta_i \right| \ge \frac{\varepsilon}{2} \right]$,
where the probability on the right is over the independent random signs $\beta_i \in \{-1, +1\}$, with both possibilities equally likely. By Hoeffding's inequality, this is at most $2 e^{-m \varepsilon^2 / 8}$.
Finally, combining all three parts of the proof, we get the Uniform Convergence Theorem: the probability of a large deviation for some $h \in H$ is at most $2 \cdot \Pi_H(2m) \cdot 2 e^{-m \varepsilon^2 / 8} = 4 \Pi_H(2m) e^{-\varepsilon^2 m / 8}$.
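As a closing sanity check (a hedged simulation with illustrative parameters, not part of the proof), the final Hoeffding step can be verified directly: with a fixed $0/1$ pattern $a_i = |w_i^j - w_{m+i}^j|$ and uniform random signs $\beta_i$, the tail probability stays below $2 e^{-m \varepsilon^2 / 8}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical check of the Hoeffding step: a_i = |w_i - w_{m+i}| in {0, 1} is
# fixed, beta_i is uniform on {-1, +1}, and we bound the deviation of the mean.
m, eps = 400, 0.2
a = rng.integers(0, 2, size=m)                 # a fixed 0/1 pattern
beta = rng.choice([-1, 1], size=(100_000, m))  # random signs, equally likely
dev = np.abs((beta * a).sum(axis=1)) / m
print("empirical tail: ", np.mean(dev >= eps / 2))
print("Hoeffding bound:", 2 * np.exp(-m * eps**2 / 8))
```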