This course is a formal introduction to data analysis rather than just
a collection of things you can do with data. Most data you will
encounter are produced by some instrument designed for purpose and
the basic philosophy I will present is
that a very large fraction of data analysis can be described using the
same formalism involving an instrument response coupled with
systematic errors and noise.
Such a formal approach has advantages. When confronted by a data analysis
problem you should try and describe it using the recipe I will provide.
Having done this the solution may be immediately apparent, using a
well established technique or some method developed in another
discipline but equally applicable in the situation presented by your
data. Even if the solution is not so obvious the various methods which
might be applicable will be clear. It is often the case that
new data sets are very similar to old except that some feature of the
system is slighly different. Established methods may be applicable if
you make some minor adjustment or some acceptable approximation.
If all this fails you will have a succinct statement of your problem
on which to base a novel solution. You will be well placed to
invent your own data analysis techniques.
In general an instrument responds to a function in some input space
and produces a data set in some output space. For example the objective
of a camera takes the angular distribution of light falling on the
aperture and converts it into an intensity distribution over the focal plane.
In turn this distribution is converted into coloured dyes over the surface
of a plastic film. The input and output space are VERY different
but if the camera and film are well made and correctly used there is
a tight relationship between the original light distribution and
the processed result.
The response of a large class of instruments can be modelled using
an integral equation of the form:
g(x¢,y¢,z¢) =
ó õ
ó õ
ó õ
f(x,y,z) h(x¢,y¢,z¢,x,y,z) dx dy dz
f(x,y,z) is a piecewise continuous function over the input space and
g(x¢,y¢,z¢) is a continuous function over some output space.
h(x¢,y¢,z¢,x,y,z) is the response function and in general
depends on both the input and the output space. For simplicity we shall
consider a 1-D system:
g(x¢) =
ó õ
f(x) h (x¢,x) dx
The extension to higher dimensions is normally straight forward but
requires care and attention to details.
Note that if the input is an impulse f(x)=Ad(x-a) then using
the properties of the Dirac delta function:
g(x¢) = A h(x¢,a)
The response function can be thought of as a normalised version of
the output we expect from a
perfect impulse input. In principle we can measure the response function
h(x¢,a) by stimulating the system with an impulse input at x=a.
A perfect instrument would have a response:
h(x¢,x) = C d( x¢-x)
where C is just a scaling constant.
A linear instrument has a response such that if
f1 Þ g1 and f2 Þ g2 then
f1 + f2 Þ g1 + g2
So, for instance, if we double the input level the output level is
also doubled.
An isoplanatic system has a response which is only a function of
the difference between the input and output coordinates:
h(x¢,x) = h(x¢- x)
This means that the shape or form of the response is the same for every
input position. In this case the system equation becomes a convolution
of faltung integral:
g(x¢) =
ó õ
f(x) h(x¢-x) dx
Although in most cases the underlying response of the instrument is properly
modelled by an integral equation involving continuous functions the use
of computers and digital techniques naturally leads to a digitization
of the response. The input and output functions become vectors (or more
generally matrices and tensors) and the integration becomes a summation:
gi =
å
j
hij fj
We must be careful that this representation is a good approximation to the
true response by proper choice of the samples in the input and output space.
If the function h(x¢,x) changes rapidly as a function of x or x¢ then
we must use a large number of samples across the change. We must
also make sure that all finite (significant) contributions to the
integral are included in the range covered by i and j.
Now the instrument response has become a RESPONSE MATRIX. It
is usually only an approximation to reality because of
numerical integration errors. If the
response is linear the elements of hji are constants and if it is
isoplanatic then each row (or column) of the matrix is just a circular
permutation of the first row (or column).
æ ç ç ç
ç ç è
1
2
3
4
4
1
2
3
3
4
1
2
2
3
4
1
ö ÷ ÷ ÷
÷ ÷ ø
Because of this the matrix multiplication performs an approximation
to a circular convolution. To avoid wrap-around (aliasing)
problems you must pad out the rows (or columns) with zeros.
If the instrument involves more than 1 dimension then the input and output
matrices (or tensors) can be stacked to form vectors. The response will
still be a matrix but will have a complicated banded structure representing
the coupling or blurring between the different coordinates. So the
simple digitized system equation is still valid but the interpretation of the
indices i and j is rather complicated. If the
the system is isoplanatic and the response is separable:
h(x¢,y¢,x,y)=hx(x¢-x)hy(y¢-y)
and the response matrix has a so called Toeplitz form such that
hij=hmn if i-j=m-n. If the system response is isoplanatic but not
separable then the stacked response matrix has a block Toeplitz form.
Whether we model our instrument using an integral equation or a summation
of discrete elements it is always an idealisation. The real world includes
something extra in the data that is not included in our prescription.
These are, of course, the dreaded errors or instrument noise
hated by undergraduates.
Actually it is these errors or the noise which
makes data analysis interesting and a challenge.
Systematic errors are generally things which we could/should have included
in the system response but couldn't/didn't. They can be eliminated or at
least reduced by calibration.
Noise is something which is inherent in the processes involved in the
measurement system but which is not amenable to the instrument response
prescription. It can't be included in the response function but must
be allowed for in data analysis.
It is conventional to include both systematic errors and noise
as an additive term in the response equation:
gi =
å
j
hij fj + ni
If each ni is independent of its neighbours and independent
of the input fj then the noise is said to be uncorrelated.
In many cases this is a good approximation. Note that this does
not mean that ni is independent of i. The noise can still be
a function of position in the output space and uncorrelated.
In many modern instruments the noise arises directly
from the inherent statistical nature of the processes involved. This can
be thermal noise, for instance in an electrical resistance, or counting
statistics when detecting individual particles, electrons, photons,
neutrons etc..
The noise can come from the source itself in which case f(x) is
essentially a probability density distribution or fj are probabilities
if the samples are essentially integrals over increments in x.
The noise can be instrumental in which case h(x¢,x) (or hij)
describes the mean behaviour of the system and ni is
something which is added by the measurement process.
In many cases we can think of the output function g(x¢) as a probability
distribution (or gi as probabilities) and then ni (or at least
a component of ni) is just
characteristic of the output of the instrument.
In any particle counting experiment we must encounter counting
statistics. gi will then be the mean number of particles expected
in a given sample x ® x+Dx
over a given time interval. If the particles are
independent (i.e. not bunched or anti-bunched) then the probability
of detecting m=0,1,2,¼ events (particles) will be governed by the
Poisson distribution:
p(m) =
lm exp(-l)
m!
where l = m is the mean number detected (which is gi in
our description above). The variance of the Poisson
distribution is s2=l so it will depend on gi
and is NOT independent of the object distribution. The output of a
particle counting experiment will also contain spurious or background
events which are not related to the desired signal. These must be
included in ni and appear as an additive term in the mean of the Poisson
distribution so gi=li=si+bi where si is the mean
signal count expected in the ith sample and bi is the mean background
count. The variance is then given by the total mean count
si2=si+bi.
The Poisson distribution
is a limiting case of the Binomial distribution when the probability
q of detecting an event in a single trial is very small. If we make
N trials then:
p(m)=
N!
m!(N-m)!
qm(1-q)N-m
The mean of this Binomial distribution is m = Nq and the variance is
s2=Nq(1-q).
Finally if N is very large and neither q or 1-q is too small then
the binomial distribution becomes a continuous probability
density, the normal or Gaussian distribution:
p(z)=
1
Ö
2p
exp(-
z2
2
)
where z=[(m-Nq)/(Ö{Nq(1-q)})] a standardized continuous variable.
It is of course the normal distribution that we usually use to model
the additive noise ni in experiments which don't involve counting
events:
p(ni)=
1
si
Ö
2p
exp
-(ni-mi)2
2si2
where mi and si2 are the mean and variance of the noise
in the ith sample.
In many problems it is necessary to consider a statistical model for the
source and noise processes using moments or expectation values to
characterise the underlying probability density function. In discrete
notation using the sampled functions the first two moments, the mean
and covariance are
uf= < f >
[ff] = < (f-uf)
~
(f-uf)
>
where < > denotes the expectation value and [()\tilde] the transpose.
uf represents the mean as a function of position and
[ff] describes the correlation between different positions.
A stochastic process is said to be stationary if uf
is a constant vector and [ff] has a Toeplitz form (or block
Toeplitz if we are dealing with a stacked set of multiple dimensions). This
means that the correlation function of the process only depends on the
difference in position rather than the absolute position.
A matrix closely related to the above moments is the correlation matrix
we can state the problem posed in data analysis; given a set of
measurements gi what can we say about the source vector fj
or the source function f(x)?
If the noise function n(x¢) is independent of the unknown source function
f(x) then we are trying to solve a Fredholm equation of the first kind.
If n(x¢) is a function of f(x) then the integral equation is said to
be a Fredholm equation of the second kind. In either case it is important
to remember that in almost all experiments the discrete notation is an
approximation to an integral equation.
If the instrument response is a complicated transformation (a Fourier
transform say) or if the response is rather poor, then the
major problem is to calculate a good estimate of fj given
gi. This is usually called an inversion problem. How easy
the inversion is depends on the form of the response matrix and the
noise level. Good examples of inversion problems are aperture synthesis
in radio astronomy or the correction for the optical
blurring in the Hubble telescope. When performing an inversion the central
question is how reliable is the result? What features of
[^(f)] can we believe and which are due to noise?
If you are lucky (or if you don't make a mistake manufacturing the
primary mirror) then the instrument response is good and it is
easy to estimate fj given the data gi. Then the problem
shifts to another plane. What can we deduce about the processes that
produce f(x) given the measurements gi Þ fj?
In this case we can wonder about the physical system beyond the
immediate instrument and the problem is a modelling problem.
Although the response hij may not present any difficulty the noise ni
will still be a limiting factor. In this case we require to find good
estimates of the model parameters and confidence limits on these parameters.
by pre-multiplying both sides by the inverse of the
response matrix
[h]-1g=[h]-1[h]f+[h]-1n
If [h]-1 exists, [h]-1[h]=[I] the unit matrix and if there
is no noise n=0 then this procedure will recover the original
source vector
[h]-1g=f
Unfortunately the inverse [h]-1 is usually ill-conditioned (some of the
elements of [h]-1 are very large because they are formed by division
by very small numbers) or doesn't exist at all (the determinant is zero).
When we attempt the inversion the noise term [h]-1n blows
up and dominates. What is needed is some pseudo inverse matrix which
is optimized to recover as much of the source vector as possible in the
presence of noise. In practice this means supressing the ill-conditioned
elements in the true inverse [h]-1 and trying to produce a best
estimate of the source vector [^(f)].
Apart from the limitations imposed by noise most data sets are very large
and the instrument response matrix is therefore rather large although it
may be fairly sparse. The accurate inversion of such large matrices is difficult
and time consuming and in many cases is not practical.
Least squares or maximum likelihood fitting can yield model parameter
values, c1,c2,¼ which are is some way the best estimates of the
true values given the data. However on their own these methods can't tell
us, firstly, whether or not the model is a good interpretation (fit) of the
data and, secondly, what confidence limits (errors) we should assign
to the parameter values found.
The workhorse of methods for testing goodness of fit is the Chi-squared
statistic. This can be used to fit a probability distribution
to a measured data set,
to test the goodness of fit in some fitting procedure, to yield confidence
limits for derived parameters and forms the basis of a fitting procedure in
its own right.
Chi-squared is based on the hypothesis that the optimum description
of a set of data is that which minimises the weighted sum of
the squares of the differences between the data values and predicted
values. The variance of the fit is given by
v2=
1
N-k
å
wi ( gi-
^
g
i
)2
where N is the number of data points, k is the number of parameters
c1,c2,¼,
[^g]i=y(x,c1,c2,¼) some fitting function and wi
is a weighting factor for each data point:
wi =
1
s2i
1
N
å
1
s2i
This is a measure of the uncertainty on each data point normalized to
the average uncertainty of all the points. Note that if all the data
variances are constant s2i=s2 then wi=1 and v2
is simply a sum of square differences divided by N-k.
Chi-squared is defined as
c2=
å
1
s2i
(gi-
^
g
i
)2
So
c2
N-k
=
v2
< s2i >
=c2n
where n = N-k is the number of degrees of freedom and
< s2i > = [
1
N
å
1
s2i
]-1
the weighted sum of the variances. c2n is called the reduced
Chi-squared.
If the fit is good we expect that v2 will be close to the underlying
variance of the parent process and c2n » 1. If
c2n is too large then the fit variance must be much larger than
the expected value and the fit is poor. If c2n is too small
then you may have under estimated the errors or included
too many parameters in the fit.
We can attempt to quantify the goodness of fit using the expected
probability density distribution of c2=z2
p(z2,n)=
(z2)[(n-2)/2]exp
-z2
2
2[(n)/2] G(
n
2
)
The integral of this function between z2=c2 and z2=¥
P(c2,n) =
ó õ
¥
c2
p(z2,n) d z2
can be found tabulated in many statistics or data analysis books.
So if you search for parameter values c1,c2,¼ which produce
the global minimum c2 you can estimate the probability that
the c2min value found exceeds that value. That is if you performed
the experiment an infinite number of times then some fraction of the
data sets (fraction=P(c2,n)) is expected to yield a resulting
c2 larger than the one found from the actual data set.
Unfortunately c2 combines two factors, how good the underlying
function y(x,c1,c2,¼) describes the data AND the deviations
between the best fit function found and the data set. For example
you could achieve a c2n » 1 if the error estimates
were too large and the function was a fairly bad fit, or the error estimates
were too small and too many parameters were used to produce the fit.
The formal development and justification for the c2 fitting procedure
outlined above is rather involved. This is discussed in some detail by
Lampton, Margon and Bowyer (1976) [14]. In the definition cited
c2=
å
1
s2i
(gi-
^
g
i
)2
the statistic will only be the true Chi-squared if the quantities
[^g]i and s2i are the true population mean and
variance values. In any experiment these are only known to nature and
not the experimenter. In the idealized instance that they are the true
values each term in the summation
will be a sample of a random variable with zero mean and unit variance and
in the limit of a very large number of samples (N® ¥)
the statistic
will be normally distributed. When the number of samples is small (or finite)
the statistic has the sampling probability distribution of c2true
with N degrees of freedom n = N as quoted above.
What we actually calculate is a c2min using a fitting function
to yield [^g]i. It can be shown that this sampling statistic
is distributed as c2 with n = N-k degrees of freedom when using
k functional parameters. This minimization provides the best fit
values for the parameters c1,c2,¼ and the c2min
value.
Analysis shows that the difference between the c2true and
c2min will also have a sampling distribution of c2
but with k degrees of freedom
DS = c2true-c2min=c2k
This difference is independent of the c2true value and the number
of data points and only depends on the number of parameters used to generate
the c2min.
What we want is the definition of a confidence region or volume in the
parameter space so that we can assign errors or confidence ranges
to the parameters c1,c2,¼. As we vary the parameter values
about the minimum so we expect the Chi-squared to increase. If the increase
exceeds some limiting value then the probability that the parameters
are correct is small ( < 10% say). The limiting value of the increase is
c2limit given by
P(climit2,k) =
ó õ
¥
c2limit
p(z2,k) dz2 = 0.90
where k is the number of parameters we are varying. So if we fix one
of the k parameters
at a series of values about the minimum and for each of
these fixed values we find the
minimum Chi-squared by varying all the remaining parameters
we can find the 90% confidence interval for that one parameter.
We can do this for each parameter in turn to define the complete confidence
region. This procedure ensures the independence of all the fitting parameters.
It is not the same as fixing all the parameters at their best
fit values and then varying just one parameter at a time. If this is done
then 0.90=P(climit2,1) (i.e. k=1)
and the confidence limits for each parameter are much smaller.
In doing this you are not allowing for the interplay between the
parameters in the fitting procedure. Actually
if there are a large number of parameters it may be legitimate to
assume constant values for
some of the parameters and not consider them in
the error analysis but just fix them at their best fit or some other
values. This
reduces the number k and the volume of the confidence region.
However you must be consistent in this designation. Once a parameter is
assumed constant you can't change your mind later, vary this parameter
and start quoting confidence limits for it.
Although the Dc2 above is independent of c2min
the procedure is only sensible if the initial c2min is
reasonable. In the above we have implicitly assumed that
c2min » c2true. There is no point in calculating
confidence limits for a model which is poor to start with or if the
error estimates are wrong.
It is usually possible to lower the c2min value by introducing
more parameters into the fit. So if the fit is poor we can include a new
feature into the fitting function to try and model the discrepancy.
It is therefore legitimate to ask at what
point is the introduction of another parameter no longer justified by the
data? This is the purpose of the so called F-test.
If we take the ratio of two Chi-squared statistics then the sampling
probability distribution is the F distribution
pf(z,n1,n2) =
c21/n1
c22/n2
which has a functional form
pf(z,n1,n2) =
G(n1/2+n2/2)
G(n1/2)G(n2/2)
æ è
n1
n2
ö ø
n1/2
z1/2(n1-2)
(1+zn1/n2)1/2(n1+n2)
When we construct such a ratio it is the integral of the distribution
which is of most interest
P(F,n1,n2) =
ó õ
¥
F
pf(z,n1,n2)dz
So to test for the necessity of an additional fitting parameter (or parameters)
we take the ratio of the minimum Chi-squared values obtained with and
without the new parameters. We then look up this F value in a tabulated
form of the integrated probability above using the appropriate n1
and n2 values. If the ratio is larger (or smaller depending
on which way up we take the ratio) than that expected at a given level
of confidence then the inclusion of the new parameters is significant
at that level. Note that P(F12,n1,n2) is not the same
as P(F21,n2,n1) unless n1 and n2 are rather
large so we should test both the ratio and its reciprocal taking
care with the order of the n1 and n2 values.
If the inclusion of some parameters is not justified by a significant
improvement in Chi-squared as defined by the F-test then there is no
point in deriving confidence limits for those parameters or indeed even
including them in the fit. Such parameters should be removed or fixed
in the fitting procedure.
7 Linear, isoplanatic systems,
stationary processes and Fourier filtering
The linear, isoplanatic system has a system response matrix with the
Toeplitz (circulant) form. A circulant matrix is diagonalized by the
Discrete Fourier Transform (DFT). Similarly the block Toeplitz form
generated by stacking more than 1 dimensions is diagonalized by the
multidimensional DFT. The elements of the 1-D DFT matrix are
tkj=exp(i2pkj /l)
tkj-1=exp(-i2pkj /l)/l
where i=Ö{-1} and l is the number of rows and columns.
The properties of the complex exponential function lead to the Fast
Fourier Transform algorithm (FFT) and so matrix diagonalisation
(and inversion) for such a system can be performed very quickly.
Note that diagonalisation of the linear, isoplanatic system
by the DFT can be thought of as the
discrete convolution theorem the digital analogue of the
familiar convolution theorem which holds for the Fourier transform.
Stationary processes have a covariance matrix with a Toeplitz form
which is diagonalized by the DFT. By definition, the diagonal
elements of the resultant matrix [Lf] form the
power spectrum of the stochastic process. (When dealing
with continuous functions rather than vectors the power spectrum is the Fourier
transform of the autocorrelation function.)
Discrete Fourier filtering is carried out by taking the DFT of the data
vector
G=[t]-1g
multiplying each element by a filter weighting
G¢=WG
and then taking the inverse transformation
g¢=[t] G¢
If we construct a diagonal matrix from the weighting (filter) vector
Lii=Wi
we can define a matrix
[w]=[t][L][t]-1
and write the operation in the real domain as a matrix multiplication
g¢=[w]g
A consequence of the above is that if we decide to perform Fourier
filtering on our data then we are implicitly treating the underlying
processes as stationary and we are performing a convolution in the
real domain. A Fourier filter treats all areas of the data plane equally; it
performs the same correlation everywhere in the data space so it
cannot adapt to changes in the noise or signal correlations or the
instrument response as a function of position.
If we consider the inversion problem
an estimate of the original source vector [^(f)] is required. The
error between this estimate and the true vector is:
e=f-
^
f
Using the principle of least squares we want to minimize the mean square
error
min <
~
e
e >
where [(e)\tilde] is the transpose of the error vector and
< > indicates the expectation value of the product.
We want a matrix operator or filter [w] such that
^
f
=[w]g
Using the system equation to substitute for g and minimizing
the square error yields:
[w]=[cf]
~
[h]
([h][cf]
~
[h]
+ [cn])-1
where [cf] and cn] are the correlation matrices of the source
and the noise
[cf]= < f
~
f
>
[cn]= < n
~
n
>
and the noise is assumed to be independent of the source
[cnf]= < n
~
f
> = [cfn]=[0]
This form of filter [w] was originally applied to time series by Wiener,
1949 [29] and is therefore called a Wiener Filter. It
has been used successfully in image processing and other areas of data
analysis following the work of Helstrom (1967).
It is only easy to implement the Wiener filter if a diagonalization
transformation exists so it has mostly been used as a Fourier filter
[t]-1
^
f
= [ Lf][Lh*] ([Lh][Lf][Lh*]+[Ln])-1[t]-1g
where the diagonal elements of [Lh] are the DFT of the
response function and those of [Lf] and [Ln] are
the power spectra of the source and noise processes (assumed to be
stationary).
The Wiener filter constructed in this way attempts to restore the best
estimate of the source vector (function) assuming a particular simple
form for the system response, source statistics and noise.
Perhaps
a more familiar application of the Principle of least squares is
model fitting.
Suppose the estimate [^f](x) (or if you prefer
[^(f)]) is modelled
by a function y(x,c1,c2,¼) where c1,c2,¼ are
unknown parameters or coefficients which are often physical attributes of
the source under study, temperature, density, etc.. Such a source function
would lead to an estimated data function
^
g
(x¢)=
ó õ
y(x,c1,c2,¼) h(x¢,x) dx
The error function (vector) is then
e = g-
^
g
and we can again search for
min <
~
e
e >
varying the parameters c1,c2,¼ to find the best solution.
For example if the instrument response is perfect (a delta function)
and y=c1+c2x we get the familiar linear regression formulae
c1=
å
i
gi
å
i
xi2-
å
i
xi
å
i
xi gi
N
å
i
xi2 -(
å
i
xi)2
c2=
N
å
i
xi gi -
å
i
xi
å
i
gi
N
å
i
xi2 -(
å
i
xi)2
where N is the number of data points.
Similar expressions can be found for the coefficients of
higher order polynominals but if y is a complicated function or
if the system response is not perfect then we must resort to
some form of searching algorithm to find the least squares solution.
Any set of observations of some variable taken at different times is
called a time series. The times are usually equally spaced and form
a continuous sequence but can be unequally spaced and often contain
gaps because of viewing constraints or exposure problems.
Each sample can be a snapshot
of the variable or some form of average value taken over a
time sample. In most cases the measurement instrument does not blur
or mix times so that the response hij is diagonal with gaps in the
data represented by missing elements along the diagonal. In some cases
the variable we measure is a summation of different components which
suffer different time delays and then off-axis elements in hij
will represent the time delays.
The analysis of time series is the subject of many chapters in books on
statistics and signal processing.
In general the movements or fluctuations seen are classified into types
Trends or long term secular movements
Periodic or cyclical variations
Irregular or random fluctuations
So the usual problem of time series analysis is trying to extract these
features from the data and characterise them rather than deconvolving
the instrument response.
We can model the time series as a sample of a continuous function
g(t) multiplied by a window function to represent gaps or exposure variations.
gi=g(ti) w(ti)
As with all data anaylsis, superimposed on top
of the underlying behaviour, be it periodic, long term trend or irregular,
is noise, so the sampling by the instrument is described by an integral equation
of the form
g(ti)=
ó õ
+¥
-¥
h(ti,t) f(t) dt +n(ti)
If this process is effectively instantaneous then h(ti,t)=d(ti-t)
but most instruments will integrate over some small time interval so that
h(ti,t)=h(ti-t) where h(ti-t)=1 for
0 < ti-t < t and h(ti-t)=0 otherwise. Providing t is smaller
than the time between samples then each gi will be independent of its
neighbours.
Within this model we want to characterise f(t) give gi.
Trend analysis is best done by linear regression, least squares fitting
or Chi-squared fitting as described above. A Chi-squared test of the
hypothesis that f(t) is constant is a good way of seeing whether there
is evidence for variability of any sort in the data although it can
be fooled if there is a very weak periodic signal present as you will see
below.
If the samples are equally spaced at Dt
we can look for periodic behaviour using the discrete Fourier transform
(DFT) as described under the section on linear, isoplanatic systems.
G = [t]-1g
where for a vector of length l the forward transform matrix is
t-1kj=
1
l
exp(-i 2 pkj/l)
This expresses the vector g as the linear sum of Fourier
components (eigenvectors) with each Gi corresponding to the
frequency
ni =
i
l Dt
where i=0,1,2,3,¼,l-1 and l is the total number of samples. Gi is
complex but if gi are real (which is usually the case) then only the
first l/2+1 samples of Gi will be required. The highest frequency
in the data is represented by i=l/2 with a frequency of
n = 1/(2Dt) the Nyquist frequency (2 samples per period).
The remaining samples
i=l/2+1® l-1 are a mirror image of the low frequency samples
Gi=Gl-i for i=0® l/2-1. The real part of Gi
represents the cosine (symmetric) components and the imaginary part
the sine (antisymmetric) components. (Note the symmetry is about i=0
not i=l/2.)
Any periodic component in gi will appear as a series of peaks in the
vector Gi, at the fundamental frequency and the harmonics
(integer multiples of the fundamental). The
relative amplitudes of these peaks will depend on the shape of the
periodic function. The amplitude of the Fourier components will be
given by Ö(Gi Gi*) and the phase angle by
tan-1(imag(Gi)/real(Gi)).
There are a number of problems associated with period searching using the
DFT. The first is the noise. If the time samples don't overlap
the vector Gi is the linear sum of the DFT of fi and the DFT of
the noise vector ni.
Gi=Fi+Ni
The power spectrum of the data set (square of the amplitude of Fourier
components) is therefore given by
GiGi*=(Fi+Ni)(Fi*+Ni*) = FiFi*+NiNi*+FiNi*+NiFi*
The first two terms on the righthand side are the power spectrum of the
signal and noise respectively. If the noise is uncorrelated with the signal
then (by definition) the second two cross-terms will be zero. So the data
power spectrum is usually a simple sum of the signal and noise power
spectra.
By the convolution theorem these power spectra are the Fourier transforms
of the auto-correlation functions of the time series of the signal and the
noise. If
cf(t¢)=
ó õ
f(t) f(t¢+t) dt
then
Cf(n) =
1
Ö(2 p)
ó õ
cf(t¢)exp(-i 2 pnt¢) dt¢=Ö(2 p) F(n) F*(n)
Notice that the complex conjugation has been introduced since we are taking
an auto-correlation not a convolution, (t¢-t ® t¢+t).
Or in discrete notation
cf = [fc] f
where [fc] is a circulant matrix generated from f (see
earlier section on modelling the instrument response).
The circulant for a crosscorrelation is the transpose of that required
for a convolution. Then
Cf = [t]-1cf = l([t]-1f)([t]-1f)*=l FF*
where l is the number of elements in the vector f.
Actually multiplication by the circulant form [fc] performs a
discrete circular convolution and the vectors must be padded out with zeros
to render the discrete result equivalent to the continuous form. A proof
of the above and some discussion of these details is given by Hunt (1971).
If the noise is white then there is no correlation between the noise in
different data samples and the auto-correlation function of the noise will be
a delta function. The power spectrum of the noise will then be a constant
since
d(t)=
1
2 p
ó õ
¥
-¥
exp(-i 2 pnt) dn
In practice the data vector (and hence the noise vector n)
only contain a finite number of samples so the sample auto-correlation function
of the noise vector is approximately a delta function with residual wings.
Likewise the noise power spectrum will be flat but will contain residual
fluctuations. If the noise is Gaussian with standard deviation s
then the mean power in the DFT spectrum
NN* is s2/l. Any mean offset in the
noise will only appear in the N0 term (sometimes referred to as the
DC term by analogy with AC theory). The power samples are distributed as
Chi-squared with 2 degrees of freedom (because they are the sum of the
squares of the real and imaginary parts of the spectrum, see for example
`An introduction to spectral analysis' J.N.Rayner). If the noise
is Poissonian (counting statistics) with a constant mean then
in the limit of large l the power
samples again are distributed as Chi-squared with 2 degrees of freedom
(see D.A.Leahy et al 1982). It should be noted that
because the Nyquist term (i=l/2) contains
only a real component this is distributed as Chi-squared with
1 degree of freedom.
If the signal process f(t) has a stochastic component then this will
also add to the mean level per frequency sample but this is seldom
white and hence the FF* will not be flat.
Any periodic component of f(t) will appear as peaks superimposed
on the continuum. The significance of such peaks will be determined by the
Chi-squared distribution of the continuum fluctuations. In order to
normalise these fluctuations the power should be multiplied by 2l/s2
so that the mean value of the Chi-squared distribution is 2 as expected.
Analysis of power spectra is complicated by the presence of a window function.
Since the continuous function g(t) is multiplied by the window function
w(t) in the real domain the spectrum of the signal is convolved by
the spectrum of the window function in the Fourier domain. In discrete
notation we have
G¢ = [Wc] G
where [Wc] is a circulant matrix constructed from the DFT of
the window function W. In the trivial case of continuous
coverage the circulant will be diagonal because W is
a delta function. However if there are missing real domain
samples the convolution will produce a mixing of Fourier domain samples
and nearby frequency samples will be correlated. The exact form of the
correlation depends on the shape of the window function.
If the spectral width (frequency resolution) of the window function
is Wr then the square of the coefficient of variation (the fractional
variance) of the power is
variance
(average power)2
= e2 »
1
l Dt Wr
=
Dn
Wr
When there are no gaps Wr=Dn = 1/l Dt so we
have e2=1 and the normalised power samples are distributed as
Chi-squared with 2 degrees of freedom. If gaps are introduced then
the spectral resolution gets worse (Wr gets larger with
off-diagonal elements in [Wc], the precise definition of
Wr depends on the exact form of the window function) and the
number of degrees of freedom is approximated by
k =
2
e2
»
2 Wr
Dn
(You should recall that the mean of a Chi-squared distribution is k,
the number of degrees of freedom, and the variance is 2k.)
To get the correct normalisation for the distribution you must
sum the power in the samples across the spectral width Wr and then
scale by 2(l-lg)/s2 where lg is the number of missing samples.
The number of frequency samples across Wr is
approximated by the ratio of the total number of evenly spaced samples in
the time series l divided by the number of exposed samples.
lr »
l
l-lg
Therefore after summing across the resolution bandwidth and scaling the
effective scaling factor is 2l/s2 as before. However the gaps
have reduced the coefficient of variation e and the power
spectrum has been smoothed.
As might be expected there is a direct trade between the signal to noise
of the power samples and the spectral resolution. This behaviour
of the power spectrum is discussed in detail by Blackman and Tukey (1958).
If the gaps in the data are large or frequent then the measured power
spectrum G¢ will be distorted. In particular periodic
peaks in G
will be accompanied by side peaks associated with the power
spectrum of the window function. Periodic components in the real domain
must be found by a matched filtering operation in the Fourier domain
looking for the characteristic shape of the power spectrum of the
window function in the spectrum.
If the times series is very long then the potential frequency resolution
is very high but may be compromised by gaps (or the capacity of the
DFT routine). In such a case it
may be better to split the data into a series of shorter observations
not so plagued by missing samples. Breaking the data into M smaller series
of equal length l/M we can take the power spectrum of each sub-set of data
and then add the power spectra samples together to construct the
mean power spectrum of the complete data set. If there are very few missing
data values in each sub-set
the normalised mean power samples will be distributed as Chi-squared with
2M degrees of freedom (e2=1/M) so the signal-to-noise will
be improved by order ÖM and the spectral sampling (resolution)
will be reduced to Dn = M/l Dt.
If there are a limited number of data samples and many gaps or if the
periodic modulation expected has a peculiar pulse shape (not sinusoidal)
then the DFT is not well suited to finding a periodic signal.
In such cases a better approach is
the method of epoch folding. Assuming a trial period of duration
t each time sample is converted into a phase by
dividing by the period and taking the remainder (modulo function) again divided
by the period. If the original
time samples have a natural resolution of Dt then to avoid
any beating between this sampling and the trial period the times
ti must be artificially perturbed by ±Dt/2 using
a random number generator. So assuming ri is a random number taken from a
distribution evenly distributed over the interval
-Dt/2 ® +Dt/2, we have phases
pi=modulo(ti+ri,t)/t
The weighted frequency distribution rj
(histogram) of these phase samples is
then accumulated by summing the data into np bins
spanning the phase space of 0® 1 using
the phase index for the ith time sample given by
j=integer(pi np) (=0® np-1)
As well as accumulating the data values g(i) using the index j the
unweighted frequency distribution or the occupancy uj
of each phase
bin is also accumulated. This tells us the exposure each phase bin has
received including the influence of data gaps. The mean data value for
each phase bin can then be calculated
-
r
j
=
rj
uj
The variance of this mean must be estimated from the errors on the data
samples s2i. If this is constant (si=s) then
s2j=
s2
uj
If there is a modulation in the data at the trial period t then
we expect the phase samples [`r]j to exhibit the pulse
shape. If not, then providing there are sufficient data samples
so that uj ¹ 0 and any secular trends occur over time scales
much larger than the trial period, we
expect the phase samples to be constant [`r]. We can test the
hypothesis that the folded data are constant using a statistic
St =
np-1 å
0
(
-
r
j
-
-
r
)2
s2j
In the absence of periodic or secular variations we expect St
to be distributed as c2 with np-1 degrees of freedom.
So to search for periodic modulation we calculate St for a large
number of trial periods and look for values above some specified
confidence level. In order to avoid missing significant peaks we must
choose periods corresponding to a frequency spacing 1/2T where T
is the duration of the data set (T=lDt) if there are l samples
at spacing Dt).
The choice of the number of phase samples np depends on what is
required of the analysis.
If the modulation is roughly sinusoidal and maximum detection
senstivity is needed then np=2 will suffice. Each phase sample will then
be the summation of approximately half the original data samples but the
fractional variance on each will be a factor l/2 that of the original
data sample. In the other extreme if np=l then the variance of each point
is on average the same as in the original data.
So the signal to noise for detection of a periodic variation by folding can be
increased by order Ö(l/2) compared with looking for the variation in the
raw data. It is precisely this signal to noise gain which makes both epoch
folding and the DFT so sensitive in finding periodic variations in
long stretches of data.
On the other hand if the pulse shape contains sharp steps or the
phase of the pulse is of prime interest then np must be chosen
commensurate with the mark-space ratio of the pulse and phase accuracy required.
Incidentally an easy way to find the phase of the modulation is
to calculate the first Fourier component of the folded profile.
R1=
1
l
å
j
exp(-i2pj/l)
-
r
j
Then the phase of the modulation (relative to t=0)
at period t is
Press W.H., Teukolsky S.A, Vetterling W.T. and
Flannery B.P., 1992, "Numerical Recipes in Fortran - The Art of Scientific
Computing", 2nd Ed., Cambridge University Press