University of Leicester
Department of Physics and Astronomy
Lecture Notes
Data Analysis Techniques

Dr. R. Willingale

Nov 12, 2007

Contents

1  Introduction
2  Modelling the instrument response
3  Systematic errors and noise
4  The data analysis problem
5  Direct inversion
6  Chi-squared fitting and confidence intervals
7  Linear, isoplanatic systems, stationary processes and Fourier filtering
8  Least squares fitting and the Wiener filter
9  Time series analysis
10  Fourier analysis of time series

1  Introduction

This course is a formal introduction to data analysis rather than just a collection of things you can do with data. Most data you will encounter are produced by some instrument designed for purpose and the basic philosophy I will present is that a very large fraction of data analysis can be described using the same formalism involving an instrument response coupled with systematic errors and noise.
Such a formal approach has advantages. When confronted by a data analysis problem you should try and describe it using the recipe I will provide. Having done this the solution may be immediately apparent, using a well established technique or some method developed in another discipline but equally applicable in the situation presented by your data. Even if the solution is not so obvious the various methods which might be applicable will be clear. It is often the case that new data sets are very similar to old except that some feature of the system is slighly different. Established methods may be applicable if you make some minor adjustment or some acceptable approximation. If all this fails you will have a succinct statement of your problem on which to base a novel solution. You will be well placed to invent your own data analysis techniques.

2  Modelling the instrument response

In general an instrument responds to a function in some input space and produces a data set in some output space. For example the objective of a camera takes the angular distribution of light falling on the aperture and converts it into an intensity distribution over the focal plane. In turn this distribution is converted into coloured dyes over the surface of a plastic film. The input and output space are VERY different but if the camera and film are well made and correctly used there is a tight relationship between the original light distribution and the processed result.
The response of a large class of instruments can be modelled using an integral equation of the form:

g(x¢,y¢,z¢) = ó
õ
ó
õ
ó
õ
f(x,y,z) h(x¢,y¢,z¢,x,y,z) dx dy dz
f(x,y,z) is a piecewise continuous function over the input space and g(x¢,y¢,z¢) is a continuous function over some output space. h(x¢,y¢,z¢,x,y,z) is the response function and in general depends on both the input and the output space. For simplicity we shall consider a 1-D system:

g(x¢) = ó
õ
f(x) h (x¢,x) dx
The extension to higher dimensions is normally straight forward but requires care and attention to details.
Note that if the input is an impulse f(x)=Ad(x-a) then using the properties of the Dirac delta function:

g(x¢) = A h(x¢,a)
The response function can be thought of as a normalised version of the output we expect from a perfect impulse input. In principle we can measure the response function h(x¢,a) by stimulating the system with an impulse input at x=a.
A perfect instrument would have a response:

h(x¢,x) = C d( x¢-x)
where C is just a scaling constant.
A linear instrument has a response such that if f1 Þ g1 and f2 Þ g2 then

f1 + f2 Þ g1 + g2
So, for instance, if we double the input level the output level is also doubled.
An isoplanatic system has a response which is only a function of the difference between the input and output coordinates:

h(x¢,x) = h(x¢- x)
This means that the shape or form of the response is the same for every input position. In this case the system equation becomes a convolution of faltung integral:

g(x¢) = ó
õ
f(x) h(x¢-x) dx
Although in most cases the underlying response of the instrument is properly modelled by an integral equation involving continuous functions the use of computers and digital techniques naturally leads to a digitization of the response. The input and output functions become vectors (or more generally matrices and tensors) and the integration becomes a summation:

gi =
å
j 
hij fj
We must be careful that this representation is a good approximation to the true response by proper choice of the samples in the input and output space. If the function h(x¢,x) changes rapidly as a function of x or x¢ then we must use a large number of samples across the change. We must also make sure that all finite (significant) contributions to the integral are included in the range covered by i and j. Now the instrument response has become a RESPONSE MATRIX. It is usually only an approximation to reality because of numerical integration errors. If the response is linear the elements of hji are constants and if it is isoplanatic then each row (or column) of the matrix is just a circular permutation of the first row (or column).

æ
ç
ç
ç
ç
ç
è
1
2
3
4
4
1
2
3
3
4
1
2
2
3
4
1
ö
÷
÷
÷
÷
÷
ø
Because of this the matrix multiplication performs an approximation to a circular convolution. To avoid wrap-around (aliasing) problems you must pad out the rows (or columns) with zeros.
If the instrument involves more than 1 dimension then the input and output matrices (or tensors) can be stacked to form vectors. The response will still be a matrix but will have a complicated banded structure representing the coupling or blurring between the different coordinates. So the simple digitized system equation is still valid but the interpretation of the indices i and j is rather complicated. If the the system is isoplanatic and the response is separable:

h(x¢,y¢,x,y)=hx(x¢-x)hy(y¢-y)
and the response matrix has a so called Toeplitz form such that hij=hmn if i-j=m-n. If the system response is isoplanatic but not separable then the stacked response matrix has a block Toeplitz form.

3  Systematic errors and noise

Whether we model our instrument using an integral equation or a summation of discrete elements it is always an idealisation. The real world includes something extra in the data that is not included in our prescription. These are, of course, the dreaded errors or instrument noise hated by undergraduates. Actually it is these errors or the noise which makes data analysis interesting and a challenge.
Systematic errors are generally things which we could/should have included in the system response but couldn't/didn't. They can be eliminated or at least reduced by calibration.
Noise is something which is inherent in the processes involved in the measurement system but which is not amenable to the instrument response prescription. It can't be included in the response function but must be allowed for in data analysis.
It is conventional to include both systematic errors and noise as an additive term in the response equation:

gi =
å
j 
hij fj + ni
If each ni is independent of its neighbours and independent of the input fj then the noise is said to be uncorrelated. In many cases this is a good approximation. Note that this does not mean that ni is independent of i. The noise can still be a function of position in the output space and uncorrelated.
In many modern instruments the noise arises directly from the inherent statistical nature of the processes involved. This can be thermal noise, for instance in an electrical resistance, or counting statistics when detecting individual particles, electrons, photons, neutrons etc..
The noise can come from the source itself in which case f(x) is essentially a probability density distribution or fj are probabilities if the samples are essentially integrals over increments in x.
The noise can be instrumental in which case h(x¢,x) (or hij) describes the mean behaviour of the system and ni is something which is added by the measurement process.
In many cases we can think of the output function g(x¢) as a probability distribution (or gi as probabilities) and then ni (or at least a component of ni) is just characteristic of the output of the instrument.
In any particle counting experiment we must encounter counting statistics. gi will then be the mean number of particles expected in a given sample x ® x+Dx over a given time interval. If the particles are independent (i.e. not bunched or anti-bunched) then the probability of detecting m=0,1,2,¼ events (particles) will be governed by the Poisson distribution:

p(m) = lm exp(-l)

m!
where l = m is the mean number detected (which is gi in our description above). The variance of the Poisson distribution is s2=l so it will depend on gi and is NOT independent of the object distribution. The output of a particle counting experiment will also contain spurious or background events which are not related to the desired signal. These must be included in ni and appear as an additive term in the mean of the Poisson distribution so gi=li=si+bi where si is the mean signal count expected in the ith sample and bi is the mean background count. The variance is then given by the total mean count si2=si+bi.
The Poisson distribution is a limiting case of the Binomial distribution when the probability q of detecting an event in a single trial is very small. If we make N trials then:

p(m)= N!

m!(N-m)!
qm(1-q)N-m
The mean of this Binomial distribution is m = Nq and the variance is s2=Nq(1-q).
Finally if N is very large and neither q or 1-q is too small then the binomial distribution becomes a continuous probability density, the normal or Gaussian distribution:

p(z)= 1


Ö

2p
exp(- z2

2
)
where z=[(m-Nq)/(Ö{Nq(1-q)})] a standardized continuous variable. It is of course the normal distribution that we usually use to model the additive noise ni in experiments which don't involve counting events:

p(ni)= 1

si
Ö

2p
exp -(ni-mi)2

2si2
where mi and si2 are the mean and variance of the noise in the ith sample.
In many problems it is necessary to consider a statistical model for the source and noise processes using moments or expectation values to characterise the underlying probability density function. In discrete notation using the sampled functions the first two moments, the mean and covariance are

uf= < f >

[ff] = < (f-uf)
~
(f-uf)
 
>
where < > denotes the expectation value and [()\tilde] the transpose. uf represents the mean as a function of position and [ff] describes the correlation between different positions.
A stochastic process is said to be stationary if uf is a constant vector and [ff] has a Toeplitz form (or block Toeplitz if we are dealing with a stacked set of multiple dimensions). This means that the correlation function of the process only depends on the difference in position rather than the absolute position.
A matrix closely related to the above moments is the correlation matrix

[cf]= < f 
~
f
 
> = [ff]+[uf 
~
uf
 
]

4  The data analysis problem

In the context of the system equation

g(x¢)= ó
õ
f(x) h(x¢,x) dx +n(x¢)
or

gi =
å
j 
hij fj + ni
we can state the problem posed in data analysis; given a set of measurements gi what can we say about the source vector fj or the source function f(x)?
If the noise function n(x¢) is independent of the unknown source function f(x) then we are trying to solve a Fredholm equation of the first kind. If n(x¢) is a function of f(x) then the integral equation is said to be a Fredholm equation of the second kind. In either case it is important to remember that in almost all experiments the discrete notation is an approximation to an integral equation.
If the instrument response is a complicated transformation (a Fourier transform say) or if the response is rather poor, then the major problem is to calculate a good estimate of fj given gi. This is usually called an inversion problem. How easy the inversion is depends on the form of the response matrix and the noise level. Good examples of inversion problems are aperture synthesis in radio astronomy or the correction for the optical blurring in the Hubble telescope. When performing an inversion the central question is how reliable is the result? What features of [^(f)] can we believe and which are due to noise?
If you are lucky (or if you don't make a mistake manufacturing the primary mirror) then the instrument response is good and it is easy to estimate fj given the data gi. Then the problem shifts to another plane. What can we deduce about the processes that produce f(x) given the measurements gi Þ fj? In this case we can wonder about the physical system beyond the immediate instrument and the problem is a modelling problem. Although the response hij may not present any difficulty the noise ni will still be a limiting factor. In this case we require to find good estimates of the model parameters and confidence limits on these parameters.

5  Direct inversion

We can attempt to invert the system equation

g=[h]f+n
by pre-multiplying both sides by the inverse of the response matrix

[h]-1 g=[h]-1[h]f+[h]-1n
If [h]-1 exists, [h]-1[h]=[I] the unit matrix and if there is no noise n=0 then this procedure will recover the original source vector

[h]-1g=f
Unfortunately the inverse [h]-1 is usually ill-conditioned (some of the elements of [h]-1 are very large because they are formed by division by very small numbers) or doesn't exist at all (the determinant is zero).
When we attempt the inversion the noise term [h]-1n blows up and dominates. What is needed is some pseudo inverse matrix which is optimized to recover as much of the source vector as possible in the presence of noise. In practice this means supressing the ill-conditioned elements in the true inverse [h]-1 and trying to produce a best estimate of the source vector [^(f)].
Apart from the limitations imposed by noise most data sets are very large and the instrument response matrix is therefore rather large although it may be fairly sparse. The accurate inversion of such large matrices is difficult and time consuming and in many cases is not practical.

6  Chi-squared fitting and confidence intervals

Least squares or maximum likelihood fitting can yield model parameter values, c1,c2,¼ which are is some way the best estimates of the true values given the data. However on their own these methods can't tell us, firstly, whether or not the model is a good interpretation (fit) of the data and, secondly, what confidence limits (errors) we should assign to the parameter values found.
The workhorse of methods for testing goodness of fit is the Chi-squared statistic. This can be used to fit a probability distribution to a measured data set, to test the goodness of fit in some fitting procedure, to yield confidence limits for derived parameters and forms the basis of a fitting procedure in its own right.
Chi-squared is based on the hypothesis that the optimum description of a set of data is that which minimises the weighted sum of the squares of the differences between the data values and predicted values. The variance of the fit is given by

v2= 1

N-k
å
wi ( gi-
^
g
 

i 
)2
where N is the number of data points, k is the number of parameters c1,c2,¼, [^g]i=y(x,c1,c2,¼) some fitting function and wi is a weighting factor for each data point:

wi =
1

s2i

1

N
å
1

s2i
This is a measure of the uncertainty on each data point normalized to the average uncertainty of all the points. Note that if all the data variances are constant s2i=s2 then wi=1 and v2 is simply a sum of square differences divided by N-k.
Chi-squared is defined as

c2= å
1

s2i
(gi-
^
g
 

i 
)2
So

c2

N-k
= v2

< s2i >
=c2n
where n = N-k is the number of degrees of freedom and

< s2i > = [ 1

N
å
1

s2i
]-1
the weighted sum of the variances. c2n is called the reduced Chi-squared.
If the fit is good we expect that v2 will be close to the underlying variance of the parent process and c2n » 1. If c2n is too large then the fit variance must be much larger than the expected value and the fit is poor. If c2n is too small then you may have under estimated the errors or included too many parameters in the fit.
We can attempt to quantify the goodness of fit using the expected probability density distribution of c2=z2

p(z2,n)=
(z2)[(n-2)/2]exp -z2

2

2[(n)/2] G( n

2
)
The integral of this function between z2=c2 and z2=¥

P(c2,n) = ó
õ
¥

c2 
p(z2,n) d z2
can be found tabulated in many statistics or data analysis books.
So if you search for parameter values c1,c2,¼ which produce the global minimum c2 you can estimate the probability that the c2min value found exceeds that value. That is if you performed the experiment an infinite number of times then some fraction of the data sets (fraction=P(c2,n)) is expected to yield a resulting c2 larger than the one found from the actual data set.
Unfortunately c2 combines two factors, how good the underlying function y(x,c1,c2,¼) describes the data AND the deviations between the best fit function found and the data set. For example you could achieve a c2n » 1 if the error estimates were too large and the function was a fairly bad fit, or the error estimates were too small and too many parameters were used to produce the fit.
The formal development and justification for the c2 fitting procedure outlined above is rather involved. This is discussed in some detail by Lampton, Margon and Bowyer (1976)  [14]. In the definition cited

c2= å
1

s2i
(gi-
^
g
 

i 
)2
the statistic will only be the true Chi-squared if the quantities [^g]i and s2i are the true population mean and variance values. In any experiment these are only known to nature and not the experimenter. In the idealized instance that they are the true values each term in the summation will be a sample of a random variable with zero mean and unit variance and in the limit of a very large number of samples (N® ¥) the statistic will be normally distributed. When the number of samples is small (or finite) the statistic has the sampling probability distribution of c2true with N degrees of freedom n = N as quoted above.
What we actually calculate is a c2min using a fitting function to yield [^g]i. It can be shown that this sampling statistic is distributed as c2 with n = N-k degrees of freedom when using k functional parameters. This minimization provides the best fit values for the parameters c1,c2,¼ and the c2min value.
Analysis shows that the difference between the c2true and c2min will also have a sampling distribution of c2 but with k degrees of freedom

DS = c2true-c2min=c2k
This difference is independent of the c2true value and the number of data points and only depends on the number of parameters used to generate the c2min.
What we want is the definition of a confidence region or volume in the parameter space so that we can assign errors or confidence ranges to the parameters c1,c2,¼. As we vary the parameter values about the minimum so we expect the Chi-squared to increase. If the increase exceeds some limiting value then the probability that the parameters are correct is small ( < 10% say). The limiting value of the increase is c2limit given by

P(climit2,k) = ó
õ
¥

c2limit 
p(z2,k) dz2 = 0.90
where k is the number of parameters we are varying. So if we fix one of the k parameters at a series of values about the minimum and for each of these fixed values we find the minimum Chi-squared by varying all the remaining parameters we can find the 90% confidence interval for that one parameter. We can do this for each parameter in turn to define the complete confidence region. This procedure ensures the independence of all the fitting parameters. It is not the same as fixing all the parameters at their best fit values and then varying just one parameter at a time. If this is done then 0.90=P(climit2,1) (i.e. k=1) and the confidence limits for each parameter are much smaller. In doing this you are not allowing for the interplay between the parameters in the fitting procedure. Actually if there are a large number of parameters it may be legitimate to assume constant values for some of the parameters and not consider them in the error analysis but just fix them at their best fit or some other values. This reduces the number k and the volume of the confidence region. However you must be consistent in this designation. Once a parameter is assumed constant you can't change your mind later, vary this parameter and start quoting confidence limits for it.
Although the Dc2 above is independent of c2min the procedure is only sensible if the initial c2min is reasonable. In the above we have implicitly assumed that c2min » c2true. There is no point in calculating confidence limits for a model which is poor to start with or if the error estimates are wrong.
It is usually possible to lower the c2min value by introducing more parameters into the fit. So if the fit is poor we can include a new feature into the fitting function to try and model the discrepancy. It is therefore legitimate to ask at what point is the introduction of another parameter no longer justified by the data? This is the purpose of the so called F-test.
If we take the ratio of two Chi-squared statistics then the sampling probability distribution is the F distribution

pf(z,n1,n2) = c21/n1

c22/n2
which has a functional form

pf(z,n1,n2) = G(n1/2+n2/2)

G(n1/2)G(n2/2)
æ
è
n1

n2
ö
ø
n1/2

 
z1/2(n1-2)

(1+zn1/n2)1/2(n1+n2)
When we construct such a ratio it is the integral of the distribution which is of most interest

P(F,n1,n2) = ó
õ
¥

F 
pf(z,n1,n2)dz
So to test for the necessity of an additional fitting parameter (or parameters) we take the ratio of the minimum Chi-squared values obtained with and without the new parameters. We then look up this F value in a tabulated form of the integrated probability above using the appropriate n1 and n2 values. If the ratio is larger (or smaller depending on which way up we take the ratio) than that expected at a given level of confidence then the inclusion of the new parameters is significant at that level. Note that P(F12,n1,n2) is not the same as P(F21,n2,n1) unless n1 and n2 are rather large so we should test both the ratio and its reciprocal taking care with the order of the n1 and n2 values.
If the inclusion of some parameters is not justified by a significant improvement in Chi-squared as defined by the F-test then there is no point in deriving confidence limits for those parameters or indeed even including them in the fit. Such parameters should be removed or fixed in the fitting procedure.

7  Linear, isoplanatic systems, stationary processes and Fourier filtering

The linear, isoplanatic system has a system response matrix with the Toeplitz (circulant) form. A circulant matrix is diagonalized by the Discrete Fourier Transform (DFT). Similarly the block Toeplitz form generated by stacking more than 1 dimensions is diagonalized by the multidimensional DFT. The elements of the 1-D DFT matrix are

tkj=exp(i2pkj /l)

tkj-1=exp(-i2pkj /l)/l
where i=Ö{-1} and l is the number of rows and columns. The properties of the complex exponential function lead to the Fast Fourier Transform algorithm (FFT) and so matrix diagonalisation (and inversion) for such a system can be performed very quickly.
Note that diagonalisation of the linear, isoplanatic system by the DFT can be thought of as the discrete convolution theorem the digital analogue of the familiar convolution theorem which holds for the Fourier transform.
Stationary processes have a covariance matrix with a Toeplitz form which is diagonalized by the DFT. By definition, the diagonal elements of the resultant matrix [Lf] form the power spectrum of the stochastic process. (When dealing with continuous functions rather than vectors the power spectrum is the Fourier transform of the autocorrelation function.)
Discrete Fourier filtering is carried out by taking the DFT of the data vector

G=[t]-1 g
multiplying each element by a filter weighting

G¢=W G
and then taking the inverse transformation

g¢=[t] G¢
If we construct a diagonal matrix from the weighting (filter) vector

Lii=Wi
we can define a matrix

[w]=[t][L][t]-1
and write the operation in the real domain as a matrix multiplication

g¢=[w]g
A consequence of the above is that if we decide to perform Fourier filtering on our data then we are implicitly treating the underlying processes as stationary and we are performing a convolution in the real domain. A Fourier filter treats all areas of the data plane equally; it performs the same correlation everywhere in the data space so it cannot adapt to changes in the noise or signal correlations or the instrument response as a function of position.

8  Least squares fitting and the Wiener filter

If we consider the inversion problem an estimate of the original source vector [^(f)] is required. The error between this estimate and the true vector is:

e=f-
^
f
 
Using the principle of least squares we want to minimize the mean square error

min  <
~
e
 
 e >
where [(e)\tilde] is the transpose of the error vector and < > indicates the expectation value of the product.
We want a matrix operator or filter [w] such that

^
f
 
=[w]g
Using the system equation to substitute for g and minimizing the square error yields:

[w]=[cf]
~
[h]
 
([h][cf]
~
[h]
 
+ [cn])-1
where [cf] and cn] are the correlation matrices of the source and the noise

[cf]= < f 
~
f
 
>

[cn]= < n 
~
n
 
>
and the noise is assumed to be independent of the source

[cnf]= < n 
~
f
 
> = [cfn]=[0]
This form of filter [w] was originally applied to time series by Wiener, 1949  [29] and is therefore called a Wiener Filter. It has been used successfully in image processing and other areas of data analysis following the work of Helstrom (1967).
It is only easy to implement the Wiener filter if a diagonalization transformation exists so it has mostly been used as a Fourier filter

[t]-1
^
f
 
= [ Lf][Lh*] ([Lh][Lf][Lh*]+[Ln])-1[t]-1 g
where the diagonal elements of [Lh] are the DFT of the response function and those of [Lf] and [Ln] are the power spectra of the source and noise processes (assumed to be stationary).
The Wiener filter constructed in this way attempts to restore the best estimate of the source vector (function) assuming a particular simple form for the system response, source statistics and noise.
Perhaps a more familiar application of the Principle of least squares is model fitting. Suppose the estimate [^f](x) (or if you prefer [^(f)]) is modelled by a function y(x,c1,c2,¼) where c1,c2,¼ are unknown parameters or coefficients which are often physical attributes of the source under study, temperature, density, etc.. Such a source function would lead to an estimated data function

^
g
 
(x¢)= ó
õ
y(x,c1,c2,¼) h(x¢,x) dx
The error function (vector) is then

e = g-
^
g
 
and we can again search for

min  <
~
e
 
 e >
varying the parameters c1,c2,¼ to find the best solution. For example if the instrument response is perfect (a delta function) and y=c1+c2x we get the familiar linear regression formulae

c1=

å
i 
gi
å
i 
xi2-
å
i 
xi
å
i 
xi gi

N
å
i 
xi2 -(
å
i 
xi)2

c2=
N
å
i 
xi gi -
å
i 
xi
å
i 
gi

N
å
i 
xi2 -(
å
i 
xi)2
where N is the number of data points.
Similar expressions can be found for the coefficients of higher order polynominals but if y is a complicated function or if the system response is not perfect then we must resort to some form of searching algorithm to find the least squares solution.

9  Time series analysis

Any set of observations of some variable taken at different times is called a time series. The times are usually equally spaced and form a continuous sequence but can be unequally spaced and often contain gaps because of viewing constraints or exposure problems. Each sample can be a snapshot of the variable or some form of average value taken over a time sample. In most cases the measurement instrument does not blur or mix times so that the response hij is diagonal with gaps in the data represented by missing elements along the diagonal. In some cases the variable we measure is a summation of different components which suffer different time delays and then off-axis elements in hij will represent the time delays.
The analysis of time series is the subject of many chapters in books on statistics and signal processing. In general the movements or fluctuations seen are classified into types
So the usual problem of time series analysis is trying to extract these features from the data and characterise them rather than deconvolving the instrument response. We can model the time series as a sample of a continuous function g(t) multiplied by a window function to represent gaps or exposure variations.

gi=g(ti) w(ti)
As with all data anaylsis, superimposed on top of the underlying behaviour, be it periodic, long term trend or irregular, is noise, so the sampling by the instrument is described by an integral equation of the form

g(ti)= ó
õ
+¥

-¥ 
h(ti,t) f(t) dt +n(ti)
If this process is effectively instantaneous then h(ti,t)=d(ti-t) but most instruments will integrate over some small time interval so that h(ti,t)=h(ti-t) where h(ti-t)=1 for 0 < ti-t < t and h(ti-t)=0 otherwise. Providing t is smaller than the time between samples then each gi will be independent of its neighbours. Within this model we want to characterise f(t) give gi.
Trend analysis is best done by linear regression, least squares fitting or Chi-squared fitting as described above. A Chi-squared test of the hypothesis that f(t) is constant is a good way of seeing whether there is evidence for variability of any sort in the data although it can be fooled if there is a very weak periodic signal present as you will see below.

10  Fourier analysis of time series

If the samples are equally spaced at Dt we can look for periodic behaviour using the discrete Fourier transform (DFT) as described under the section on linear, isoplanatic systems.

G = [t]-1 g
where for a vector of length l the forward transform matrix is

t-1kj= 1

l
exp(-i 2 pkj/l)
This expresses the vector g as the linear sum of Fourier components (eigenvectors) with each Gi corresponding to the frequency

ni = i

l Dt
where i=0,1,2,3,¼,l-1 and l is the total number of samples. Gi is complex but if gi are real (which is usually the case) then only the first l/2+1 samples of Gi will be required. The highest frequency in the data is represented by i=l/2 with a frequency of n = 1/(2Dt) the Nyquist frequency (2 samples per period). The remaining samples i=l/2+1® l-1 are a mirror image of the low frequency samples Gi=Gl-i for i=0® l/2-1. The real part of Gi represents the cosine (symmetric) components and the imaginary part the sine (antisymmetric) components. (Note the symmetry is about i=0 not i=l/2.)
Any periodic component in gi will appear as a series of peaks in the vector Gi, at the fundamental frequency and the harmonics (integer multiples of the fundamental). The relative amplitudes of these peaks will depend on the shape of the periodic function. The amplitude of the Fourier components will be given by Ö(Gi Gi*) and the phase angle by tan-1(imag(Gi)/real(Gi)).
There are a number of problems associated with period searching using the DFT. The first is the noise. If the time samples don't overlap the vector Gi is the linear sum of the DFT of fi and the DFT of the noise vector ni.

Gi=Fi+Ni
The power spectrum of the data set (square of the amplitude of Fourier components) is therefore given by

GiGi*=(Fi+Ni)(Fi*+Ni*) = FiFi*+NiNi*+FiNi*+NiFi*
The first two terms on the righthand side are the power spectrum of the signal and noise respectively. If the noise is uncorrelated with the signal then (by definition) the second two cross-terms will be zero. So the data power spectrum is usually a simple sum of the signal and noise power spectra.
By the convolution theorem these power spectra are the Fourier transforms of the auto-correlation functions of the time series of the signal and the noise. If

cf(t¢)= ó
õ
f(t) f(t¢+t) dt
then

Cf(n) = 1

Ö(2 p)
ó
õ
cf(t¢)exp(-i 2 pnt¢) dt¢=Ö(2 p) F(n) F*(n)
Notice that the complex conjugation has been introduced since we are taking an auto-correlation not a convolution, (t¢-t ® t¢+t). Or in discrete notation

cf = [fc] f
where [fc] is a circulant matrix generated from f (see earlier section on modelling the instrument response). The circulant for a crosscorrelation is the transpose of that required for a convolution. Then

Cf = [t]-1 cf = l([t]-1 f)([t]-1 f)*=l F F*
where l is the number of elements in the vector f. Actually multiplication by the circulant form [fc] performs a discrete circular convolution and the vectors must be padded out with zeros to render the discrete result equivalent to the continuous form. A proof of the above and some discussion of these details is given by Hunt (1971).
If the noise is white then there is no correlation between the noise in different data samples and the auto-correlation function of the noise will be a delta function. The power spectrum of the noise will then be a constant since

d(t)= 1

2 p
ó
õ
¥

-¥ 
exp(-i 2 pnt) dn
In practice the data vector (and hence the noise vector n) only contain a finite number of samples so the sample auto-correlation function of the noise vector is approximately a delta function with residual wings. Likewise the noise power spectrum will be flat but will contain residual fluctuations. If the noise is Gaussian with standard deviation s then the mean power in the DFT spectrum N N* is s2/l. Any mean offset in the noise will only appear in the N0 term (sometimes referred to as the DC term by analogy with AC theory). The power samples are distributed as Chi-squared with 2 degrees of freedom (because they are the sum of the squares of the real and imaginary parts of the spectrum, see for example `An introduction to spectral analysis' J.N.Rayner). If the noise is Poissonian (counting statistics) with a constant mean then in the limit of large l the power samples again are distributed as Chi-squared with 2 degrees of freedom (see D.A.Leahy et al 1982). It should be noted that because the Nyquist term (i=l/2) contains only a real component this is distributed as Chi-squared with 1 degree of freedom.
If the signal process f(t) has a stochastic component then this will also add to the mean level per frequency sample but this is seldom white and hence the F F* will not be flat. Any periodic component of f(t) will appear as peaks superimposed on the continuum. The significance of such peaks will be determined by the Chi-squared distribution of the continuum fluctuations. In order to normalise these fluctuations the power should be multiplied by 2l/s2 so that the mean value of the Chi-squared distribution is 2 as expected.
Analysis of power spectra is complicated by the presence of a window function. Since the continuous function g(t) is multiplied by the window function w(t) in the real domain the spectrum of the signal is convolved by the spectrum of the window function in the Fourier domain. In discrete notation we have

G¢ = [Wc] G
where [Wc] is a circulant matrix constructed from the DFT of the window function W. In the trivial case of continuous coverage the circulant will be diagonal because W is a delta function. However if there are missing real domain samples the convolution will produce a mixing of Fourier domain samples and nearby frequency samples will be correlated. The exact form of the correlation depends on the shape of the window function. If the spectral width (frequency resolution) of the window function is Wr then the square of the coefficient of variation (the fractional variance) of the power is

variance

(average power)2
= e2 » 1

l Dt Wr
= Dn

Wr
When there are no gaps Wr=Dn = 1/l Dt so we have e2=1 and the normalised power samples are distributed as Chi-squared with 2 degrees of freedom. If gaps are introduced then the spectral resolution gets worse (Wr gets larger with off-diagonal elements in [Wc], the precise definition of Wr depends on the exact form of the window function) and the number of degrees of freedom is approximated by

k = 2

e2
» 2 Wr

Dn
(You should recall that the mean of a Chi-squared distribution is k, the number of degrees of freedom, and the variance is 2k.) To get the correct normalisation for the distribution you must sum the power in the samples across the spectral width Wr and then scale by 2(l-lg)/s2 where lg is the number of missing samples. The number of frequency samples across Wr is approximated by the ratio of the total number of evenly spaced samples in the time series l divided by the number of exposed samples.

lr » l

l-lg
Therefore after summing across the resolution bandwidth and scaling the effective scaling factor is 2l/s2 as before. However the gaps have reduced the coefficient of variation e and the power spectrum has been smoothed. As might be expected there is a direct trade between the signal to noise of the power samples and the spectral resolution. This behaviour of the power spectrum is discussed in detail by Blackman and Tukey (1958).
If the gaps in the data are large or frequent then the measured power spectrum G¢ will be distorted. In particular periodic peaks in G will be accompanied by side peaks associated with the power spectrum of the window function. Periodic components in the real domain must be found by a matched filtering operation in the Fourier domain looking for the characteristic shape of the power spectrum of the window function in the spectrum.
If the times series is very long then the potential frequency resolution is very high but may be compromised by gaps (or the capacity of the DFT routine). In such a case it may be better to split the data into a series of shorter observations not so plagued by missing samples. Breaking the data into M smaller series of equal length l/M we can take the power spectrum of each sub-set of data and then add the power spectra samples together to construct the mean power spectrum of the complete data set. If there are very few missing data values in each sub-set the normalised mean power samples will be distributed as Chi-squared with 2M degrees of freedom (e2=1/M) so the signal-to-noise will be improved by order ÖM and the spectral sampling (resolution) will be reduced to Dn = M/l Dt.
If there are a limited number of data samples and many gaps or if the periodic modulation expected has a peculiar pulse shape (not sinusoidal) then the DFT is not well suited to finding a periodic signal. In such cases a better approach is the method of epoch folding. Assuming a trial period of duration t each time sample is converted into a phase by dividing by the period and taking the remainder (modulo function) again divided by the period. If the original time samples have a natural resolution of Dt then to avoid any beating between this sampling and the trial period the times ti must be artificially perturbed by ±Dt/2 using a random number generator. So assuming ri is a random number taken from a distribution evenly distributed over the interval -Dt/2 ® +Dt/2, we have phases

pi=modulo(ti+ri,t)/t
The weighted frequency distribution rj (histogram) of these phase samples is then accumulated by summing the data into np bins spanning the phase space of 0® 1 using the phase index for the ith time sample given by

j=integer(pi np)   (=0® np-1)
As well as accumulating the data values g(i) using the index j the unweighted frequency distribution or the occupancy uj of each phase bin is also accumulated. This tells us the exposure each phase bin has received including the influence of data gaps. The mean data value for each phase bin can then be calculated

-
r
 

j 
= rj

uj
The variance of this mean must be estimated from the errors on the data samples s2i. If this is constant (si=s) then

s2j= s2

uj
If there is a modulation in the data at the trial period t then we expect the phase samples [`r]j to exhibit the pulse shape. If not, then providing there are sufficient data samples so that uj ¹ 0 and any secular trends occur over time scales much larger than the trial period, we expect the phase samples to be constant [`r]. We can test the hypothesis that the folded data are constant using a statistic

St = np-1
å
0 
(
-
r
 

j 
-
-
r
 
)2

s2j
In the absence of periodic or secular variations we expect St to be distributed as c2 with np-1 degrees of freedom.
So to search for periodic modulation we calculate St for a large number of trial periods and look for values above some specified confidence level. In order to avoid missing significant peaks we must choose periods corresponding to a frequency spacing 1/2T where T is the duration of the data set (T=lDt) if there are l samples at spacing Dt).
The choice of the number of phase samples np depends on what is required of the analysis. If the modulation is roughly sinusoidal and maximum detection senstivity is needed then np=2 will suffice. Each phase sample will then be the summation of approximately half the original data samples but the fractional variance on each will be a factor l/2 that of the original data sample. In the other extreme if np=l then the variance of each point is on average the same as in the original data. So the signal to noise for detection of a periodic variation by folding can be increased by order Ö(l/2) compared with looking for the variation in the raw data. It is precisely this signal to noise gain which makes both epoch folding and the DFT so sensitive in finding periodic variations in long stretches of data.
On the other hand if the pulse shape contains sharp steps or the phase of the pulse is of prime interest then np must be chosen commensurate with the mark-space ratio of the pulse and phase accuracy required. Incidentally an easy way to find the phase of the modulation is to calculate the first Fourier component of the folded profile.

R1= 1

l

å
j 
exp(-i2pj/l)
-
r
 

j 
Then the phase of the modulation (relative to t=0) at period t is

ft = tan-1 (imag(R1)/real(R1))

References

[1]
Andrews H.C. and Hunt B.R., 19**, "Digital image restoration", Prentice Hall
[2]
Bryan R.K. and Skilling J., 1980, Mon.Not.R.astr.Soc.,190
[3]
Cash W., 1979, Ap.J., 228, 939
[4]
Bevington P.R. and Robinson D.K., 19**, "Data Reduction amd Error Analysis for the Physical Sciences", 2nd Edition, McGraw-Hill, Inc.
[5]
Blackman R.B. and Tukey J.W., 1958, "The measurement of power spectra", Dover publications Inc.
[6]
Brandstäter A., Swift J., Swinney H.L. and Wolf A., 1983, Phys. Rev. Lett. Vol. 51, No. 16, 1442
[7]
Eadie W.T., Drijard D., James F.E., Roos M. and Sadoulet B. "Statistical methods in experimental physics", 1971, North-Holland, Amsterdam
[8]
Hauser M.G. and Peebles P.J.E., 1973, Ap. J. 185, 757-785
[9]
Helstrom C.W.,1967, J.Opt.Soc.Am., 57, 3, 279
[10]
Hunt B.R., 1971, IEEE Trans. AU-19, 4
[11]
Hunt B.R., 1977, "Bayesian methods in nonlinear digital image restoration", IEEE Trans. Computers, C-26, 3, 219
[12]
Kraft R.P., Burrows D.N. and Nousek J.A., 1991, Ap. J. 374, 244-355
[13]
Lahav O., Ficher K.B., Hoffman Y., Scharf C.A. and Zaroubi S., 1994, Ap.J. 423, L93-L96
[14]
Lampton M., Margon B. and Bowyer S., 1976, Ap.J., 208, 1177
[15]
Leahy D.A., Darbro W., Elsner R.F., Weisskopf M.C., Sutherland P.G., Kahn S.M. and Grindlay J.E., 1983, Ap. J., 266, 160
[16]
Lichtenberg A.J. and Lieberman M.A., 1983, "Regular and stochastic motion", Springer-Verlag
[17]
Lucy L.B., 1974, A.J., 79, 745
[18]
Packard N.C., Crutchfield J.P., Farmer J.D. and Shaw R.S., 1980, Phys. Rev. Lett. Vol. 45, No. 9, 712
[19]
Peitgen H.-O. and Saupe D. Eds., 1988, "The science of fractal images", Springer-Verlag
[20]
Pollock A.M.T., Bignami G.F., Hermsen W., Kanbach G., Lichti G.G., Masnou J.L., Swanenburg B.N. and Wills R.D., 1981, Astron. Astrophys. 94, 116-120
[21]
Press W.H., Teukolsky S.A, Vetterling W.T. and Flannery B.P., 1992, "Numerical Recipes in Fortran - The Art of Scientific Computing", 2nd Ed., Cambridge University Press
[22]
Rayner J.N., 1971, "An introduction to spectral analysis", Pion Limited
[23]
Richardson B.H., 1972, J.Opt.Soc.Am., 63, 55
[24]
Roux J.C., Rossi A., Bachelart S. and Vidal C., 1980, Phys. Lett. Vol. 77A, No. 6, 391
[25]
Scharf C., Hoffman Y., Lahav O. and Lynden-Bell D., 1992, Mon.Not.R.astr.Soc., 256, 229-237
[26]
Shepp L.A. and Vardi Y., 1979, IEEE Trans. Medical Imaging, MI-1, 113
[27]
Spiegel M.R., 1961, "Statistics", Schaum's Outline Series, McGraw-Hill
[28]
Spiegel M.R., 1974, "Fourier Analysis", Schaum's Outline Series, McGraw-Hill
[29]
Wiener N., 1949, "The extrapolation, interpolation and smoothing of stationary time series", p84, John Wiley and Sons, Inc.,
[30]
Wilks S.S., 1938, Ann. Math. Stat. 9, 60
[31]
Wolf A., Swift J.B., Swinney H.L. and Vastano J.A., 1985, Physica 16D, 285-317
[32]
Wright E.L., Bennett C.L., Górski K., Hinshaw G. and Smoot G.F., 1996, Ap.J. 464, L21-L24
[33]
Wright E.L., Smoot G.F., Bennett C.L. and Lubin P.M., 1994a, Ap.J., 436, 443
[34]
Yu J.T. and Peebles P.J.E., 1969, Ap. J. 158, 103-113



File translated from TEX by TTH, version 3.74.
On 12 Nov 2007, 08:43.