## Saturday, November 8, 2014

### FFT: the Cooley-Tukey algorithm for general composite lengths

The Cooley-Tukey fast Fourier transform (FFT) algorithm is probably the most well-known FFT algorithm. Their paper "An algorithm for the machine calculation of complex Fourier series" was published in 1965, and it is said to have caused the boom of digital signal processing (DSP) in the 1970s. Note, however, that the algorithm was already known to Gauss at the beginning of the 19th century. The algorithm is based on recursively splitting the problem into smaller sub-problems, and thereby decreasing the computational complexity. In its most well-known form (the radix-2 FFT), the algorithm reduces the complexity from $N^2$ to the famous $N\log(N)$ for a discrete Fourier transform (DFT) of length $N=2^M$.

In this article I want to discuss the Cooley-Tukey algorithm for general composite $N=N_1N_2$. The easiest way to see how the algorithm works is to use index mapping. Consider the length-$N$ DFT of a sequence $x[n]$: $$X[k]=\sum_{n=0}^{N-1}x[n]W_N^{nk}\tag{1}$$ with $W_N=e^{-j2\pi/N}$. Let's assume that $N$ can be factored as $N=N_1N_2$. Now use the following index mapping $$n=n_1N_2+n_2,\quad n_1=0,1,\ldots, N_1-1,\quad n_2=0,1,\ldots,N_2-1\\ k=k_1+N_1k_2,\quad k_1=0,1,\ldots, N_1-1,\quad k_2=0,1,\ldots,N_2-1\tag{2}$$ Plugging (2) into (1) gives $$X[k_1+N_1k_2]=\sum_{n_2=0}^{N_2-1}\sum_{n_1=0}^{N_1-1}x[n_1N_2+n_2]W_N^{(n_1N_2+n_2)(k_1+k_2N_1)}\tag{3}$$ The complex exponentials in (3) can be simplified as follows: $$W_N^{(n_1N_2+n_2)(k_1+k_2N_1)}=W_N^{N_2n_1k_1}W_N^{N_1N_2n_1k_2}W_N^{n_2k_1}W_N^{N_1n_2k_2}=\\= W_{N_1}^{n_1k_1}\cdot 1\cdot W_{N}^{n_2k_1}W_{N_2}^{n_2k_2}\tag{4}$$ because $W_N^{N_1}=W_{N_2}$, $W_N^{N_2}=W_{N_1}$, and $W_N^{N_1N_2}=W_N^N=1$. Plugging (4) back into (3) and rearranging terms gives $$X[k_1+N_1k_2]=\sum_{n_2=0}^{N_2-1}\left[W_N^{n_2k_1}\left(\sum_{n_1=0}^{N_1-1}x[n_1N_2+n_2]W_{N_1}^{n_1k_1}\right)\right]W_{N_2}^{n_2k_2}\tag{5}$$ Equation (5) can be interpreted as follows: the result can be computed by computing $N_2$ length-$N_1$ DFTs of subsampled versions of the input sequence $x[n]$ (this is the inner sum in (5)), then multiplying the result by twiddle factors $W_N^{n_2k_1}$, and finally computing $N_1$ length-$N_2$ DFTs (the outer sum in (5)). This procedure is most easily visualized using matrices:
1. read the data row-wise into a $N_1\times N_2$ matrix
2. compute DFTs over all $N_2$ columns (this can be done in-place)
3. multiply the result element-wise with twiddle factors
4. compute DFTs over all $N_1$ rows (in-place)
5. read out the data column-wise to get the final length-$N$ result
Let's see if and how this procedure reduces the complexity of computing a length-$N$ DFT. Direct computation of a length-$N$ DFT requires $N^2$ complex multiplications. With the above procedure we need $N_2$ DFTs of length $N_1$ ($N_2N_1^2$ complex multiplications), $N_1N_2$ complex multiplications for the twiddle factors, and $N_1$ DFTs of length $N_2$ ($N_1N_2^2$ complex multiplications), resulting in a total of $$N_2N_1^2+N_1N_2+N_1N_2^2 = N_1N_2(N_1+N_2+1) = \\ = N(N_1+N_2+1) < N^2,\quad\text{for } N_1,N_2 > 2$$ So the procedure results in reduced complexity compared to the direct computation of a length-$N$ DFT. If the factors $N_1$ and $N_2$ themselves can be factored, then the method can be used recursively, which will result in a greater reduction of complexity. E.g., if $N=2^M$ this results in the famous radix-2 version of the Cooley-Tukey algorithm, which is discussed in most DSP textbooks.

In practice, an FFT routine which implements $N=2^M$ will be sufficient most of the time, because in most cases one is free to choose the DFT length, and the data can usually be zero-padded to the next power of two. However, there are cases where a non-power of 2 length is required, such as in 3GPP Long Term Evolution (LTE), where for a transmission bandwidth of 15MHz an FFT length of 1536 is required. Suppose that on a certain platform you have an FFT routine for $N=2^M$. In this case, since $1536=3\times 512$, the representation (5) can be used to split up the computation of the DFT in $3$ FFTs of length $512$, which can be computed by the available FFT routine, and $512$ length-3 FFTs, the implementation of which is trivial.

This little octave/Matlab script illustrates the procedure for the case $N=1536$:

N = 1536;         % non-power of 2 FFT length used in LTE
N1 = 512; N2 = 3; % 512 x 3 = 1536
x = randn(N,1);   % some input signal

% generate N1 x N2 matrix of input data, reading them row wise
X = reshape(x,[N2,N1]).';

% DFT over each column
X = fft(X);

% Multiplication with twiddle factors
X = X.*exp(-1i*2*pi/N*(0:N1-1)'*(0:N2-1));

% DFT over each row
X = (fft(X.')).';

X = X(:);
So even if you don't want to program your own FFT algorithm it's very useful to understand the Cooley-Tukey method because it can be used to implement non-power-of-two FFTs using basic (power-of-2) FFT routines, if the FFT length $N$ is composed of only a few (small) factors (other than 2), and this will be the case for almost all practical applications.