hi!
first of all, i think you have to decide whether FFT is what you want.
FFT analyses data in a rectangular window.
it "interprets" the data as if the window was repeated in a loop over and over again.
so it's not really made for normal audio data.
this limitation can be overcome, but it takes a bit more than just choosing a good window-size.
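to make the "repeated in a loop" point a bit more concrete, here is a small sketch. it compares a sine that fits the window exactly with one that doesn't. `naive_dft` and `strong_bins` are made-up helper names, and the slow O(N^2) transform is only there to keep the example self-contained (a real FFT package gives the same numbers, just faster).

```python
import cmath
import math

def naive_dft(x):
    """Plain O(N^2) DFT, for illustration only; use a real FFT package in practice."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def sine(cycles, n):
    """n samples of a sine that completes `cycles` periods inside the window."""
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

def strong_bins(x, threshold=1.0):
    """Count the frequency bins whose magnitude exceeds `threshold`."""
    return sum(1 for c in naive_dft(x) if abs(c) > threshold)

N = 64
clean = strong_bins(sine(4.0, N))  # loops seamlessly: energy sits in one +/- bin pair
leaky = strong_bins(sine(4.5, N))  # does not loop seamlessly: energy smears over many bins
print(clean, leaky)
```

the second signal is still a single pure tone, but because the window doesn't contain a whole number of periods, the looped version has a kink at the seam and the transform reports energy in lots of bins.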
next thing is, it makes a difference if you only want to analyze data and display something useful,
or if you want to add some processing steps, maybe change the data in the frequency-domain and do the inverse transform.
if you only want to display something without much processing of the data, it should be enough to just cut it into pieces and process those.
the size depends on how much resolution in the frequency domain you need.
if you make chunks of 1 second each, you'll get 1Hz resolution.
if you make your chunks 1/100 second each, you'll get 100Hz resolution.
if you want to analyze the full audio-spectrum (including bass) you probably need big chunks like 1/10 of a second or larger.
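the chunk-size arithmetic above, as a tiny sketch. the function names are made up for this example and the 22050 Hz sample rate is only an assumption; plug in your own.

```python
def bin_spacing_hz(chunk_seconds):
    """Distance between neighbouring frequency bins for a chunk of this duration."""
    return 1.0 / chunk_seconds

def chunk_samples(sample_rate_hz, chunk_seconds):
    """Number of samples in a chunk of this duration (rounded to a whole sample)."""
    return round(sample_rate_hz * chunk_seconds)

SAMPLE_RATE = 22050  # assumed sample rate, not taken from the question

for seconds in (1.0, 0.1, 0.01):
    n = chunk_samples(SAMPLE_RATE, seconds)
    print(f"{seconds} s chunk: {n} samples, {bin_spacing_hz(seconds):g} Hz per bin")
```

so with 1/10 second chunks the bins are 10 Hz apart, which is about as coarse as you can go if you still want to tell bass notes from each other.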
if you want to process the data in some way and need accurate results, meaning you don't want to miss anything that might fall onto window bounds, you have to make sure every sample in the processed audio-stream appears in at least 2 chunks. better 3.
one way to do this is to
1) cut the data into chunks of N length.
2) take the first 3 chunks
3) multiply the first chunk with a linear ramp from 0 to 1 (ascending)
4) multiply the third with a linear ramp from 1 to 0 (descending)
5) form a big chunk of 3*N length.
6) FFT the big chunk
7) do whatever you want with it
8) advance by N samples (not 3*N!) and goto 2)
this is called "windowing" - in this example you would be using a very simple window, but that usually doesn't hurt.
( of course you can always use 5, 7, 9... chunks to form one big chunk (if you require more resolution on the time-axis), but using more than 3 you should use another windowing technique, and things start to get complicated from there :) )
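steps 1) to 5) and 8) from the list above could look roughly like this. it's only a sketch: `ramp_up`, `trapezoid_window` and `big_chunks` are names i made up, and the FFT itself (step 6) is left to whatever package you use.

```python
def ramp_up(n):
    """Linear ramp rising from 0 towards 1 over n samples."""
    return [t / n for t in range(n)]

def trapezoid_window(n):
    """Window for one big chunk of 3*n samples: ramp up, flat, ramp down."""
    up = ramp_up(n)
    down = [1.0 - r for r in up]  # descending ramp built from the ascending one
    return up + [1.0] * n + down

def big_chunks(samples, n):
    """Yield windowed big chunks of 3*n samples, advancing by n (not 3*n) each time."""
    window = trapezoid_window(n)
    for start in range(0, len(samples) - 3 * n + 1, n):
        segment = samples[start:start + 3 * n]
        yield [s * w for s, w in zip(segment, window)]

# each yielded big chunk is what you would hand to the FFT routine
signal = [1.0] * 20
frames = list(big_chunks(signal, 4))
print(len(frames), len(frames[0]))
```

trailing samples that don't fill a complete big chunk are simply dropped here; in real code you would probably pad the end with zeros instead.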
also if you change the data in the frequency domain and inverse transform it back, you'll get no audible "clicks" which you usually do get when you use the plain rectangular window.
when you inverse transform the data back to audio-samples, you get 3*N samples again.
to "reconstruct" the original stream, you just have to add the overlapping chunks, that is, the ones you "ramped" before the FFT.
you don't have to modify them in any way, because the linear ramps make overlapping chunks that reconstruct the original signal when added together again.
(to get a perfect reconstruction you should use a 0-1 (ascending) ramp for step 4) too, and take x=x-(x*ramp(t)) instead of just x=x*other_ramp(t) - otherwise you'll get slight rounding-errors).
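here is a rough sketch of that reconstruction. there is no FFT in it at all: the big chunks go straight from `analyse` to `reconstruct` (both names made up), which stands in for "FFT, change nothing, inverse FFT". the descending ramp is built as x - x*ramp(t), as suggested above.

```python
import math

def ramp(n):
    """Ascending linear ramp from 0 towards 1 over n samples."""
    return [t / n for t in range(n)]

def analyse(samples, n):
    """Build windowed big chunks of 3*n samples, advancing by n samples each time."""
    r = ramp(n)
    frames = []
    for start in range(0, len(samples) - 3 * n + 1, n):
        first = [x * w for x, w in zip(samples[start:start + n], r)]
        middle = samples[start + n:start + 2 * n]
        third_raw = samples[start + 2 * n:start + 3 * n]
        third = [x - x * w for x, w in zip(third_raw, r)]  # complementary ramp
        frames.append(first + middle + third)
    return frames

def reconstruct(frames, n):
    """Add the ramped parts that cover the same n samples.

    The falling end of frame k and the rising start of frame k+2 overlap exactly.
    """
    out = []
    for k in range(len(frames) - 2):
        falling = frames[k][2 * n:]
        rising = frames[k + 2][:n]
        out.extend(a + b for a, b in zip(falling, rising))
    return out

n = 4
samples = [math.sin(0.1 * i) for i in range(40)]
frames = analyse(samples, n)
rebuilt = reconstruct(frames, n)  # corresponds to samples[2*n : 2*n + len(rebuilt)]
```

note that this only adds the two ramped copies. if you also add the un-ramped middle copy of each chunk, the three weights sum to 2 instead of 1, so you would have to halve the result.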
one other thing:
i'd not recommend re-implementing any FFT algorithm, since there are plenty of good open-source packages available.
another thing you could consider (if you want maximum accuracy in the frequency domain) is a wavelet analysis.
but this slows things down considerably, since you have to do far more computations if you need frequency-resolution.
on the other hand, if you only need the power in every octave (meaning one octave resolution), wavelet analysis can be very fast, many times faster than any FFT approach.
one big plus of wavelets is you can choose your resolution for the frequency AND for the time-axis easily.
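to give an idea of the "power in every octave" case, here is a minimal sketch using the Haar wavelet. the choice of Haar is my assumption (it is just the simplest wavelet, a real analysis would probably use a smoother one), and `haar_octave_energy` is a made-up name.

```python
import math

def haar_octave_energy(samples):
    """Energy per octave band from an orthonormal Haar wavelet transform.

    len(samples) must be a power of two. Returns the detail-band energies,
    highest octave first, followed by the energy of the final approximation.
    """
    approx = list(samples)
    energies = []
    s = math.sqrt(0.5)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) * s for a, b in pairs]  # what changes between neighbours
        approx = [(a + b) * s for a, b in pairs]  # half-length smoothed signal
        energies.append(sum(d * d for d in detail))
    energies.append(approx[0] ** 2)
    return energies

# a signal alternating +1/-1 lives entirely in the highest octave band
signal = [1.0, -1.0] * 8
bands = haar_octave_energy(signal)
```

every pass works on half as many samples as the one before, so the total work grows only linearly with the signal length.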
so. back to fft.
for 2^n sized windows i always use ooura's FFT package available here:
http://momonga.t.u-tokyo.ac.jp/~ooura/fft.html
if you need other sizes than powers of 2 i can't recommend an FFT package, but there are plenty that work with arbitrary sizes.
i hope this helped at least a little bit,
bye,
--hustbaer
P.S.: you should NOT interpolate the audio-data in any way, since you would lower the precision of the result.
just take 22kHz as 22kHz and process it.
unless you know exactly what you are doing it makes absolutely no sense to interpolate.
if you must for some reason, you should at least use some simple digital filter, but that's a completely different topic.