I'd have thought that you could gather statistics from the noise sample and use those to decide.
For example, with the mean and standard deviation of the noise power in each bin, you could check whether it approximately follows a Gaussian (normal, bell-curve) probability distribution, as the Central Limit Theorem suggests it would for many noise sources. (You might find that amplitude, rather than power, fits a normal distribution, though, so analyse histograms of some real noise samples to see which works best.)
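As a rough sketch of gathering those per-bin statistics (Python, standard library only): the function names are made up for illustration, and the naive DFT is only there to keep the example self-contained. A real implementation would use a proper FFT library and a window function.

```python
import cmath
import math
import statistics

def power_spectrum(frame):
    """Power in each non-negative-frequency bin of one frame (naive O(n^2) DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        powers.append(abs(coeff) ** 2)
    return powers

def noise_statistics(noise_frames):
    """Return (means, stdevs) of the power in each bin across the noise frames.

    Needs at least two frames, because the sample standard deviation is used.
    """
    spectra = [power_spectrum(frame) for frame in noise_frames]
    per_bin = list(zip(*spectra))  # regroup: one tuple of powers per bin
    means = [statistics.mean(values) for values in per_bin]
    stdevs = [statistics.stdev(values) for values in per_bin]
    return means, stdevs
```

The same per-bin lists could be fed into a histogram to see whether power or amplitude looks closer to a bell curve.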
If you take a particular frequency bin, you could then estimate whether its contents are within the statistical limits you'd expect. For a Gaussian, about 97.7% of pure-noise values fall below the mean + 2 standard deviations, and about 99.9% below the mean + 3 standard deviations, so power under such a limit is consistent with being just noise and no appreciable signal. Then you can either silence the bin or subtract something like the mean noise power in a way that's a little more gradual and graceful (meaning you might reduce the tinkling or burbling that some Noise Reduction algorithms introduce, but at the expense of slightly more noise).
If, however, the power in that bin is higher than the expected noise, you might assume that it contains a signal you wish to pass as well as the expected amount of noise, so you could pass it unchanged or with the noise subtracted. Subtracting some proportion of the expected noise might be good if it's only a fraction louder, thus reducing the tinkling caused by suddenly letting noise through in one frequency bin just because it was among the 2% or so of noise that exceeds 2 standard deviations.
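To make the decision rule concrete, here is a minimal sketch in Python. The function name, the default of 2 standard deviations and the two option flags are my own assumptions for illustration, not taken from any existing NR implementation.

```python
def process_bin_power(power, noise_mean, noise_std, k=2.0,
                      silence_below=False, pass_above=False):
    """Decide what to do with the power found in one frequency bin.

    power       -- measured power in this bin for the current frame
    noise_mean  -- mean noise power for this bin (from the noise sample)
    noise_std   -- standard deviation of noise power for this bin
    k           -- number of standard deviations used as the limit (assumed)
    """
    threshold = noise_mean + k * noise_std
    if power <= threshold:
        # Within the expected range for noise alone.
        if silence_below:
            return 0.0
        # Gentler option: subtract the mean noise power, never going negative.
        return max(power - noise_mean, 0.0)
    # Louder than noise alone would normally be: assume signal plus noise.
    if pass_above:
        return power
    return power - noise_mean
```

With both flags left at False the result is continuous across the threshold, which is one way of avoiding a sudden jump when a bin only just exceeds the limit.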
In my previous incarnation as DickD I made a couple of posts regarding NR techniques a few years ago.
One was regarding possible use of psychoacoustics to reduce the amount of NR applied during loud passages (in which the noise would be masked anyway).
The other was a method of blind-testing NR algorithms by superposing tape hiss (arithmetic add or subtract, for example) onto a CD recording and providing a different sample of the same tape's noise to seed the NR statistics, allowing the use of ABX or ABC/HR. In the second post in that thread, 2Bdecided pointed out quite correctly that the satisfying string-slapping transients of the original double-bass part were greatly diminished by all the NR settings I tried out. Atonal transients like these have a relatively white spread spectrum with rather low power spectral density in each frequency bin, making them look similar to noise, and putting them in danger of being removed or severely damaged by NR algorithms. Transients like these give a lot of the enjoyment to music and make me reluctant to use the CoolEdit96 and Audacity NR filters unless the sound is really bad.
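The mixing step for that kind of test is simple enough to sketch in Python. I'm assuming 16-bit samples held as plain lists of integers, and the function names are hypothetical; reading and writing the audio files is left out.

```python
def split_noise(noise, fraction=0.5):
    """Split one noise recording into a part to mix in and a separate
    part used only to seed the NR statistics."""
    cut = int(len(noise) * fraction)
    return noise[:cut], noise[cut:]

def add_noise(clean, noise, gain=1.0):
    """Add noise to a clean 16-bit recording sample by sample, with clipping."""
    if len(noise) < len(clean):
        raise ValueError("noise segment is shorter than the clean recording")
    mixed = []
    for sample, hiss in zip(clean, noise):
        value = int(round(sample + gain * hiss))
        mixed.append(max(-32768, min(32767, value)))  # clip to the 16-bit range
    return mixed
```

The clean original, the noisy mix and each NR output can then be compared blind, since the clean reference is known exactly.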
Thoughts on approach
I'm sure that working at a bit depth greater than 16 bits (e.g. 24-bit), then dithering back to 16-bit at the end (if required, or remaining in 24-bit for further processing), is the right thing to do.
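For the final step back to 16-bit, a minimal Python sketch of triangular (TPDF) dither might look like this. TPDF is my assumption here as a common choice; noise-shaped dither would be another option, and the function name is made up.

```python
import math
import random

def dither_24_to_16(sample24, rng=random):
    """Reduce one 24-bit integer sample to 16-bit using TPDF dither.

    The dither is the sum of two uniform values, each spanning one 16-bit
    LSB, giving a triangular distribution over +/- 1 LSB.
    """
    scaled = sample24 / 256.0  # 24-bit -> 16-bit scale (8 bits dropped)
    tpdf = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
    quantised = math.floor(scaled + tpdf + 0.5)  # round to the nearest integer
    return max(-32768, min(32767, quantised))    # clip to the 16-bit range
```

Averaged over many samples, the dithered output tracks the 24-bit value rather than sticking to the nearest 16-bit step, which is the point of dithering.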
In general, my inclination would be to try a smoothly varying attenuation in each frequency bin as the power spectral density (PSD) in a bin varies close to the expected noise PSD or a few standard deviations from it, thus making more natural distortions than a hard cut-off would produce. (I guess such an attenuation should aim to scale down the real and imaginary components corresponding to each frequency bin by the same factor, that factor being unique to that frequency bin).
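A minimal Python sketch of such a smoothly varying attenuation follows. The smoothstep curve and the limits of 1 and 3 standard deviations are assumptions on my part; the right curve and limits would need to be found by listening tests.

```python
def smooth_gain(power, noise_mean, noise_std, k_low=1.0, k_high=3.0):
    """Gain between 0 and 1 that rises smoothly as the bin power moves from
    noise_mean + k_low*std up to noise_mean + k_high*std."""
    low = noise_mean + k_low * noise_std
    high = noise_mean + k_high * noise_std
    if high <= low:
        # Degenerate case (e.g. zero standard deviation): plain hard cut-off.
        return 0.0 if power <= low else 1.0
    t = min(max((power - low) / (high - low), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)  # smoothstep: zero slope at both ends

def attenuate(coeff, gain):
    """Scale the real and imaginary parts by the same factor, so the phase of
    the bin is left unchanged."""
    return complex(coeff.real * gain, coeff.imag * gain)
```

Each bin of each frame would get its own gain, worked out from that bin's noise statistics.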
I guess that a relatively short time window (for 44100 Sa/s PCM audio, perhaps a 1024-sample FFT, which is about 23 ms, maybe overlapping the previous window by 512 samples?) would probably work reasonably well, though it needs testing. This is approximately the time-frequency resolution of the human ear (this is distinct from the interaural time resolution of microseconds that gives spatial cues in binaural hearing) and might help preserve transients a little better than CoolEdit's typical 8192-sample NR settings. Then again, perhaps lessons could be learnt from the transient detection and short windows used in lossy compression schemes.
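To illustrate the framing, here is a minimal Python sketch of 1024-sample windows with a 512-sample hop and overlap-add reconstruction. The periodic Hann window is my assumption; it is convenient because two half-overlapped copies sum to exactly 1. The per-bin processing would go between the two functions.

```python
import math

def hann(size):
    """Periodic Hann window: w[n] + w[n + size/2] == 1 for 50% overlap."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / size) for i in range(size)]

def split_frames(signal, size=1024, hop=512):
    """Cut the signal into windowed, overlapping frames."""
    window = hann(size)
    return [[s * w for s, w in zip(signal[start:start + size], window)]
            for start in range(0, len(signal) - size + 1, hop)]

def overlap_add(frames, hop=512):
    """Rebuild a signal by summing frames at their original positions."""
    size = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + size)
    for index, frame in enumerate(frames):
        for i, value in enumerate(frame):
            out[index * hop + i] += value
    return out
```

With no processing in between, everything except the first and last half-window comes back unchanged, which is a handy sanity check before adding any NR.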
If quality is more important than processing time, I guess it's even possible to analyze both the noise sample and the signal+noise audio with two different window lengths. We could probably detect and preserve transients in the desired signal better because they would stand out above the noise floor better over a shorter window length, and we could even compare the degree by which the signal appears to be above the expected noise with two different window lengths that overlap in time.
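One very crude way to try the two-window comparison, sketched in Python: it works on whole-frame energy in the time domain rather than per bin, which is a simplification of mine, and the window lengths, the factor of 2 and the function names are all assumptions.

```python
def energy_ratio(frame, noise_power_per_sample):
    """Frame energy relative to the energy expected from noise alone."""
    energy = sum(x * x for x in frame)
    return energy / (len(frame) * noise_power_per_sample)

def looks_transient(signal, centre, noise_power_per_sample,
                    short=256, long=2048, factor=2.0):
    """Flag a likely transient when a short window centred on `centre` stands
    further above the noise floor than a long window around the same point."""
    def window(size):
        start = max(0, centre - size // 2)
        return signal[start:start + size]
    short_ratio = energy_ratio(window(short), noise_power_per_sample)
    long_ratio = energy_ratio(window(long), noise_power_per_sample)
    return short_ratio > factor * long_ratio
```

A steady tone gives about the same ratio for both lengths, while a brief click or string slap is diluted in the long window and so stands out in the short one.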
I'm mainly throwing around some ideas in case it may help you to code things flexibly enough to test various options to fine-tune the NR algorithm. I'm sure there is scope to learn a good deal by experimenting with approaches to various parts of the algorithm, perhaps optimising each in turn to home in on a near-optimal overall solution, or perhaps using a design-of-experiments type of approach to explore the solution space.
I'd guess it makes sense to gather various samples of hiss to add to a selection of clean CD recordings (e.g. tape hiss from ferric and chrome tapes, with and without Dolby B, and perhaps some vinyl noise at 45 and 33 rpm), along with separate noise samples (uncorrelated with the noise added to the music) to seed the NR statistics.
Other possible noises to remove might include mains hum, TV line scan whistle, etc., though it is possible that an algorithm that's good at gracefully removing Gaussian hiss without damaging wanted transients might be less effective at removing hum and whistle, which might be better dealt with by a plain notch filter.
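For what it's worth, a plain notch filter is only a few lines. This Python sketch uses the widely published "audio EQ cookbook" biquad notch formulas; the Q of 30 and the function names are my own choices, and 50 Hz is just an example (mains hum is 60 Hz in some countries and usually has harmonics that would each need their own notch).

```python
import math

def notch_coefficients(freq_hz, sample_rate, q=30.0):
    """Biquad notch coefficients (b, a), normalised so that a[0] == 1."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def apply_biquad(samples, b, a):
    """Run samples through the filter (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

A higher Q gives a narrower notch that damages less of the music but takes longer to settle and needs the hum frequency to be known more accurately.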