Full Version: CUDA Coding Contest
Hydrogenaudio Forums > Lossy Audio Compression > MP3 > MP3 - Tech
skamp
QUOTE
Encoding MP3s always seems to take longer than it should. Do you wish you could find a way to make the process run faster?

Now there is a way. Enter this Contest by creating a program that will speed up MP3 encoding on a CUDA-enabled GPU, and the winner will receive a US$5,000.00 cash prize, subject to the Official Rules.

The Challenge
Create an MP3 encoder that runs on a CUDA-enabled GPU (click here for a list of CUDA-enabled GPUs) by taking a modified version of LAME as we provide and modifying its source code to allow the encoding to be performed on a CUDA-enabled GPU. Your encoder must be created using the CUDA programming environment and must achieve a speedup in run-time.

http://cudacontest.nvidia.com/index.cfm?ac...amp;contestid=2
Garf
Only for "Resident in the United States or Canada".

The rules allow you to essentially throw out the entire psymodel. In fact I see nothing regarding quality, just the fact that the file should decode. [1] "has incorrect audio" is meaningless.

I propose the following winning entry: make a CUDA program that fills a buffer with zeros. Write said zeros into an MP3-compatible bitstream.

I think it would be hard to beat.

[1] A logical thing to do would be to enforce some kind of similarity to the output of the original encoder, but this isn't possible because Nvidia's trigonometric functions have limited accuracy. So you couldn't use CUDA for most calculations. That's obviously not what they want.
Slipstreem
On the other hand, it might be possible to create a version that was actually useful if it ran on an ATI GPU. I believe the inaccuracies of which you speak are the reason why a protein folding client exists for ATI GPUs and not for nVidia GPUs. I certainly wouldn't entertain the idea of entering a competition that excluded the very hardware that might actually be capable of doing the job properly. :)

Cheers, Slipstreem. 8)
Garf
I am pretty sure neither Nvidia nor ATI has full IEEE compliance in their GPUs.

Don't get me wrong, it's certainly possible to create a very fast and excellent quality MP3 encoder with CUDA.

I just believe the rules of this contest invite making something else.
WonderSlug
QUOTE
taking a modified version of LAME as we provide and modifying its source code


Hmm...

That tells me that nVidia has already taken a version of the LAME source code (probably 3.97, not any of the 3.98 betas) and pre-modified it. It's this modified source code that an entrant will be using in the CUDA environment, not the original LAME source code.

I guess that's why they don't stipulate that the encode/decode MP3 data of the CUDA version be equivalent to that of the original non-CUDA LAME.

In other words, there's likely a reason why nVidia doesn't state, for instance, that the CUDA version of LAME 3.97 must produce MP3s that are exactly the same as those of "standard" non-CUDA LAME 3.97, just much faster, since it's CUDA/GPU.

There's also a reason why ATI GPUs aren't mentioned: This contest is sponsored by nVidia. So it makes sense they would want people to use only their GPUs and not those of the main competitor. Just think of the PR nightmare should a person make a version for the ATI that functions better than with nVidia, and nVidia has to pay the winner $5000 to boot!
Mike Giacomelli
QUOTE (Garf @ May 19 2008, 18:54) *
I am pretty sure neither Nvidia nor ATI has full IEEE compliance in their GPUs.


They do not, but IEEE compliance is a problem that can be worked around with careful programming and a good understanding of the math.

QUOTE (Garf @ May 19 2008, 18:54) *
Don't get me wrong, it's certainly possible to create a very fast and excellent quality MP3 encoder with CUDA.

I just believe the rules of this contest invite making something else.


Presumably they'll take whoever gives the best results, and only accept "cheating" if no one does any better. And anyway, an inaccurate solution is still extremely interesting, even if it's not immediately useful for encoding.

Actually since few people have GPUs that work with CUDA, the main interesting point here is going to be the math I think.

Edit: Does LAME actually need to compute trig functions as it goes? I assumed they used lookup tables like most decoders do.
Garf
QUOTE (Mike Giacomelli @ May 20 2008, 01:34) *
QUOTE (Garf @ May 19 2008, 18:54) *

I am pretty sure neither Nvidia nor ATI has full IEEE compliance in their GPUs.


They do not, but IEEE compliance is a problem that can be worked around with careful programming and a good understanding of the math.


Sure. For most applications you don't actually need to bother. The point is that ensuring identical outputs between LAME and CUDA LAME would be a major pain and outside the scope of the contest.
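
To illustrate why bit-identical output is hard to guarantee, here is a tiny sketch. It is not taken from LAME or from the contest code; it only assumes ordinary IEEE 754 doubles as used by Python. Floating-point addition is not associative, so merely regrouping a sum, which is what a parallel reduction on a GPU does, can change the last bits of the result even before any limited-accuracy trig functions come into play.

```python
# Floating-point addition is not associative: regrouping the same terms,
# as a parallel reduction would, can change the rounded result.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

print(left_to_right == regrouped)  # False
print(left_to_right, regrouped)
```

The two sums differ only in the last bit, which is harmless for audio quality but enough to make a byte-for-byte comparison against the reference encoder fail.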

QUOTE
QUOTE (Garf @ May 19 2008, 18:54) *

Don't get me wrong, it's certainly possible to create a very fast and excellent quality MP3 encoder with CUDA.

I just believe the rules of this contest invite making something else.


Presumably they'll take whoever gives the best results, and only accept "cheating" if no one does any better.


It's a contest with set rules.

QUOTE
And anyway, an inaccurate solution is still extremely interesting, even if it's not immediately useful for encoding.

Actually since few people have GPUs that work with CUDA, the main interesting point here is going to be the math I think.

Edit: Does LAME actually need to compute trig functions as it goes? I assumed they used lookup tables like most decoders do.


LAME is an encoder, not a decoder. Your input is not as tightly constrained as with a decoder, so it's much harder to tabulate everything. Of course, a sufficiently motivated person could surprise me :)
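
For the parts that can be tabulated, the idea looks roughly like the sketch below. This is not LAME's actual code, and the names `LONG_WINDOW` and `apply_window` are made up for illustration. It precomputes the 36-point sine window used for MP3 long blocks once, so windowing needs no trig call per frame.

```python
import math

# Precompute the sine window once; afterwards windowing is multiply-only.
N = 36
LONG_WINDOW = [math.sin(math.pi / N * (i + 0.5)) for i in range(N)]

def apply_window(block):
    """Multiply a 36-sample block by the precomputed window (no trig calls)."""
    if len(block) != N:
        raise ValueError("expected a block of %d samples" % N)
    return [x * w for x, w in zip(block, LONG_WINDOW)]

windowed = apply_window([1.0] * N)
```

Anything whose argument depends on the signal itself (rather than on a fixed index) cannot be handled this way without quantizing the argument first, which is where the encoder gets harder than the decoder.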

A (non-cheating) winning entry probably needs to strike a good balance between using the stream processors for DSP work such as the filterbank and psymodel, and processing multiple frames in parallel.
Gabriel
QUOTE (Garf @ May 20 2008, 07:56) *
A (non-cheating) winning entry probably needs to strike a good balance between using the stream processors for DSP work such as the filterbank and psymodel, and processing multiple frames in parallel.

Prediction -
The winner will:
*offload FFT to the GPU, as nVidia is already offering a CUDA FFT
*leave MDCT on the CPU, as they won't bother trying to understand it, and nVidia does not offer a 1152 CUDA MDCT
*offload the dreadful fs/4 time-domain lowpassing used for attack detection, as it's CPU-intensive and simple to offload
*offload the ReplayGain filtering to the GPU
*try to process psymodel on 2 granules at once, but will either fail or seriously damage the attack detection.

Hardly exciting: those are only the usual transform and filtering operations, and not MP3-specific. I would have loved to see stuff like quantization or noise calculation on the GPU.
Martel
I thought that CUDA generated shader code. If ATI cards have the same shader version as nvidia cards, I don't see a technical reason why CUDA-generated code shouldn't run on whatever hardware meets the shader version requirements... It must be just nvidia limiting it to the most expensive cards (and, of course, just their own cards).
This is such bullsh*t! It would be much more worthwhile to accelerate LAME using GLUT or some other universal API, not some proprietary crap which runs on only about 5 per cent of the graphics hardware in the world (and for no rational reason).
skamp
Whatever the result is, it will be Free Software, if I'm not mistaken. Presumably a lot of the necessary work will be done; even if changes/tweaks need to be made, that'll be a lot less work than doing everything from scratch.
cabbagerat
QUOTE (Gabriel @ May 19 2008, 22:52) *
Prediction -
The winner will:
*offload FFT to the GPU, as nVidia is already offering a CUDA FFT
*leave MDCT on the CPU, as they won't bother trying to understand it, and nVidia does not offer a 1152 CUDA MDCT
*offload the dreadful fs/4 time-domain lowpassing used for attack detection, as it's CPU-intensive and simple to offload
*offload the ReplayGain filtering to the GPU
*try to process psymodel on 2 granules at once, but will either fail or seriously damage the attack detection.

Hardly exciting: those are only the usual transform and filtering operations, and not MP3-specific. I would have loved to see stuff like quantization or noise calculation on the GPU.
Interesting, thanks Gabriel. Last year two engineering students in the lab where I work did some investigations on signal processing with CUDA for their honours projects. The conclusions were interesting - but not too different from what other people have found - FFTs and MDCTs are extremely fast on CUDA (much faster than the exotic Clearspeed hardware which it was being compared to), FIR filtering was fast and IIR filtering was pretty poor. I don't know too much about the work, and which filter structures they used, but it seems that recursive filters don't map too well to the GPU hardware. I have never profiled LAME, but I would agree that the low hanging fruit here is the FFT and MDCT (if they go to the effort of porting). The replaygain filtering probably isn't too great a win, as it uses a fairly short IIR filter.
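
A rough sketch of why recursive filters map poorly to this kind of hardware (plain Python with illustrative coefficients; this is neither the students' code nor LAME's ReplayGain filter): every FIR output depends only on input samples, so all outputs could be computed independently, while each IIR output needs the previous output to be finished first.

```python
def fir(x, taps):
    """y[n] = sum_k taps[k] * x[n-k]; every y[n] depends only on the input."""
    return [
        sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
        for n in range(len(x))
    ]

def iir_one_pole(x, a):
    """y[n] = x[n] + a * y[n-1]; y[n] cannot start until y[n-1] is done."""
    y = []
    prev = 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y

impulse = [1.0, 0.0, 0.0, 0.0]
print(fir(impulse, [0.5, 0.5]))    # [0.5, 0.5, 0.0, 0.0]
print(iir_one_pole(impulse, 0.5))  # [1.0, 0.5, 0.25, 0.125]
```

In the FIR case each list element could be handed to its own thread; in the IIR case the loop carries `prev` from one sample to the next, so a single channel is processed serially unless the recursion is restructured.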

I do agree that it's not hugely exciting - MP3 decode acceleration might actually be more interesting for games. Still, CUDA (and GPGPU stuff in general) is some really nice tech - but NVidia need to convince people that it's going to have a lifetime longer than the current generation of cards to make it worth sinking real development effort into.
skamp
If Fast Fourier Transform is one of the things in LAME that could be sped up, wouldn't using FFTW or djbfft help with the stock C implementation?
Garf
QUOTE (skamp @ May 20 2008, 15:17) *
If Fast Fourier Transform is one of the things in LAME that could be sped up, wouldn't using FFTW or djbfft help with the stock C implementation?


LAME already has SSE/SSE2/3DNow! etc optimized versions.
cabbagerat
QUOTE (skamp @ May 20 2008, 05:17) *
If Fast Fourier Transform is one of the things in LAME that could be sped up, wouldn't using FFTW or djbfft help with the stock C implementation?
Another problem is that FFTW is GPL and LAME is LGPL - you could link the two together - but only in a GPL application. LAME itself can't link against FFTW, even though FFTW is a really good library.

Some interesting CUDA FFT performance comparisons against FFTW can be found in Signal Processing on a Graphics Card: An Analysis of Performance and Accuracy. There is also a lot of other stuff all around the web.
skamp
QUOTE (Garf @ May 20 2008, 18:34) *
LAME already has SSE/SSE2/3DNow! etc optimized versions.

Again, not for 64-bit builds (not for anything that isn't x86_32, for that matter).

QUOTE (cabbagerat @ May 20 2008, 20:34) *
Another problem is that FFTW is GPL and LAME is LGPL - you could link the two together - but only in a GPL application. LAME itself can't link against FFTW, even though FFTW is a really good library.

I guess that's just a matter of writing a GPL front-end that would link to both libmp3lame and FFTW?
Jebus
My understanding is that CUDA is capable of many times over the performance that SSE/SSE2 provides.

I understand why LAME is LGPL, but couldn't FFTW be implemented in a GPL'd executable, and the DLL remain LGPL without the optimizations? Or, is it possible to have the license vary depending on the compile-time option used? For lots of projects, a GPL'd DLL wouldn't be an issue, but I know for lots it would be. I see no benefit in using the LGPL for the executable, however.
robert
Here is a profile of a 45 minute track encoded with LAME 3.98b8 -V0:
CODE
Flat profile:

Each sample counts as 0.01 seconds.
%time cumulative_seconds self_seconds calls self_s/call total_s/call name
21.59 23.01 23.01 206986 0.00 0.00 L3psycho_anal_vbr
18.29 42.51 19.50 129418890 0.00 0.00 calc_sfb_noise_x34
10.32 53.51 11.00 choose_table_MMX
10.24 64.43 10.92 103494 0.00 0.00 AnalyzeSamples
7.45 72.37 7.94 from3
4.25 76.90 4.53 fht_3DN
3.60 80.74 3.84 7451568 0.00 0.00 window_subband
3.40 84.36 3.62 103493 0.00 0.00 VBR_encode_frame
3.11 87.68 3.32 390816 0.00 0.00 LongHuffmancodebits
1.78 89.58 1.90 782720 0.00 0.00 calc_energy
1.58 91.26 1.68 413972 0.00 0.00 calc_xmin
1.51 92.87 1.61 103494 0.00 0.00 mdct_sub48
1.50 94.47 1.60 587034 0.00 0.00 quantize_x34
1.47 96.04 1.57 133364430 0.00 0.00 fast_log2
1.14 97.26 1.22 413972 0.00 0.00 init_xrpow_core_sse
1.08 98.41 1.15 782720 0.00 0.00 calc_mask_index_l
1.03 99.51 1.10 from2
0.66 100.21 0.70 229598 0.00 0.00 vbrpsy_compute_MS_thresholds
0.64 100.89 0.68 103494 0.00 0.00 fill_buffer
0.61 101.54 0.65 103494 0.00 0.00 lame_encode_buffer_sample_t
0.56 102.14 0.60 413972 0.00 0.00 fft_long
0.51 102.68 0.54 103492 0.00 0.00 lame_encode_buffer_int
0.46 103.17 0.49 103493 0.00 0.00 format_bitstream
0.39 103.59 0.42 103493 0.00 0.00 UpdateMusicCRC
0.36 103.97 0.38 31828816 0.00 0.00 putbits2
0.33 104.32 0.35 434880 0.00 0.00 best_huffman_divide
0.28 104.62 0.30 782720 0.00 0.00 convert_partition2scalefac_l
0.28 104.92 0.30 481303 0.00 0.00 long_block_constrain
0.23 105.16 0.24 103493 0.00 0.00 encodeSideInfo2
0.23 105.40 0.24 103492 0.00 0.00 get_audio_common
0.22 105.63 0.23 434880 0.00 0.00 best_scalefac_store
0.15 105.79 0.16 413972 0.00 0.00 init_outer_loop
0.14 105.94 0.15 1655888 0.00 0.00 ResvFrameBegin
0.12 106.07 0.13 103493 0.00 0.00 VBR_new_iteration_loop
0.09 106.17 0.10 105731 0.00 0.00 short_block_constrain
0.07 106.24 0.07 587034 0.00 0.00 noquant_count_bits
0.06 106.30 0.06 587034 0.00 0.00 set_scalefacs
0.05 106.35 0.05 2050297 0.00 0.00 athAdjust
0.04 106.39 0.04 600952 0.00 0.00 scale_bitcount
0.04 106.43 0.04 103493 0.00 0.00 lame_encode_mp3_frame
0.03 106.46 0.03 135672 0.00 0.00 convert_partition2scalefac_s
0.03 106.49 0.03 22612 0.00 0.00 fft_short
0.02 106.51 0.02 1862874 0.00 0.00 getframebits
0.02 106.53 0.02 206986 0.00 0.00 on_pe
0.02 106.55 0.02 104529 0.00 0.00 lame_get_frameNum
0.02 106.57 0.02 103493 0.00 0.00 lame_get_num_samples
0.01 106.58 0.01 413972 0.00 0.00 init_xrpow
0.01 106.59 0.01 173066 0.00 0.00 tryThatOne
0.01 106.60 0.01 1037 0.00 0.00 brhist_disp
0.00 106.60 0.00 413972 0.00 0.00 ResvAdjust
0.00 106.60 0.00 206988 0.00 0.00 copy_buffer
0.00 106.60 0.00 206986 0.00 0.00 ResvMaxBits
0.00 106.60 0.00 105566 0.00 0.00 lame_get_framesize
0.00 106.60 0.00 103494 0.00 0.00 compute_flushbits
0.00 106.60 0.00 103494 0.00 0.00 update_inbuffer_size
0.00 106.60 0.00 103493 0.00 0.00 AddVbrFrame
0.00 106.60 0.00 103493 0.00 0.00 ResvFrameEnd
0.00 106.60 0.00 103492 0.00 0.00 encoder_progress
0.00 106.60 0.00 103492 0.00 0.00 get_audio
0.00 106.60 0.00 103492 0.00 0.00 lame_get_num_channels
0.00 106.60 0.00 41460 0.00 0.00 console_printf
0.00 106.60 0.00 21582 0.00 0.00 get_lame_short_version
0.00 106.60 0.00 5180 0.00 0.00 ts_time_decompose
0.00 106.60 0.00 2137 0.00 0.00 ATHformula
0.00 106.60 0.00 2137 0.00 0.00 ATHformula_GB
0.00 106.60 0.00 2076 0.00 0.00 lame_get_out_samplerate
0.00 106.60 0.00 2074 0.00 0.00 lame_get_totalframes
0.00 106.60 0.00 1398 0.00 0.00 freq2bark
0.00 106.60 0.00 1039 0.00 0.00 GetCPUTime
0.00 106.60 0.00 1039 0.00 0.00 GetRealTime
0.00 106.60 0.00 1037 0.00 0.00 brhist_jump_back
0.00 106.60 0.00 1037 0.00 0.00 console_up
0.00 106.60 0.00 1037 0.00 0.00 lame_bitrate_hist
0.00 106.60 0.00 1037 0.00 0.00 lame_bitrate_stereo_mode_hist
0.00 106.60 0.00 1037 0.00 0.00 lame_block_type_hist
0.00 106.60 0.00 1037 0.00 0.00 lame_stereo_mode_hist
0.00 106.60 0.00 1037 0.00 0.00 timestatus
0.00 106.60 0.00 1036 0.00 0.00 console_flush
0.00 106.60 0.00 417 0.00 0.00 add_dummy_byte
0.00 106.60 0.00 8 0.00 0.00 frontend_msgf
0.00 106.60 0.00 8 0.00 0.00 lame_msgf
0.00 106.60 0.00 4 0.00 0.00 lame_get_VBR
0.00 106.60 0.00 3 0.00 0.00 lame_get_exp_nspsytune
0.00 106.60 0.00 3 0.00 0.00 lame_set_exp_nspsytune
0.00 106.60 0.00 2 0.00 0.00 BitrateIndex
0.00 106.60 0.00 2 0.00 0.00 free_id3tag
0.00 106.60 0.00 2 0.00 0.00 get_lame_os_bitness
0.00 106.60 0.00 2 0.00 0.00 get_lame_url
0.00 106.60 0.00 2 0.00 0.00 get_lame_version
0.00 106.60 0.00 2 0.00 0.00 init_numline
0.00 106.60 0.00 2 0.00 0.00 init_s3_values
0.00 106.60 0.00 2 0.00 0.00 is_mpeg_file_format
0.00 106.60 0.00 2 0.00 0.00 lame_get_bWriteVbrTag
0.00 106.60 0.00 2 0.00 0.00 lame_get_decode_only
0.00 106.60 0.00 2 0.00 0.00 lame_get_mode
0.00 106.60 0.00 2 0.00 0.00 lame_get_msfix
0.00 106.60 0.00 2 0.00 0.00 lame_get_quant_comp
0.00 106.60 0.00 2 0.00 0.00 lame_get_quant_comp_short
0.00 106.60 0.00 2 0.00 0.00 lame_get_short_threshold_lrm
0.00 106.60 0.00 2 0.00 0.00 lame_get_short_threshold_s
0.00 106.60 0.00 2 0.00 0.00 lame_set_num_channels
0.00 106.60 0.00 2 0.00 0.00 setLameTagFrameHeader
0.00 106.60 0.00 1 0.00 0.00 CloseSndFile
0.00 106.60 0.00 1 0.00 0.00 GetTitleGain
0.00 106.60 0.00 1 0.00 0.00 InitGainAnalysis
0.00 106.60 0.00 1 0.00 0.00 InitVbrTag
0.00 106.60 0.00 1 0.00 0.00 OpenSndFile
0.00 106.60 0.00 1 0.00 0.00 SmpFrqIndex
0.00 106.60 0.00 1 0.00 0.00 apply_preset
0.00 106.60 0.00 1 0.00 0.00 apply_vbr_preset
0.00 106.60 0.00 1 0.00 0.00 brhist_init
0.00 106.60 0.00 1 0.00 0.00 close_infile
0.00 106.60 0.00 1 0.00 0.00 disable_FPE
0.00 106.60 0.00 1 0.00 0.00 encoder_progress_begin
0.00 106.60 0.00 1 0.00 0.00 encoder_progress_end
0.00 106.60 0.00 1 0.00 0.00 flush_bitstream
0.00 106.60 0.00 1 0.00 0.00 freegfc
0.00 106.60 0.00 1 0.00 0.00 frontend_close_console
0.00 106.60 0.00 1 0.00 0.00 frontend_open_console
0.00 106.60 0.00 1 0.00 0.00 get_lame_very_short_version
0.00 106.60 0.00 1 0.00 0.00 has_3DNow
0.00 106.60 0.00 1 0.00 0.00 has_MMX
0.00 106.60 0.00 1 0.00 0.00 has_SSE
0.00 106.60 0.00 1 0.00 0.00 has_SSE2
0.00 106.60 0.00 1 0.00 0.00 huffman_init
0.00 106.60 0.00 1 0.00 0.00 id3tag_init
0.00 106.60 0.00 1 0.00 0.00 id3v2_add_latin1
0.00 106.60 0.00 1 0.00 0.00 init_bit_stream_w
0.00 106.60 0.00 1 0.00 0.00 init_fft
0.00 106.60 0.00 1 0.00 0.00 init_files
0.00 106.60 0.00 1 0.00 0.00 init_infile
0.00 106.60 0.00 1 0.00 0.00 init_log_table
0.00 106.60 0.00 1 0.00 0.00 init_outfile
0.00 106.60 0.00 1 0.00 0.00 init_xrpow_core_init
0.00 106.60 0.00 1 0.00 0.00 iteration_init
0.00 106.60 0.00 1 0.00 0.00 lame_bitrate_kbps
0.00 106.60 0.00 1 0.00 0.00 lame_close
0.00 106.60 0.00 1 0.00 0.00 lame_encode_flush
0.00 106.60 0.00 1 0.00 82.03 lame_encoder
0.00 106.60 0.00 1 0.00 0.00 lame_get_ATHcurve
0.00 106.60 0.00 1 0.00 0.00 lame_get_ATHlower
0.00 106.60 0.00 1 0.00 0.00 lame_get_RadioGain
0.00 106.60 0.00 1 0.00 0.00 lame_get_VBR_max_bitrate_kbps
0.00 106.60 0.00 1 0.00 0.00 lame_get_VBR_min_bitrate_kbps
0.00 106.60 0.00 1 0.00 0.00 lame_get_VBR_quality
0.00 106.60 0.00 1 0.00 0.00 lame_get_athaa_sensitivity
0.00 106.60 0.00 1 0.00 0.00 lame_get_encoder_delay
0.00 106.60 0.00 1 0.00 0.00 lame_get_encoder_padding
0.00 106.60 0.00 1 0.00 0.00 lame_get_findReplayGain
0.00 106.60 0.00 1 0.00 0.00 lame_get_force_ms
0.00 106.60 0.00 1 0.00 0.00 lame_get_free_format
0.00 106.60 0.00 1 0.00 0.00 lame_get_id3v1_tag
0.00 106.60 0.00 1 0.00 0.00 lame_get_id3v2_tag
0.00 106.60 0.00 1 0.00 0.00 lame_get_lametag_frame
0.00 106.60 0.00 1 0.00 0.00 lame_get_maskingadjust
0.00 106.60 0.00 1 0.00 0.00 lame_get_maskingadjust_short
0.00 106.60 0.00 1 0.00 0.00 lame_get_version
0.00 106.60 0.00 1 0.00 0.00 lame_init
0.00 106.60 0.00 1 0.00 0.00 lame_init_params
0.00 106.60 0.00 1 0.00 0.00 lame_print_config
0.00 106.60 0.00 1 0.00 0.00 lame_set_ATHcurve
0.00 106.60 0.00 1 0.00 0.00 lame_set_ATHlower
0.00 106.60 0.00 1 0.00 0.00 lame_set_VBR
0.00 106.60 0.00 1 0.00 0.00 lame_set_VBR_q
0.00 106.60 0.00 1 0.00 0.00 lame_set_VBR_quality
0.00 106.60 0.00 1 0.00 0.00 lame_set_athaa_sensitivity
0.00 106.60 0.00 1 0.00 0.00 lame_set_debugf
0.00 106.60 0.00 1 0.00 0.00 lame_set_errorf
0.00 106.60 0.00 1 0.00 0.00 lame_set_findReplayGain
0.00 106.60 0.00 1 0.00 0.00 lame_set_in_samplerate
0.00 106.60 0.00 1 0.00 0.00 lame_set_maskingadjust
0.00 106.60 0.00 1 0.00 0.00 lame_set_maskingadjust_short
0.00 106.60 0.00 1 0.00 0.00 lame_set_msfix
0.00 106.60 0.00 1 0.00 0.00 lame_set_msgf
0.00 106.60 0.00 1 0.00 0.00 lame_set_num_samples
0.00 106.60 0.00 1 0.00 0.00 lame_set_quant_comp
0.00 106.60 0.00 1 0.00 0.00 lame_set_quant_comp_short
0.00 106.60 0.00 1 0.00 0.00 lame_set_short_threshold_lrm
0.00 106.60 0.00 1 0.00 0.00 lame_set_short_threshold_s
0.00 106.60 0.00 1 0.00 0.00 lame_set_write_id3tag_automatic
0.00 106.60 0.00 1 0.00 0.00 parse_args
0.00 106.60 0.00 1 0.00 0.00 psymodel_init

% the percentage of the total running time of the
time program used by this function.

cumulative a running sum of the number of seconds accounted
seconds for by this function and those listed above it.

self the number of seconds accounted for by this
seconds function alone. This is the major sort for this
listing.

calls the number of times this function was invoked, if
this function is profiled, else blank.

self the average number of milliseconds spent in this
ms/call function per call, if this function is profiled,
else blank.

total the average number of milliseconds spent in this
ms/call function and its descendents per call, if this
function is profiled, else blank.

name the name of the function. This is the minor sort
for this listing. The index shows the location of
the function in the gprof listing. If the index is
in parenthesis it shows where it would appear in
the gprof listing if it were to be printed.


The FFT takes about 5 percent of processing time. If the CUDA approach manages to reduce its cost to zero, you will get a 5 percent faster LAME. The MDCT needs about 2 or 3 percent, which is not much either. You can get a 10 percent speed increase by specifying "--noreplaygain" now, without any CUDA tricks.
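
The arithmetic behind these estimates is Amdahl's law. The sketch below is only an illustration: the shares are read off the flat profile above, and it assumes that `fht_3DN` plus `fft_long` make up the FFT cost and that `AnalyzeSamples` is the ReplayGain analysis routine.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speedup when `fraction` of the run time gets `factor`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Shares of run time, read off the flat profile above (approximate).
fft_share = 0.0425 + 0.0056   # fht_3DN + fft_long
replaygain_share = 0.1024     # AnalyzeSamples

# Even an infinitely fast FFT only removes its own share of about 5 percent.
print(round(overall_speedup(fft_share, float("inf")), 3))
# Dropping ReplayGain analysis entirely removes about 10 percent.
print(round(overall_speedup(replaygain_share, float("inf")), 3))
```

Both results stay close to 1, which is why offloading only the FFT cannot pay off much: the big entries in the profile are the psymodel and the noise calculation.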

Edit: It seems two separate long posts may result in one corrupt post because of the merge feature. So I removed the traditional profile report to reduce the post length.
Mike Giacomelli
QUOTE (skamp @ May 20 2008, 15:19) *
QUOTE (Garf @ May 20 2008, 18:34) *
LAME already has SSE/SSE2/3DNow! etc optimized versions.

Again, not for 64-bit builds (not for anything that isn't x86_32, for that matter).


Use a 32 bit compiler.

QUOTE
I do agree that it's not hugely exciting - MP3 decode acceleration might actually be more interesting for games.


I know very few people who would be "interested" in buying a second GPU to save 5MHz of CPU time. Generally the only things that are interesting to accelerate are things that don't already run orders of magnitude faster than they need to for a given application.
skamp
QUOTE (robert @ May 20 2008, 22:45) *
The FFT takes about 5 percent of processing time. If the CUDA approach manages to reduce its cost to zero, you will get a 5 percent faster LAME. The MDCT needs about 2 or 3 percent, which is not much either. You can get a 10 percent speed increase by specifying "--noreplaygain" now, without any CUDA tricks.

Does that mean that even a very well-written CUDA port won't improve encoding speed by more than 8-10%?

QUOTE (Mike Giacomelli @ May 21 2008, 17:26) *
Use a 32 bit compiler.

It's already been stated several times that a 64-bit build (thus without ASM optimizations) is already faster than 32-bit builds (with ASM). We're running in circles here.
Garf
QUOTE (skamp @ May 21 2008, 18:05) *
QUOTE (robert @ May 20 2008, 22:45) *

The FFT takes about 5 percent of processing time. If the CUDA approach manages to reduce its cost to zero, you will get a 5 percent faster LAME. The MDCT needs about 2 or 3 percent, which is not much either. You can get a 10 percent speed increase by specifying "--noreplaygain" now, without any CUDA tricks.

Does that mean that even a very well-written CUDA port won't improve encoding speed by more than 8-10%?



No, it just means that it will take more than simply plugging in Nvidia's CUDA FFT to get a good speedup.
Martel
QUOTE (Mike Giacomelli @ May 21 2008, 07:26) *
I know very few people who would be "interested" in buying a second GPU to save 5MHz of CPU time. Generally the only things that are interesting to accelerate are things that don't already run orders of magnitude faster than they need to for a given application.
So why accelerate LAME? You may encode on-the-fly (while grabbing) already with today's computers and it won't make much difference compared to WAV grabbing unless you use some overkill drive (e.g. Pioneer grabbing CDDA at 32x) + burst mode.
It would be much more useful to take the x264 codec and accelerate that, so I could record TV broadcasts directly into H.264. Well, not exactly for me, since I own an HD 3870. :(
Axon
Performance issues could come into play more with surround-sound codecs and Dolby HD encoding.
skamp
QUOTE (Martel @ May 21 2008, 21:35) *
So why accelerate LAME? You may encode on-the-fly (while grabbing) already with today's computers and it won't make much difference compared to WAV grabbing unless you use some overkill drive (e.g. Pioneer grabbing CDDA at 32x) + burst mode.

Think ripping to lossless files and transcoding on the fly to MP3 files on your portable player, instead of ripping to MP3 directly or keeping two copies around. Or, on-demand transcoding on commercial platforms (like with the departed AllOfMP3.com).
Martel
QUOTE (skamp @ May 21 2008, 14:46) *
QUOTE (Martel @ May 21 2008, 21:35) *
So why accelerate LAME? You may encode on-the-fly (while grabbing) already with today's computers and it won't make much difference compared to WAV grabbing unless you use some overkill drive (e.g. Pioneer grabbing CDDA at 32x) + burst mode.

Think ripping to lossless files and transcoding on the fly to MP3 files on your portable player, instead of ripping to MP3 directly or keeping two copies around. Or, on-demand transcoding on commercial platforms (like with the departed AllOfMP3.com).

Are you that desperate for HDD space? You can keep two copies of your music collection without problems: one lossless, one lossy for a portable. The lossy one takes only about 20 per cent more space on top of the lossless one.
This would be good for freaks who transcode their whole lossless music collection into MP3 whenever a new stepping of LAME comes out or whenever they find "better" LAME settings. :)
And if you're talking about industrial use of LAME, then writing LAME for MPI (Message Passing Interface, an API for computer clusters) would be like a thousand times more valuable. IT centers usually have cluster(s) but not nvidia 8800+ cards.
skamp
QUOTE (Martel @ May 22 2008, 10:19) *
Are you that desperate for HDD space?

That space could be used for something else (HD video for instance).

QUOTE (Martel @ May 22 2008, 10:19) *
And if you're talking about industrial use of LAME, then writing LAME for MPI (Message Passing Interface, an API for computer clusters) would be like a thousand times more valuable.

Good point.
Brent
QUOTE (Martel @ May 22 2008, 10:19) *
Are you that desperate for HDD space?

Managing two libraries and keeping them synchronized by hand is _much_ more of a hassle than realtime transcoding from a lossless master to whatever you need.

I do agree though that with today's CPUs, offloading this to the GPU is of questionable use, though it's a fun exercise.
cbuchner1
Hi there,

With the GPU having so many PARALLEL sub-processors, wouldn't it bring a huge benefit to encode multiple segments of the input audio stream in parallel (given that the recording is already available on disk and not streamed live)? The GPU could then perform several FFTs, MDCTs or other filter operations at the same time.

Just the parts where these segments overlap could be iffy. And maybe managing the bit reservoir across these segments.
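
A minimal sketch of how such a split could look, in plain Python with made-up names (`split_segments`, `overlap_frames`); it is not taken from LAME or from any contest entry. Each segment after the first re-reads some earlier frames so that its filterbank and psymodel have history to start from.

```python
FRAME = 1152  # samples per MPEG-1 Layer III frame

def split_segments(num_samples, frames_per_segment, overlap_frames=1):
    """Return (start, end) sample ranges; each segment after the first
    re-reads `overlap_frames` frames so its filterbank has history."""
    step = frames_per_segment * FRAME
    segments = []
    start = 0
    while start < num_samples:
        lead_in = max(0, start - overlap_frames * FRAME)
        segments.append((lead_in, min(start + step, num_samples)))
        start += step
    return segments

print(split_segments(10 * FRAME, 4))
```

This only deals with the sample overlap. The bit reservoir is not addressed: a segment encoded in isolation cannot borrow bits from the frames before it, so the segment boundaries would need extra care.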

Also: With the nVidia contest only allowing US residents, why not start a European counter contest to do the same with Ogg Vorbis or other open codecs? The prize money doesn't have to be $5000, a mere 500 Euros could do the job.

Christian
Mike Giacomelli
QUOTE (cbuchner1 @ Jun 19 2008, 19:03) *
Hi there,

With the GPU having so many PARALLEL sub-processors, wouldn't it bring a huge benefit to encode multiple segments of the input audio stream in parallel (given that the recording is already available on disk and not streamed live)?


Probably not worth the effort. I doubt you'll be limited by the performance of one pipe on the GPU, so no sense using more than one.
cabbagerat
QUOTE (Mike Giacomelli @ Jun 19 2008, 16:48) *
Probably not worth the effort. I doubt you'll be limited by the performance of one pipe on the GPU, so no sense using more than one.
Due to the lack of real caching on the GPU, the performance per-pipe is actually pretty bad. GPUs extract performance from parallelism in two ways: (1) by running multiple threads on the same unit, swapping out when one thread is waiting for memory accesses and (2) by having multiple units. Having said that, a good FFT (or DCT) implementation for the GPU will take advantage of the parallelism inherent in the FFT, and pretty much saturate the GPU - so processing multiple audio chunks in parallel is unlikely to gain you much performance.
Martel
Well, if the contest required loading an entire WAV file onto the GPU and having it output a finished MP3 stream, it might even turn out to be slower than on a CPU, no matter the theoretical computational capacity.
CUDA might be good but only for a narrow class of problems.
cabbagerat
QUOTE (Martel @ Jun 19 2008, 23:47) *
CUDA might be good but only for a narrow class of problems.
That's certainly true. However, with the single cycle multiply-accumulate (MAC) and double precision in the new architecture, that problem space is expanding. Still, it needs to be treated as a coprocessor - working with the CPU and not replacing it.
Ron Jones
Just wanted to update that this contest ended, officially, on 7/25, and there were a total of 27 submissions from 214 competitors (wow).

Anyway, I've signed up with the NVIDIA CUDA developer network so I could download the source code to the winning entry, and no executable/DLL is included, so I'm rather curious if anyone here could have a stab at compiling this thing. I'm not a programmer, so the percentage chance of me managing to successfully compile this is about 3%.

Interestingly, the winner was chosen based on some sort of "score", but there's no explanation as to what the "score" actually is. For whatever it's worth, this particular encoder achieved a score of 2.46 ;)

NOTE: I believe I'm in the clear to re-distribute this source code, as I didn't agree to any terms or conditions when I registered with NVIDIA, but I'm not 100% certain.
Martel
I would gladly compile and test this but unfortunately, I have a Radeon 3870. The code compiles just fine until nvcc is needed to compile a single libmp3lame/new_scheme.cu file. This is the only file I can see that makes some difference.
The file seems to do just scalefactor/distortion calculations. It seems that they don't even use the CUDA FFT.
Garf
QUOTE (Ron Jones @ Aug 6 2008, 01:36) *
Interestingly, the winner was chosen based on some sort of "score", but there's no explanation as to what the "score" actually is. For whatever it's worth, this particular encoder achieved a score of 2.46 ;)


Should be speedup based on the contest rules.

The winning entry appears to be valid, by which I mean it does something reasonable.

However, the speedup is hardly impressive. I believe much better results are possible. Still, I'm pleasantly surprised so far.