For my DSP I need the two dsp_util "convert" functions, because I depend on compiled third-party code that only accepts 32-bit floats rather than foobar2000's native 64-bit float samples.

In-place conversion is desirable because it saves memory and avoids an extra buffer (when you don't need to keep the original values). So I propose the following change to dsp_util::convert_32_to_64 that lets it convert in place (src=dest). dsp_util::convert_64_to_32 already works in place unchanged, because each 4-byte result is written at or before the 8-byte value it was read from. The reverse direction does not, because the wider output would overwrite floats that have not been read yet.

CODE

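__declspec(naked) void __fastcall convert_32_to_64 (float *src, double *dest, unsigned count) {

 _asm {  // src: ECX, dest: EDX, count: [ESP+4]

       MOV     EAX,DWORD PTR [ESP+4]   // EAX = count
       LEA     ECX,[ECX+4*EAX]         // ECX = one past the last float
       LEA     EDX,[EDX+8*EAX]         // EDX = one past the last double
       SHR     EAX,2                   // EAX = number of 4-sample blocks
       JZ      Remainder

BlockLoop:                             // 4 samples per pass, walking backwards
       LEA     ECX,[ECX-16]
       LEA     EDX,[EDX-32]

       FLD     DWORD PTR [ECX+12]      // load samples 3 and 2 before storing
       FLD     DWORD PTR [ECX+8]
       FSTP    QWORD PTR [EDX+16]      // store sample 2 (top of FPU stack)
       FSTP    QWORD PTR [EDX+24]      // store sample 3

       DEC     EAX                     // sets ZF for JNZ; FLD/FSTP leave EFLAGS alone

       FLD     DWORD PTR [ECX+4]       // load samples 1 and 0 before storing
       FLD     DWORD PTR [ECX]
       FSTP    QWORD PTR [EDX]         // store sample 0
       FSTP    QWORD PTR [EDX+8]       // store sample 1
       JNZ     BlockLoop

Remainder:                             // leftover count & 3 samples at the buffer start
       MOV     EAX,DWORD PTR [ESP+4]
       AND     EAX,3
       JZ      Exit

SingleLoop:                            // one sample per pass, still backwards
       LEA     ECX,[ECX-4]
       LEA     EDX,[EDX-8]

       FLD     DWORD PTR [ECX]
       DEC     EAX
       FSTP    QWORD PTR [EDX]
       JNZ     SingleLoop

Exit:
       RET     4                       // __fastcall: callee pops the stacked count
 }
}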


The trick is to reverse the direction of the conversion, working from the end of the buffer towards the start, so we never overwrite a value that has yet to be converted. Walking memory backwards may be less friendly to the processor's cache prefetching, but nothing is perfect.

I haven't coded the C++ version, but it should just be a matter of replacing the post-increments (p++) with pre-decrements (--p) and initially offsetting both pointers to the position just past the last value.