I think we can agree that in-place conversion is a desirable property: it saves memory and avoids an extra buffer whenever you don't need to keep the original values. So I propose the following change to dsp_util::convert_32_to_64, which lets it convert in place (src == dest, with the buffer large enough to hold count doubles). dsp_util::convert_64_32 already works in place with no change, because narrowing from front to back never overwrites a value that has not been read yet. The widening direction does not have that property.
CODE
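__declspec(naked) void __fastcall convert_32_to_64 (float *src, double *dest, unsigned count) {
    _asm { // src: ECX, dest: EDX, count: [ESP+4]
        MOV EAX,DWORD PTR [ESP+4]   // EAX = count
        LEA ECX,[ECX+4*EAX]         // ECX = one past the last float
        LEA EDX,[EDX+8*EAX]         // EDX = one past the last double
        SHR EAX,2                   // EAX = number of 4-value blocks
        JZ Remainder
    BlockLoop:                      // 4 values per pass, from the end backwards
        LEA ECX,[ECX-16]
        LEA EDX,[EDX-32]
        FLD DWORD PTR [ECX+12]      // load values 3 and 2 before storing anything
        FLD DWORD PTR [ECX+8]
        FSTP QWORD PTR [EDX+16]     // ST(0) is value 2
        FSTP QWORD PTR [EDX+24]     // then value 3
        DEC EAX                     // sets ZF for the JNZ below (x87 moves leave EFLAGS alone)
        FLD DWORD PTR [ECX+4]       // load values 1 and 0
        FLD DWORD PTR [ECX]
        FSTP QWORD PTR [EDX]        // ST(0) is value 0
        FSTP QWORD PTR [EDX+8]      // then value 1
        JNZ BlockLoop
    Remainder:
        MOV EAX,DWORD PTR [ESP+4]
        AND EAX,3                   // EAX = count mod 4, the values left at the start of the buffer
        JZ Exit
    SingleLoop:                     // one value per pass, still backwards
        LEA ECX,[ECX-4]
        LEA EDX,[EDX-8]
        FLD DWORD PTR [ECX]
        DEC EAX
        FSTP QWORD PTR [EDX]
        JNZ SingleLoop
    Exit:
        RET 4                       // __fastcall: callee pops the stacked count
    }
}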
The trick here is to reverse the conversion direction, going from the end of the buffer towards the start, so we never overwrite a value that has yet to be converted. Walking memory backwards may be less friendly to the processor's cache prefetching, but nothing is perfect.
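To see why the direction matters, here is a small sketch in portable C++. The helper name widen_in_place is made up for illustration and is not part of dsp_util, and the code assumes 4-byte IEEE 754 floats and 8-byte doubles. The double written for value i covers the bytes of floats 2i and 2i+1, so going forward destroys float 1 while storing double 0, whereas going backward only touches floats that were already read.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper, for illustration only: widens `in` to doubles inside
// one shared buffer, either back-to-front or front-to-back.
std::vector<double> widen_in_place(const std::vector<float>& in, bool backward) {
    const std::size_t n = in.size();
    std::vector<double> buf(n);   // room for the n doubles of the result
    if (n == 0) return buf;

    // Pack the floats at the start of the buffer, as in the src == dest case.
    unsigned char* bytes = reinterpret_cast<unsigned char*>(buf.data());
    std::memcpy(bytes, in.data(), n * sizeof(float));

    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = backward ? n - 1 - k : k;
        float f;
        std::memcpy(&f, bytes + i * sizeof(float), sizeof f);    // read float i
        const double d = f;
        std::memcpy(bytes + i * sizeof(double), &d, sizeof d);   // write double i
    }
    return buf;
}
```

With the input {1, 2, 3, 4}, the backward pass returns the four values intact, while the forward pass already gets the second value wrong, because it reads half of the double it has just stored.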
I haven't coded the C++ version, but it should just be a matter of replacing the post-increments (p++) with pre-decrements (--p) and initially moving both pointers to the position just past the last value.
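For what it's worth, a portable version might look roughly like the sketch below. This is not the actual dsp_util code: the name convert_32_to_64_backward is hypothetical, and I am assuming the same (src, dest, count) parameters as the assembly routine. It goes through std::memcpy rather than plain `*--dest = *--src` so that the src == dest case does not depend on reading the same memory through both a float* and a double*.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical portable counterpart of the assembly routine (a sketch, not
// the library's implementation). Safe when src and dest are the same buffer,
// provided that buffer can hold `count` doubles.
void convert_32_to_64_backward(const float* src, double* dest, unsigned count) {
    // Start just past the last value of each array.
    const unsigned char* s = reinterpret_cast<const unsigned char*>(src)
                             + static_cast<std::size_t>(count) * sizeof(float);
    unsigned char* d = reinterpret_cast<unsigned char*>(dest)
                       + static_cast<std::size_t>(count) * sizeof(double);

    while (count-- > 0) {
        s -= sizeof(float);              // pre-decrement, then read
        d -= sizeof(double);             // pre-decrement, then write
        float f;
        std::memcpy(&f, s, sizeof f);    // the read always happens before the write
        const double v = f;
        std::memcpy(d, &v, sizeof v);
    }
}
```

For in-place use, the caller would fill the start of a double buffer with floats and pass the same address for both arguments, e.g. convert_32_to_64_backward(reinterpret_cast<const float*>(buf), buf, count).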