that's a dilly of a pickle.
the music is stereo, so an m-s transform will damage the music. you might be able to put up with it though (the vocals will not be separated from the music entirely, but rather will be mixed with a difference channel).
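a rough python sketch of why that happens, assuming the vocal sits in the left channel only and the music differs between channels. the sample values are made-up toy numbers and `ms_transform` is just an illustrative name, not anything from the editor; which way round a given channel mixer preset applies the sum and difference is also an assumption here.

```python
def ms_transform(left, right):
    """Sum/difference (mid/side) transform of two equal-length channels."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


# made-up toy samples: the vocal lives in the left channel only,
# the music is different in each channel (i.e. genuinely stereo)
vocal = [0.5, -0.25, 0.75, 0.0]
music_l = [0.25, 0.5, -0.5, 0.25]
music_r = [-0.25, 0.5, 0.25, -0.75]

left = [v + m for v, m in zip(vocal, music_l)]
right = list(music_r)

out_l, out_r = ms_transform(left, right)

# what is left of the music in each output once the vocal's share is removed
music_in_out_l = [o - v / 2 for o, v in zip(out_l, vocal)]  # (music_l + music_r) / 2
music_in_out_r = [o - v / 2 for o, v in zip(out_r, vocal)]  # (music_l - music_r) / 2
```

both outputs carry the vocal at the same level (so it lands in the middle), but one output gets the music's sum and the other its difference, which is the damage to the stereo image.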
to mitigate any strange effects, you could limit the m-s transform to the speech band (maybe 150-5000 Hz would do it without being too annoying). some of the bass and presence in the vocals will fall outside that band and get b0rked, but this shouldn't be as annoying as having the vocals all on one side.
i'd do this in the multitrack view: one track for m-s, one for regular. use the FFT filter combined with the convolver to separate the bands out (set up the filter you want in the FFT filter, apply it to a dirac pulse, i.e. a single full-scale sample followed by silence, and feed the result to multitrack as a realtime convolution effect).
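the dirac pulse trick works because filtering a single-sample pulse gives you the filter's impulse response, and convolving with that response is the same as running the filter. a minimal python sketch of the idea follows. everything in it is an assumption for illustration: the windowed-sinc band-pass merely stands in for whatever the FFT filter produces, `band_filter` plays the role of the black box, and the 16 kHz rate and 401 taps are toy values chosen to keep pure python quick.

```python
import cmath
import math


def bandpass_kernel(low_hz, high_hz, sample_rate, taps):
    """Windowed-sinc band-pass FIR (odd taps): low-pass(high) minus low-pass(low)."""
    centre = (taps - 1) // 2

    def lowpass(cutoff_hz):
        fc = cutoff_hz / sample_rate  # cutoff as a fraction of the sample rate
        kernel = []
        for n in range(taps):
            m = n - centre
            ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
            hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))
            kernel.append(ideal * hamming)
        return kernel

    return [hi - lo for hi, lo in zip(lowpass(high_hz), lowpass(low_hz))]


def convolve(signal, kernel):
    """Plain full-length convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def gain_at(kernel, freq_hz, sample_rate):
    """Magnitude response of an FIR kernel at one frequency."""
    w = 2 * math.pi * freq_hz / sample_rate
    return abs(sum(k * cmath.exp(-1j * w * n) for n, k in enumerate(kernel)))


SAMPLE_RATE = 16000  # toy rate, assumption
TAPS = 401           # toy length, assumption

_kernel = bandpass_kernel(150, 5000, SAMPLE_RATE, TAPS)


def band_filter(signal):
    """Black box standing in for the editor's filter (hypothetical)."""
    return convolve(signal, _kernel)


# 1. filter a dirac pulse: one full-scale sample followed by silence
dirac = [1.0] + [0.0] * (TAPS - 1)
impulse = band_filter(dirac)[:TAPS]  # the rest of the output is silence

# 2. convolving with the captured impulse now reproduces the filter
test_signal = [math.sin(0.3 * n) for n in range(64)]
via_impulse = convolve(test_signal, impulse)
via_filter = band_filter(test_signal)
```

`gain_at(impulse, 1000, SAMPLE_RATE)` should come out close to 1 and `gain_at(impulse, 0, SAMPLE_RATE)` close to 0, which is a quick way to check that the captured impulse really is the band-pass you wanted.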
well, that's my best guess so far.
[edit]
oh, and after the bands are separated, put a channel mixer on the "speech" track, set to "mid/side".
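the whole signal flow (speech track with the channel mixer, regular track with everything else, both mixed back together) can be sketched in python like this. it is only a sketch under assumptions: `fir`, `delay` and `band_limited_ms` are made-up names, the channel mixer is modelled as a plain sum/difference, and the 3-tap kernels are placeholders that pass everything or nothing so the result is easy to check by hand. a real run would use a proper speech-band kernel.

```python
def fir(signal, kernel):
    """Causal FIR filter; output is the same length as the input."""
    return [
        sum(k * signal[i - j] for j, k in enumerate(kernel) if i - j >= 0)
        for i in range(len(signal))
    ]


def delay(signal, n):
    """Delay by n samples, keeping the length."""
    return [0.0] * n + signal[:len(signal) - n]


def band_limited_ms(left, right, kernel):
    """m-s transform inside the band picked by `kernel`, untouched elsewhere."""
    latency = (len(kernel) - 1) // 2  # delay of a symmetric, odd-length kernel

    # "speech" track: the band-passed copy of each channel
    band_l, band_r = fir(left, kernel), fir(right, kernel)

    # "regular" track: whatever the band-pass removed, time-aligned with it
    rest_l = [d - b for d, b in zip(delay(left, latency), band_l)]
    rest_r = [d - b for d, b in zip(delay(right, latency), band_r)]

    # channel mixer on the speech track: sum and difference
    mid = [(l + r) / 2 for l, r in zip(band_l, band_r)]
    side = [(l - r) / 2 for l, r in zip(band_l, band_r)]

    # mix the two tracks back together
    out_l = [m + x for m, x in zip(mid, rest_l)]
    out_r = [s + x for s, x in zip(side, rest_r)]
    return out_l, out_r


# toy input: signal on the left only, silence on the right
left = [1.0, 0.0, -1.0, 0.5, 0.25, 0.0]
right = [0.0] * len(left)

everything = band_limited_ms(left, right, [0.0, 1.0, 0.0])  # kernel passes all
nothing = band_limited_ms(left, right, [0.0, 0.0, 0.0])     # kernel passes none
```

with the pass-everything kernel the left-only signal ends up at half level in both outputs, and with the pass-nothing kernel it stays on the left, so anything inside the band gets centred and anything outside it stays where it was.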
i have no idea whether this will work satisfactorily or not. if it doesn't, you can always convert it all to mono.