FLAC I/O less efficient than STDIN, Direct file access can be almost twice slower than STDIN

Nov 26 2012, 18:26 | Post #1
Group: Members | Posts: 1066 | Joined: 4-May 04 | From: France | Member No.: 13875
I was optimizing caudec when I came across this oddity. Basically, letting /usr/bin/flac access .flac files on a slowish HDD directly for decoding ('flac -d file.flac') was, in one particular case, almost twice as slow as piping the files to /usr/bin/flac via STDIN ('cat file.flac | flac -d -').

I used a double album for testing, made of 37 tracks for a total of about 1 GiB, located on an HDD that tops out at about 70 MB/s. Incidentally, flac decodes on my machine at a similar rate.

I ran caudec twice (figuratively: I repeated the tests many times) with 8 concurrent processes, decoding those FLAC files to WAV on a ramdisk. I made sure to drop all caches between runs.

The first run used direct file access and completed in 40 seconds. The second run piped to STDIN and completed in 25 seconds. Surprisingly, the difference was much less pronounced on a USB flash drive that tops out at 35 MB/s (34 seconds vs. 30 seconds), and non-existent on a RAID 1 array that tops out at 130 MB/s and on an SSD that tops out at 500 MB/s. I experienced similar differences with WavPack.

Does anyone have any idea of what's going on?

--------------------
Save my friend from going homeless: http://outpost.fr/url/308w
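
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two access methods described in the post. It is not caudec's actual code: the `OUT` ramdisk path, the `JOBS` and `DECODER` variables, and all function names are assumptions made for illustration, and the batch throttle is cruder than a real job pool. Dropping the caches requires root.

```shell
#!/bin/sh
# Assumptions: OUT is a ramdisk mount point (hypothetical path) and
# DECODER is a command that writes decoded audio to stdout.
OUT=${OUT:-/mnt/ramdisk}
JOBS=${JOBS:-8}
DECODER=${DECODER:-flac -s -d -c}

# Flush dirty pages, then drop the page cache (needs root).
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# Map /some/dir/track.flac to $OUT/track.wav.
wav_path() { echo "$OUT/$(basename "$1" .flac).wav"; }

# Direct access: the decoder opens the file itself.
decode_direct() { $DECODER "$1" > "$(wav_path "$1")"; }

# Piped access: cat reads the file, the decoder only sees a pipe on STDIN.
decode_piped() { cat "$1" | $DECODER - > "$(wav_path "$1")"; }

# run_batch FUNC FILE...: run FUNC on every file, JOBS at a time,
# and print the elapsed wall-clock time in whole seconds.
run_batch() {
    func=$1
    shift
    start=$(date +%s)
    n=0
    for f in "$@"; do
        "$func" "$f" &
        n=$((n + 1))
        # Crude throttle: let the whole group finish before starting more.
        if [ $((n % JOBS)) -eq 0 ]; then
            wait
        fi
    done
    wait
    end=$(date +%s)
    echo $((end - start))
}
```

Typical use would be `drop_caches; run_batch decode_direct /path/to/album/*.flac`, followed by the same call with `decode_piped`, and then comparing the two printed times (the album path is a placeholder).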

Nov 26 2012, 19:11 | Post #2
Group: Members | Posts: 144 | Joined: 14-February 12 | Member No.: 97162
Running the reading and decoding processes in parallel?

Nov 26 2012, 20:19 | Post #3
Group: Members | Posts: 512 | Joined: 4-June 02 | Member No.: 2220
I would imagine reading and writing to the same mechanical disk would be the culprit. This is supported by the USB drive measurement, but that doesn't appear to explain everything. If I am understanding correctly, the plain version of decoding is slower than a workaround-type STDIN decoding. Although I am unfamiliar with *nix, there are two possibilities (either or both):

OS - handling of STDIO regarding write-buffering via the HDD driver and/or CPU;
BIN - when outputting data from a STDIN source it reads/writes larger chunks, and the chunk size works better with the write-buffering and/or CPU cache (I recall something at FB2K about differences in the seek table when the encoder used STDIN; not sure why this would also affect decoding).

--------------------
"Something bothering you, Mister Spock?"

Nov 26 2012, 20:59 | Post #4
Group: Members | Posts: 1066 | Joined: 4-May 04 | From: France | Member No.: 13875
> Running the reading and decoding processes in parallel?

Yes.

> I would imagine reading and writing to the same mechanical disk would be the culprit.

Like I said, I was writing to a ramdisk! The drives were only read from, not written to.

Nov 26 2012, 22:02 | Post #5
Group: Members | Posts: 512 | Joined: 4-June 02 | Member No.: 2220
Ok, I see that in your third paragraph. (Pardon my oversight, I posted right before a medical appointment so apparently I was distracted.)

If I read correctly:

FLAC -d [HDD -> RAMdisk] = 40s
FLAC STDIO [HDD -> RAMdisk] = 25s
FLAC -d [flashdrive -> RAMdisk] = 34s
FLAC STDIO [flashdrive -> RAMdisk] = 30s
RAID = no tangible difference
SSD = no tangible difference

You mentioned WavPack having similar behavior, so it doesn't seem the binary is the culprit either. I don't suppose using less concurrent threads would improve the performance of FLAC -d but it might be worth checking. It also is unclear how these threads are distributed, but I wondered if multiple decode threads caused an unintended bottleneck (especially with fast-decoding formats).

edit: I should also have mentioned I thought an instance of STDIO was limited to one thread per file, but this may be a bad assumption on my part.

This post has been edited by Destroid: Nov 26 2012, 22:08

Nov 26 2012, 22:22 | Post #6
Group: Members | Posts: 1066 | Joined: 4-May 04 | From: France | Member No.: 13875
> If I read correctly:

Yes.

> I don't suppose using less concurrent threads would improve the performance of FLAC -d but it might be worth checking.

Using 4 processes instead of 8:

HDD direct: 45 seconds, HDD STDIN: 27 seconds
USB direct: 34 seconds, USB STDIN: 31 seconds

Note that I'm using a quad-core CPU with hyperthreading (4 cores, 8 threads).

> I should also have mentioned I thought an instance of STDIO was limited to one thread per file, but this may be a bad assumption on my part.

Yes. Really, the only difference here is that I delegated the reading to /usr/bin/cat. That alone magically improves performance, particularly in the HDD case. /usr/bin/cat is doing something right that /usr/bin/flac is doing wrong, or so it seems anyway.

Nov 26 2012, 23:16 | Post #7
Group: Members | Posts: 28 | Joined: 20-May 11 | Member No.: 90802
Try using just one thread for cat and the same for flac -d. Perhaps one binary is optimized for multi-core, and one is not.

It seems to me that flac would use RAM to decode. Having less RAM available (because of the ramdisk) may be limiting flac's ability to decode, but I think the same could be said for cat. That's assuming you used actual RAM and not swap or other temporary hard drive space for the ramdisk.

Also, /usr/bin/flac seems like a binary provided by your distribution. Maybe try using a more optimized one that you compiled, or even one from rarewares (if they have it), since caudec supports wine anyway.

Nov 26 2012, 23:28 | Post #8
Group: Members | Posts: 144 | Joined: 1-March 11 | Member No.: 88621
I tested this on my machine with mostly insignificant differences.
On a 171MB FLAC -8 encoded file. I ran one decode to allow Linux to cache the FLAC file in RAM and then discarded the results. I did 3 runs...

A proper process efficient redirection to standard input:

    flac -o test.wav -d - < 01-A\ Change\ Of\ Seasons.flac
    real 0m3.740s
    user 0m3.432s
    sys 0m0.300s

A process inefficient pipe from cat:

    cat 01-A\ Change\ Of\ Seasons.flac | flac -o test.wav -d -
    real 0m3.869s
    user 0m3.428s
    sys 0m0.720s

Allowing FLAC to read the file itself:

    flac -o test.wav -d 01-A\ Change\ Of\ Seasons.flac
    real 0m3.765s
    user 0m3.392s
    sys 0m0.336s

I ran 3 runs of each test and while the numbers fluctuated slightly, the time spread remained similar on all runs.

In the given example run:
The difference between the best run (process efficient redirect) and the worst (pipe from cat) is 129ms.
The difference between the process efficient redirect and directly reading is only 25ms.

Given this admittedly abysmally inadequate sample size it would appear that shell STDIN redirection provides the fastest decode, but the difference between redirection and directly reading the file is small enough to basically dismiss as noise. It would appear that in all cases the context switches involved with invoking cat yield the slowest results by a significant margin.

This post has been edited by yourlord: Nov 26 2012, 23:29

Nov 26 2012, 23:42 | Post #9
Group: Members | Posts: 1066 | Joined: 4-May 04 | From: France | Member No.: 13875
> I tested this on my machine with mostly insignificant differences.

For obvious reasons:

> On a 171MB FLAC -8 encoded file.

You used a single file (my experiments use many, concurrently) that amounts to a rather small amount of data (I used a total of 1 GiB in order to make the differences more pronounced)!

> I ran one decode to allow Linux to cache the FLAC file in RAM and then discarded the results.

You let your OS cache your file on purpose, so it was decoded from RAM. Why? I'm talking about hard drive access. Do you even understand what I'm talking about?

> A proper process efficient redirection to standard input: flac -o test.wav -d - < 01-A\ Change\ Of\ Seasons.flac

Actually, I tried that, and it's a lot less efficient in my experiments than piping cat's output: my test, using that method, completes in 40 seconds (versus 25) off my HDD.

> Given this admittedly abysmally inadequate sample size

Your entire testing process is completely off-topic.

Nov 27 2012, 01:21 | Post #10
Group: Members | Posts: 144 | Joined: 1-March 11 | Member No.: 88621
> You let your OS cache your file on purpose, so it was decoded from RAM. Why? I'm talking about hard drive access. Do you even understand what I'm talking about?

I eliminated the hard drive from the equation because you're trying to investigate an issue with many variables in play. My goal was to narrow this down to one factor, the method of delivering data to flac, and to test whether there is a discernible and significant performance pattern related to it. As expected, in my limited testing, using shell builtin redirection was faster than spawning a wasted cat process and was slightly faster than letting flac read the file directly.

You came here with a question about why you are seeing this anomaly in your script, and the first step is to break down the steps you use and test whether there is an inherent inefficiency in them. You raised the question about the different performance based on how the data was delivered to flac, and I set out to test each method to see if there was a significant performance hit for any one. You take a slight performance penalty for spawning unneeded processes, and it has an impact even on a single decode. Multiply that 125ms by several hundred decodes and it adds up.

I was simply presenting a small data set test to show the performance differences between the methods you asked about. It may or may not be the source of your problem, but it's a data point that can be considered and then confirmed or eliminated as a contributing factor. You came here with a problem and I tried to provide a small bit of data to aid in your investigation. I'm sorry if my attempt at helping offends you.

> > A proper process efficient redirection to standard input: flac -o test.wav -d - < 01-A\ Change\ Of\ Seasons.flac
>
> Actually, I tried that, and it's a lot less efficient in my experiments than piping cat's output: my test, using that method, completes in 40 seconds (versus 25) off my HDD.

Then there is something else terribly wrong. Shell native redirection should ALWAYS be faster than piping the data from cat. There's an entire process that no longer needs to be spawned and managed (cat) for every single decode operation. I'm not sure why there appears to be a slight performance hit for having flac read the file directly. I'd need to dig into the sources to see where the difference lies, but I would imagine there was a lot more thought put into efficient IO by the people writing bash than by the guy who wrote flac.

Nov 27 2012, 01:32 | Post #11
Group: Members (Donating) | Posts: 1983 | Joined: 4-January 04 | From: Austin, TX | Member No.: 10933
You might try `blockdev --setra 65536 --setfra 65536 <device>` to set blockdev/fs readahead to ridiculously high values.
It's possible that the differences in performance between the HD, USB HD and RAID are primarily due to small I/O timing differences between the processes tickling the pagecache in different ways.

Nov 27 2012, 01:43 | Post #12
Group: Members (Donating) | Posts: 1983 | Joined: 4-January 04 | From: Austin, TX | Member No.: 10933
> Then there is something else terribly wrong. Shell native redirection should ALWAYS be faster than piping the data from cat. There's an entire process that no longer needs to be spawned and managed (cat) for every single decode operation.

Not true. The pipe adds an extra layer of buffering between the filesystem read and the decoding process (and one whose size is adjusted dynamically by the kernel). With a redirect, whenever flac read()s stdin for new data, the read goes right to the kernel. With a pipe, the filesystem read may have already been completed by cat.
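
One way to see the distinction is to check what kind of object a process actually finds on its standard input. The sketch below uses a hypothetical helper, `stdin_kind`, and assumes a system where `/dev/stdin` is available (as on Linux). With a shell redirect, STDIN is the file itself, so every read is a filesystem read; with `cat`, STDIN is a pipe, which is where the extra buffering described in the post would come from.

```shell
#!/bin/sh
# stdin_kind: report what standard input is connected to.
# Hypothetical helper for illustration; relies on /dev/stdin existing.
stdin_kind() {
    if [ -p /dev/stdin ]; then
        echo pipe      # e.g. "cat file | stdin_kind"
    elif [ -f /dev/stdin ]; then
        echo file      # e.g. "stdin_kind < file"
    else
        echo other     # terminal, socket, device, ...
    fi
}
```

Running `stdin_kind < track.flac` and `cat track.flac | stdin_kind` (with any existing file in place of `track.flac`) shows the two cases side by side.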

Nov 27 2012, 02:46 | Post #13
Group: Members (Donating) | Posts: 1983 | Joined: 4-January 04 | From: Austin, TX | Member No.: 10933
OP might also try tuning the I/O scheduler; see e.g. Documentation/block/switching-sched.txt, Documentation/block/cfq-iosched.txt, and Documentation/block/deadline-iosched.txt.

Nov 27 2012, 02:52 | Post #14
Group: Members | Posts: 1066 | Joined: 4-May 04 | From: France | Member No.: 13875
> You might try `blockdev --setra 65536 --setfra 65536 <device>` to set blockdev/fs readahead to ridiculously high values.

Bingo! With those values, the test on the HDD completed in 16 seconds in all cases! All my drives were set to 256 sectors (128 KiB). I noticed that performance improved dramatically when adjusting that value a single step up (512), and 2048 (1 MiB) sounds like a rather sane value.
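
The sector counts quoted in this post convert to sizes as follows, given the 512-byte unit that `blockdev` uses for readahead. This is only a sketch: the helper names are made up for illustration, `show_ra` and `set_ra` need root, and a value set this way is not expected to survive a reboot.

```shell
#!/bin/sh
# Readahead is expressed in 512-byte sectors: 256 sectors = 128 KiB.
ra_sectors_to_kib() { echo $(( $1 * 512 / 1024 )); }

# show_ra DEVICE: print the current readahead of a block device (needs root).
show_ra() {
    sectors=$(blockdev --getra "$1") || return 1
    echo "$1: $sectors sectors ($(ra_sectors_to_kib "$sectors") KiB)"
}

# set_ra DEVICE SECTORS: change the readahead at runtime (needs root).
set_ra() { blockdev --setra "$2" "$1"; }
```

For example, `ra_sectors_to_kib 2048` gives 1024 (1 MiB), and the "ridiculously high" 65536 sectors works out to 32768 KiB (32 MiB). A call such as `set_ra /dev/sdX 2048` would apply the value suggested above, with `/dev/sdX` standing in for the real device.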

Nov 27 2012, 19:59 | Post #15
Group: Members (Donating) | Posts: 1983 | Joined: 4-January 04 | From: Austin, TX | Member No.: 10933
Cool.
Note that --setra and --setfra are completely different settings IIRC. Setting these values too high could compromise performance on other applications, so unless the drive is devoted to music, you should probably tune them down appropriately.

I'm rather curious as to if you can improve performance at the default readahead values by instead tuning CFQ params.

Nov 27 2012, 21:59 | Post #16
Group: Members | Posts: 1066 | Joined: 4-May 04 | From: France | Member No.: 13875
> I'm rather curious as to if you can improve performance at the default readahead values by instead tuning CFQ params.

Yes: 23 seconds with CFQ/readahead at 256 (vs. 40 seconds with deadline), 17 seconds with CFQ/readahead at 16384. I completely forgot that I changed the scheduler to deadline years ago.
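
For reference, the active scheduler can be inspected and switched through sysfs, as described in the switching-sched.txt document mentioned earlier in the thread. A minimal sketch, assuming a device name such as `sda` (a placeholder) and root privileges for the switch; the helper names are illustrative, and which schedulers are listed depends on the kernel.

```shell
#!/bin/sh
# The kernel lists the available schedulers on one line and marks the
# active one with brackets, e.g. "noop deadline [cfq]".
parse_active() { sed -n 's/.*\[\([^]]*\)\].*/\1/p'; }

# active_sched DEV: print the scheduler currently used by /dev/DEV.
active_sched() { parse_active < "/sys/block/$1/queue/scheduler"; }

# set_sched DEV NAME: switch the scheduler at runtime (needs root).
set_sched() { echo "$2" > "/sys/block/$1/queue/scheduler"; }
```

With these, `active_sched sda` reports the current choice and `set_sched sda cfq` would switch back from deadline to CFQ on a kernel that offers it.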

Nov 28 2012, 01:20 | Post #17
Group: Members (Donating) | Posts: 1983 | Joined: 4-January 04 | From: Austin, TX | Member No.: 10933
> > I'm rather curious as to if you can improve performance at the default readahead values by instead tuning CFQ params.
>
> Yes: 23 seconds with CFQ/readahead at 256 (vs. 40 seconds with deadline), 17 seconds with CFQ/readahead at 16384. I completely forgot that I changed the scheduler to deadline years ago.

This post has been edited by Axon: Nov 28 2012, 01:24