Let me provide a little background before I ask my question:
I have been experimenting with various linux-based media players in regards to how they handle very large music libraries and so far I have been generally dissatisfied with the performance of all the major offerings (ie, rhythmbox, bmp/bmpx, banshee, etc). While they all pass as acceptable media players, none have obviously come close to serving as a drop-in solution for everyone's favorit media player, foobar2000. My biggest complaint about these media players is how they handle my obscenely large music collection (27,000+ tracks). I have been spoiled by the experience of foobar where I can open the player and have it come up near instantaneously and immediately have access to my album list in my nifty album list panel. From here I can immediately click on an artist/album/track/etc. and start playing. This is with the very same music collection. This feat however seems to elude all of the developers of linux based media players as there are unacceptable wait times in between each of the steps I mentioned. I know for a fact that this is not a performance bottleneck in linux/gnome because everything else on the same system, for lack of a better term, "hauls ass".
Given the size of the library it has to manage, I see their being 2 bottlenecks. The first is the backend storage method. Some use relational databases, some use xml files, and so on. The performance of the backend will have a huge impact on how quickly the media player loads as it will need to query the entire database on startup. The second is the method used to juggle all that info in memory (ie data structures). These can be as simple as arrays, lists and hashtables, or as complex as binary tree derivatives or various styles of linked lists. This influences the responsiveness of playlists and searches which would consume a majority of the cpu time actual playback and misc processing notwithstanding.
(I will use banshee as my primary example since I have analyzed it's source code). Banshee uses sqlite for its backend. From what I've read, this seems to be a preferred low calorie sql oriented storage method. Internally it uses a managed array for its playlists, and Hashtables for its library (meta info, tracknames, etc). Ok fine, this is an acceptable way to do it and works fine for most applications.
So take this implementation and now put it under stress. Lets say we import 160Gb of mp3s, mpcs, m4as, flacs, oggs, and so on, all properly tagged. This obviously takes a while to process on the initial import so no big deal, go grab a cup of joe, or a smoke or what have you, even foobar takes its time during this process which is understandable. As an every day end user, my expectation for the next time I open my media player is to have it come up without any noticeable delay since all of the data has already been processed, packaged, and filed away, so in theory all it needs to do is do a quick sql query just to populate its internal library. Well in practice this is not what happens. Banshee can take as long as one minute before my library is presented to me and many times longer. Rhythmbox typically takes about 5 minutes before it begins to settle down(This surprises me since it uses some sort of tree structure internally). Either way, this is unbearable. Bmpx just crashes (super alpha software, so it's off the hook).
So let me get to the point now. Let's say I want to reimplement these features in these media players or maybe I was going to write a player from scratch. I now must make a decision on what data structures and backends would be most efficient for my purpose. So my question is what would a really intelligent developer do in this scenario. Obviously the solution is not to rely on vanilla language constructs like the existing players do, it would make more sense to use something a little more high performance. What I need is a data structure that can handle very large capacities of data and has respectable performance. Ideally what I would like is to get ahold of the foobar2000 source code and see wwpd (what would peter do?). It is apparent to me that whatever data structure/algorithm he uses to manage the internal library and playlists clearly rocks. Anyone have any clue regarding his secret ingredients?
I have been doing some reading on efficient data structures what seems to have the most potential in this scenario for me is the SkipList. This structure seems like it would perform phenomenally in rapid searching, requiring only a handful of comparisons/operations (12-18) to search a list numbering in the hundreds of thousands. This sounds like a pretty good deal compared to other methods. The best explanation for how a skiplist works I found here: MSDN Data Structues (Skiplist). What is the general concensus on this? Would anyone agree or disagree that this would be an ideal method? Is there something better?
With that out of the way, I would then want to know what would be an ideal backend data store? Performance being the key motive, would it be worthwhile to just serialize our skiplist into a binary file or are relational databases more suitable? I'm not at my personal computer right now so I can't analyze foobar's database file but I assume it's a binary format of Peter's design?
I would certainly appreciate any insight into these types of problems. I pose the question on HA because I know almost everyone here is a developer (some for media players), not to mention there are some brilliant mathematical minds as well. I am even open to the theoretical.
On a tangent, if these points were optimized on a linux based player, combined with the recent ability of mono to support Windows.Forms, it does not seem like such a stretch that a near identical clone of foobar2000 could actually come to life on the linux platform. Any plans for peter to embrace .net?
