I’m reading “Building Machine Learning Systems with Python” (Packt Publishing) at the moment and there’s an interesting chapter in it on finding similar documents using Tf-Idf, so I thought I’d expand on it and try out using KMeans clustering to group together similar documents.
Grouping together similar documents automatically could be quite useful for improving the quality of statistical machine translation systems which rely on being trained on multilingual content in a similar domain (a fair amount of research has been done on this already) and for automatically detecting the domain of text.
Even though I use Python a bit, I’m most productive in C# and it’s the language I use at work so I used that. Unfortunately, while the Tf-Idf algorithm is pretty simple, C# implementations I found on the Web were not CPU or RAM efficient and would have collapsed under any serious volume of data, they also didn’t support multi-threading very well (or at all), so I had to write my own version which makes use of the ConcurrentDictionary rather than lists of content.
No Scipy meant that I also had to write the code for creating a vector suitable for use in a Machine Learning algorithm too (ConvertTfIdfToVector).
I often use Wikipedia or Gutenberg books as test data, since the data is freely available, so I pulled in 1000 books at random and used the Encog Machine Learning framework’s KMeans implementation to cluster the vectors together.
I first noticed that the RAM consumption on my PC was pretty high and the reason was obvious enough. Each book is reduced to an array of doubles (64 bits) with the same length as all of the distinct words found in all of the books, so you end up with each array taking up more than 1MB of RAM.
Encog’s input formats don’t currently allow for tagging the input data with extra metadata, you only get the cluster id and the input data. This is a problem, because when you get the result out of the clustering, you need a way to reference it back to the source of the data (e.g. filename or similar).
My first implementation just stored two copies of the data, one copy in Encog, and one copy used to lookup the origin of the data (e.g. the name of the book).
So, not only were the arrays big, but I had two copies…
To skip storing two full copies, I hashed the vector into a string and stored that instead because it’s much less data to store. To look up the metadata at the end, I can pass the data output from Encog through the same hashing algorithm and get the same string, just like looking up a password hash. The code for that is also in the gist above.
The Encog RAM consumption was still high, but I’d read about Numpy’s built-in support for sparse matrices, which don’t save every single value in an array like Encog, but instead store the index position and the data, so for sparse matrices, you can save a lot of RAM and with a couple of hours hacking on Encog I was able to create a new type of input data in Encog called SparseMLData and submit a pull request to the project: https://github.com/encog/encog-dotnet-core/pull/65
Here’s a graph showing the reduced RAM consumption of the clustering performance over a set of 500 books. Happy days.
So what were the results like? On first few passes, mixed… I can see that the data is being clustered somewhat effectively, but I think I need much bigger sets of data or to generate some supervised data for validation…