Decreasing KMeans Clustering RAM Consumption with Sparse Vectors

I’m reading “Building Machine Learning Systems with Python” (Packt Publishing) at the moment and there’s an interesting chapter in it on finding similar documents using Tf-Idf, so I thought I’d expand on it and try out using KMeans clustering to group together similar documents.

Grouping together similar documents automatically could be quite useful for improving the quality of statistical machine translation systems which rely on being trained on multilingual content in a similar domain (a fair amount of research has been done on this already) and for automatically detecting the domain of text.

Even though I use Python a bit, I’m most productive in C# and it’s the language I use at work so I used that. Unfortunately, while the Tf-Idf algorithm is pretty simple, C# implementations I found on the Web were not CPU or RAM efficient and would have collapsed under any serious volume of data, they also didn’t support multi-threading very well (or at all), so I had to write my own version which makes use of the ConcurrentDictionary rather than lists of content.

No Scipy meant that I also had to write the code for creating a vector suitable for use in a Machine Learning algorithm too (ConvertTfIdfToVector).

using LemmaSharp;
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace TfIdf
{
    public class TermFrequencyInverseDocumentFrequency
    {
        /// <summary>
        /// The number of times a token appears in a document within the corpus.
        /// </summary>
        public ConcurrentDictionary<string, int> CorpusFrequency { get; set; }
        
        /// <summary>
        /// A dictionary of all the words in the document and their position in
        /// the output vector.  The first word encountered will be at position zero,
        /// the last new word encountered will have the largest value.
        /// </summary>
        public ConcurrentDictionary<string, int> DistinctWords { get; set; }

        /// <summary>
        /// The number of documents added to the corpus.
        /// </summary>
        public int DocumentCount { get; set; }

        LemmatizerPrebuiltCompact lemmatizer;
        static Regex wordBoundaryRegex = new Regex(@"\b", RegexOptions.Compiled);

        public TermFrequencyInverseDocumentFrequency(CultureInfo culture)
            : this(GetLemmatizer(culture))
        {
        }

        private TermFrequencyInverseDocumentFrequency(LanguagePrebuilt language)
        {
            this.CorpusFrequency = new ConcurrentDictionary<string, int>();
            this.lemmatizer = new LemmatizerPrebuiltCompact(language);
            this.DistinctWords = new ConcurrentDictionary<string, int>();
        }

        private static LanguagePrebuilt GetLemmatizer(CultureInfo culture)
        {
            while (culture.Parent != null && culture.Parent != CultureInfo.InvariantCulture)
            {
                culture = culture.Parent;
            }

            switch (culture.Name)
            {
                case "bg":
                    return LanguagePrebuilt.Bulgarian;
                case "cs":
                    return LanguagePrebuilt.Czech;
                case "et":
                    return LanguagePrebuilt.Estonian;
                case "fa":
                    return LanguagePrebuilt.Persian;
                case "fr":
                    return LanguagePrebuilt.French;
                case "hu":
                    return LanguagePrebuilt.Hungarian;
                case "mk":
                    return LanguagePrebuilt.Macedonian;
                case "pl":
                    return LanguagePrebuilt.Polish;
                case "ro":
                    return LanguagePrebuilt.Romanian;
                case "ru":
                    return LanguagePrebuilt.Russian;
                case "sk":
                    return LanguagePrebuilt.Slovak;
                case "sl":
                    return LanguagePrebuilt.Slovene;
                case "sr":
                    return LanguagePrebuilt.Serbian;
                case "uk":
                    return LanguagePrebuilt.Ukrainian;
                case "de":
                    return LanguagePrebuilt.German;
                case "it":
                    return LanguagePrebuilt.Italian;
                case "es":
                    return LanguagePrebuilt.Spanish;
                case "en":
                default:
                    return LanguagePrebuilt.English;
            }
        }

        /// <summary>
        /// Used to continue calculation of the corpus term frequency.
        /// </summary>
        /// <param name="document"></param>
        public void AddDocumentToCorpus(IEnumerable<string> document)
        {
            foreach(var token in document.SelectMany(sentence => SplitAndLemmatise(sentence)).Distinct())
            {
                CorpusFrequency.AddOrUpdate(token, 1, (key, value) => value + 1);
                DistinctWords.TryAdd(token, this.DistinctWords.Count);
            }

            this.DocumentCount++;
        }

        /// <summary>
        /// Used for unit testing to set the corpus data, instead of adding it via the 
        /// AddDocumentToCorpus() method.  This simplifies creation of test corpus data.
        /// </summary>
        /// <param name="tokensAndCount"></param>
        /// <param name="totalDocuments"></param>
        public void AddDocumentDataToCorpusForUnitTest(Dictionary<string, int> tokensAndCount, int totalDocuments)
        {
            this.DocumentCount = totalDocuments;

            foreach(var item in tokensAndCount)
            {
                this.CorpusFrequency.AddOrUpdate(item.Key, item.Value, (k, v) => item.Value);
            }
        }

        /// <summary>
        /// Calculates the TfIdf for a document.
        /// </summary>
        /// <param name="document">A document containing sentences.</param>
        /// <returns>A dictionary of terms and their corresponding TfIdf values</returns>
        public Dictionary<string, double> CalculateTfIdf(IEnumerable<string> document)
        {
            var wordsInDocument = new ConcurrentDictionary<string, int>();
            int documentWordCount = 0;

            foreach (var sentence in document)
            {
                foreach (var word in SplitAndLemmatise(sentence))
                {
                    wordsInDocument.AddOrUpdate(word, 1, (key, value) => value + 1);
                    documentWordCount++;
                }
            }

            return wordsInDocument.ToDictionary(kvp => kvp.Key, kvp =>
            {
                int documentFrequency = kvp.Value;
                double tf = documentFrequency / (double)documentWordCount;

                double idf = CalculateInverseDocumentFrequency(kvp.Key);

                return tf * idf;
            });
        }

        public double CalculateInverseDocumentFrequency(string token, bool retokenize = false)
        {
            token = retokenize ? lemmatizer.Lemmatize(token.Trim().ToLowerInvariant()) : token;

            bool isWordPresentInCorpus = this.CorpusFrequency.ContainsKey(token);

            if (isWordPresentInCorpus)
            {
                int numberOfTimesTheTokenIsPresentInTheCorpus = CorpusFrequency[token];

                return Math.Log(this.DocumentCount / (double)(numberOfTimesTheTokenIsPresentInTheCorpus));
            }
            else
            {
                return 0d;
            }
        }

        IEnumerable<string> SplitAndLemmatise(string sentence)
        {
            foreach (var word in wordBoundaryRegex.Split(sentence)
                .Where(w => !string.IsNullOrWhiteSpace(w))
                .Where(w => w.Any(c => char.IsLetterOrDigit(c)))
                .Select(w => w.Trim().ToLowerInvariant()))
            {
                yield return lemmatizer.Lemmatize(word);
            }
        }

        public double[] ConvertTfIdfToVector(Dictionary<string, double> tfIdf)
        {
            var rv = new double[this.DistinctWords.Count];

            foreach (var item in tfIdf)
            {
                // Look up the index of the word in the corpus and set the vector value.
                int index = 0;
                if (this.DistinctWords.TryGetValue(item.Key, out index))
                {
                    rv[index] = item.Value;
                }
            }

            return rv;
        }

        public static string CalculateVectorHash(double[] vector)
        {
            using (var sha256 = SHA256Managed.Create())
            {
                var sb = new StringBuilder();

                foreach (var b in sha256.ComputeHash(ConvertVectorToStream(vector)))
                {
                    sb.Append(b.ToString("x2"));
                }

                return sb.ToString();
            }
        }

        private static Stream ConvertVectorToStream(double[] vector)
        {
            var ms = new MemoryStream();

            foreach (var d in vector)
            {
                var bytes = BitConverter.GetBytes(d);
                ms.Write(bytes, 0, bytes.Length);
            }

            ms.Position = 0;

            return ms;
        }
    }
}

I often use Wikipedia or Gutenberg books as test data, since the data is freely available, so I pulled in 1000 books at random and used the Encog Machine Learning framework’s KMeans implementation to cluster the vectors together.

I first noticed that the RAM consumption on my PC was pretty high and the reason was obvious enough. Each book is reduced to an array of doubles (64 bits) with the same length as all of the distinct words found in all of the books, so you end up with each array taking up more than 1MB of RAM.

Encog’s input formats don’t currently allow for tagging the input data with extra metadata, you only get the cluster id and the input data. This is a problem, because when you get the result out of the clustering, you need a way to reference it back to the source of the data (e.g. filename or similar).

My first implementation just stored two copies of the data, one copy in Encog, and one copy used to lookup the origin of the data (e.g. the name of the book).

So, not only were the arrays big, but I had two copies…

To skip storing two full copies, I hashed the vector into a string and stored that instead because it’s much less data to store. To look up the metadata at the end, I can pass the data output from Encog through the same hashing algorithm and get the same string, just like looking up a password hash. The code for that is also in the gist above.

The Encog RAM consumption was still high, but I’d read about Numpy’s built-in support for sparse matrices, which don’t save every single value in an array like Encog, but instead store the index position and the data, so for sparse matrices, you can save a lot of RAM and with a couple of hours hacking on Encog I was able to create a new type of input data in Encog called SparseMLData and submit a pull request to the project: [0]

https://github.com/encog/encog-dotnet-core/pull/65 [0]

Here’s a graph showing the reduced RAM consumption of the clustering performance over a set of 500 books. Happy days.

So what were the results like? On first few passes, mixed… I can see that the data is being clustered somewhat effectively, but I think I need much bigger sets of data or to generate some supervised data for validation…