Wednesday, December 3, 2014

Word Frequency Dictionary & Sorted Occurrence Ranking Dictionary for Generic Item Quantification and other Statistical Analyses




Taking a little inspiration from DotNetPerl's MultiMap, I created a generic class that uses a Dictionary under the hood to create a keyed collection that tracks how often duplicate items of an arbitrary type occur. For lack of a better name, I call it a FrequencyDictionary. Instead of Key/Value pairs, the FrequencyDictionary has the notion of an Item/Frequency pair, with the key being the Item. I strayed from Key/Value because I may ultimately end up calling the Item the Value.

I invented the FrequencyDictionary while writing a pisg-like IRC log file statistics generator program. It works well for word frequency analysis. After some pre-processing/cleaning of the text log file, I parse each line by spaces to get an array of words. I then use FrequencyDictionary.AddRange to add the words to my dictionary.

The FrequencyDictionary is essentially a normal Dictionary<ITM, int> (with ITM being type String in this case); however, when you attempt to add a key that already exists, it simply increments the integer Value. After processing the file, you have a Dictionary where the Key is a word that was found in the source text, and the Value is the number of occurrences of that word in the source text.
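To make the counting behavior concrete, here is a minimal sketch of the same idea using only a plain Dictionary. The names WordCountSketch and CountWords are made up for this illustration and are not part of the classes in this post, and the splitting rule (single spaces, empty entries dropped) is an assumption about the pre-processing.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Counts how many times each space-separated word occurs in the given lines.
    // Hypothetical helper for illustration only.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            // Assumption: words are separated by spaces; runs of spaces are ignored.
            string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                int current;
                counts.TryGetValue(word, out current); // current stays 0 when the word is new
                counts[word] = current + 1;
            }
        }
        return counts;
    }
}
```

Calling WordCountSketch.CountWords(new[] { "the cat", "the dog" }) should give a count of 2 for "the" and 1 each for "cat" and "dog".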

Taking the idea even further, I created a complementary Dictionary that is essentially a sorted version of the FrequencyDictionary with the Keys and Values reversed. I call it a RankingDictionary because the Keys represent the number of occurrences of a word, with the Values being a List of words that occurred that many times in the source text, and because I was using it to produce top 10 rankings of the most popular words, most popular nicks, etc.
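The reversal described above can be sketched with a SortedDictionary and a descending comparer. This is only an illustration under my own naming (RankingSketch and Rank are hypothetical), and it assumes .NET 4.5 or later for Comparer<int>.Create.

```csharp
using System;
using System.Collections.Generic;

public static class RankingSketch
{
    // Groups items by their count, with the highest count first.
    // Hypothetical helper for illustration only.
    public static SortedDictionary<int, List<string>> Rank(IDictionary<string, int> counts)
    {
        // Reverse the default ascending order so the largest count comes first.
        var ranking = new SortedDictionary<int, List<string>>(
            Comparer<int>.Create((a, b) => b.CompareTo(a)));

        foreach (KeyValuePair<string, int> kvp in counts)
        {
            List<string> items;
            if (!ranking.TryGetValue(kvp.Value, out items))
            {
                // First item seen with this count: start a new list for it.
                items = new List<string>();
                ranking[kvp.Value] = items;
            }
            items.Add(kvp.Key); // items that tie on count share one list
        }
        return ranking;
    }
}
```

The important detail is that items with the same count end up in the same list rather than replacing each other.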

The FrequencyDictionary has a GetRankingDictionary() method in it to make the whole thing very easy to use. Typically I don't use the FrequencyDictionary too extensively, but rather as a means to get a RankingDictionary, which I base the majority of my IRC statistics logic on. The RankingDictionary also proved very useful for finding Naked Candidates and Hidden Pairs or Triples in my Sudoku solver application, which I will be releasing on Windows and Windows Phone and blogging about shortly. Hell, I was even thinking about releasing the source code to my Sudoku App, since it's so elegant and a great example of beautiful, readable code for a complex problem.
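For the top 10 style of report, a rough sketch with LINQ over plain word counts might look like the following. TopNSketch and Top are hypothetical names, and breaking ties alphabetically (ordinal) is my own assumption so that the output is deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopNSketch
{
    // Returns the n most frequent words, highest count first.
    // Hypothetical helper for illustration only.
    public static List<KeyValuePair<string, int>> Top(IDictionary<string, int> counts, int n)
    {
        return counts
            .OrderByDescending(kvp => kvp.Value)            // most frequent first
            .ThenBy(kvp => kvp.Key, StringComparer.Ordinal) // assumption: ties sorted by word
            .Take(n)
            .ToList();
    }
}
```

Given counts of lol=5, the=9 and irc=5, Top(counts, 2) should return "the" followed by "irc".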

Anyways, the code for the Frequency and Ranking Dictionary is heavily commented with XML Documentation Comments, so I'm going to go ahead and let the code speak for itself. I will add usage examples later. In fact, I will probably release the pisg-like IRC Stats Prog source code since I don't think I'm going to go any further with it.

Limitations: I ran into problems trying to parse large text files approaching 3-4 megabytes. Besides taking up a bunch of memory, the Dictionary begins to encounter many hash collisions once the size of the collection gets large enough. This completely kills performance, and eventually most of the time is spent trying to resolve collisions, so it grinds to a near halt. You might notice the constructor public FrequencyDictionary(int Capacity), where you can specify a capacity for the Dictionary; keep in mind that Dictionary's own int constructor takes an initial capacity, not a hard limit. A good, safe ceiling is about 10,000. A better implementation of the GetHashCode() method might be in order, but that is not a problem I have felt like solving yet.
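As a small sketch of the capacity idea, this is how a plain Dictionary can be pre-sized. PresizedCountSketch and Create are hypothetical names, and treating the 10,000 figure as an initial capacity rather than an enforced limit is my reading, not something the classes in this post guarantee.

```csharp
using System;
using System.Collections.Generic;

public static class PresizedCountSketch
{
    // Creates an empty word-count dictionary with room reserved for expectedWords entries.
    // Hypothetical helper for illustration only.
    public static Dictionary<string, int> Create(int expectedWords)
    {
        // The int argument is an initial capacity; the dictionary still grows past it.
        return new Dictionary<string, int>(expectedWords);
    }
}
```

Pre-sizing avoids repeated resizing while the collection fills up, but it does not stop more entries from being added than the number passed in.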




using System.Collections.Generic;
using System.Linq;

/// <summary>
/// A keyed collection of Item/Frequency pairs, (keyed off Item).
/// If a duplicate Item is added to the Dictionary, the Frequency for that Item is incremented.
/// </summary>
public class FrequencyDictionary<ITM>
{
  // The underlying Dictionary
  private Dictionary<ITM, int> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public FrequencyDictionary() { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty and has the specified initial Capacity.
  /// </summary>
  public FrequencyDictionary(int Capacity) { _dictionary = new Dictionary<ITM, int>(Capacity); }
  
  /// <summary>
  /// Gets a collection containing the Items in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<ITM> ItemCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Values;} }
  
  /// <summary>
  /// Adds the specified Item to the FrequencyDictionary.
  /// </summary>
  public void Add(ITM Item)
  {
    if  ( this._dictionary.ContainsKey(Item))   { this._dictionary[Item]++; }
    else    { this._dictionary.Add(Item,1); }
  }
  
  /// <summary>
  /// Adds the elements of the specified array to the FrequencyDictionary.
  /// </summary>
  public void AddRange(ITM[] Items)
  {
    foreach(ITM item in Items) { this.Add(item); }
  }
  
  /// <summary>
  /// Gets the Item that occurs most frequently.
  /// </summary>
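  /// <returns>A KeyValuePair containing the Item (key) and how many times it has appeared (value), or the default KeyValuePair if the FrequencyDictionary is empty.</returns>
  public KeyValuePair<ITM,int> GetMostFrequent()
  {
    // Max() throws on an empty sequence, so handle the empty case first
    if (this._dictionary.Count == 0) { return default(KeyValuePair<ITM,int>); }
    int maxValue = this._dictionary.Values.Max();
    return this._dictionary.Where(kvp => kvp.Value == maxValue).FirstOrDefault();
  }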
  
  /// <summary>
  /// Gets the number of Item/Frequency pairs contained in the FrequencyDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the FrequencyDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<ITM,int>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the Frequency (occurrences) associated with the specified Item.
  /// </summary>
  public int this[ITM Item]
  {
    get { if (this._dictionary.ContainsKey(Item)) { return this._dictionary[Item]; } return 0; }
  }
  
  /// <summary>
  /// Creates a RankingDictionary from the current FrequencyDictionary.
  /// </summary>
  /// <returns>A RankingDictionary of Frequency/ItemCollection pairs ordered by Frequency.</returns>
  public RankingDictionary<ITM> GetRankingDictionary()
  {
    RankingDictionary<ITM> result = new RankingDictionary<ITM>();
    foreach(KeyValuePair<ITM,int> kvp in _dictionary)
    {
      result.Add(kvp.Value,kvp.Key);
    }
    return result;
  }
  
  /// <summary>
  /// Displays usage information for FrequencyDictionary 
  /// </summary>
  public override string ToString()
  {
    return "FrequencyDictionary<Item, Frequency> : Key=\"Item\", Value=\"Frequency\".";
  }
}




And now the RankingDictionary:

/// <summary>
/// A keyed collection of Frequency/ItemCollection pairs that is ordered by Frequency (rank).
/// If an Item is added that has the same Frequency as another Item, that Item is added to the Item collection for that Frequency.
/// </summary>
public class RankingDictionary<ITM>
{
  // Underlying dictionary
  SortedDictionary<int,List<ITM>> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the RankingDictionary that is empty.
  /// </summary>
  public RankingDictionary() { _dictionary = new SortedDictionary<int,List<ITM>>(new FrequencyComparer()); }
  
  /// <summary>
  /// The Comparer used to compare Frequencies.
  /// </summary>
  public class FrequencyComparer : IComparer<int>
  {
    public int Compare(int one,int two) { if(one == two) return 0; else if(one > two) return -1; else return 1; }
  }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the RankingDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the ItemCollections in the RankingDictionary.
  /// </summary>
  public IEnumerable<List<ITM>> ItemCollections   { get { return this._dictionary.Values; } }
  
  /// <summary>
  /// Adds the specified Frequency and Item to the RankingDictionary.
  /// </summary>
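  public void Add(int Frequency, ITM Item)
  {
    List<ITM> itemCollection;
    // Reuse the existing list for this Frequency so Items that tie are not overwritten
    if (!this._dictionary.TryGetValue(Frequency, out itemCollection))
    {
      // If the specified Key is not found, a set operation creates a new element with the specified Key
      itemCollection = new List<ITM>();
      this._dictionary[Frequency] = itemCollection;
    }
    itemCollection.Add(Item);
  }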
  
  /// <summary>
  ///  Gets the number of Frequency/ItemCollection pairs contained in the RankingDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the RankingDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<int,List<ITM>>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the ItemCollection associated with the specified Frequency.
  /// </summary>
  public List<ITM> this[int Frequency]
  {
    get
    {
      List<ITM> itemCollection;
      if (this._dictionary.TryGetValue(Frequency,out itemCollection)) return itemCollection;
      else return new List<ITM>();
    }
  }
  
  /// <summary>
  /// Displays usage information for RankingDictionary 
  /// </summary>
  public override string ToString()
  {
    return "RankingDictionary<Frequency, List<Item>> : Key=\"Frequency\", Value=\"List<Item>\".";
  }
}


Usage examples will be posted later.