Friday, October 4, 2013

F# Collections - Part 1

F# has three different collection types to hold values of the same type

  1. Array - A fixed-size, zero-based, mutable collection
  2. List - An ordered, immutable linked list
  3. Sequence -  A logical series of elements

They have similarities but also many differences. Very few can probably remember all the functions that can be applied for a given type (Intellisense will help!) but a solid grasp of many of the functions are necessary to have in memory. For my reference I wrote this sheet to list the functions, first and second column are the important ones. In the available column ‘a’ is array, ‘l’ list, and ‘s’ is sequence. Some descriptions are missing but I will add that later as I experiment (for full details go here).

Skip this list and goto my experiments below.

Function
Description
Available
append
Add elements and return new collection
als
average
Calculate average
als
averageBy
Calculate average with a function applied to each element
als
blit
Copy section
a
cache
Compute and store elemnts
s
cast
Convert to type
s
choose
Apply function and return Some
als
collect
Apply function, concatenate and return
als
compareWith
Compare with a function
s
concat
Combines
als
countBy
Appy key-generating function to each element and return
s
copy
Copy collection
as
create
Create array
a
delay

s
distinct
Return collection with no duplicates
s
distinctBy
Return collection with no dupliates according to equality function
s
empty
Create empty collection
als
exists
Tests if any elements satisfies condition
als
exists2
Test is pair of elements satisfies condition
as
fill
Set a range of elements to a given value
a
filter
Filter and return new collection
als
find
Return first found element
als
findIndex
Return index of first found element
als
fold
Apply function to each element, threading an accumulator argument
als
fold2
Apply function to each element in two collection, threading an accumulator argument
al
foldBack

al
foldBack2

al
forall
Test if all elements satisfies predicate
als
forall2
Test if all elements satisfy predicate pairwise
als
get/nth
Get element
als
head
Get first element
ls
init
Create collection given dimension and function
als
initInfinite
Generate sequence
s
isEmpty
Test is empty
als
iter
Apply function to each element
als
iteri

als
iteri2
Apply function to pair of elements
al
iter2

al
length
Return length
als
map
Build a collection by applying a function
als
map2
Build a collection by applying a function to two collections
l
map3

l
mapi
Build array
als
mapi2
Build collection
als
max
Return greatest element
als
maxBy
Return greatest element compared with function
als
min
Return smallest element
als
minBy
Return smallest element compared with function
als
ofArray,
ofList,
ofSeq
Create a new collection of a different type

pairwise

s
partition
Split collection into two collections
al
permute
Return array with all elements permuted
al
pick
Apply function to successive elements
als
readonly

s
reduce

als
reduceBack

l
replicate
Create list of specified length with elements set to given value
l
rev
Reverse collection
al
scan

als
scanBack

al
singleton

s
set
Set element to specified value
a
skip
Skip elements and return new collection
s
skipWhile

s
sort
Sort using compare
als
sortBy

als
sortInPlace

a
sortInPlaceBy

a
sortInPlaceWith

a
sortWith
Sort using comparison function
al
sub
Create array from subrange
a
sum
Return sum of elements
als
sumBy
Return sum of element with function applied
als
tail
Return list without first element
l
take
Return elements up to specified count
s
takeWhile

s
toArray,
toList,
toSeq
Create a new collection of a different type

truncate
Return a sequence with no more than N elements
s
tryFind

als
tryFindIndex

als
tryPick

als
unfold

s
unzip
Split list of pairs into two lists
al
unzip3
Split list of triples into three lists
als
windowed

s
zip
Combines the two collections into a list of pairs
als
zip3
Combines the three collections into a list of triples
als


There are several ways to create collections but the most basic is:

let a = [| 0 .. 10 .. 100 |]
let l = [ 0 .. 10 .. 100 ]
let s = { 0 .. 10 .. 100 }

The collections above have small footprints so let’s create larger collections and make a discovery. I add a stopwatch to determine the time for each binding.

let N = int(10e6)
open System.Diagnostics
let stopWatch = Stopwatch.StartNew()
let aXL = Array.init N (fun i -> i)
let aXLt = stopWatch.Elapsed.TotalMilliseconds
let lXL = List.init N (fun i -> i)
let lXLt= stopWatch.Elapsed.TotalMilliseconds - aXLt
let sXL = Array.init N (fun i -> i)
let sXLt = stopWatch.Elapsed.TotalMilliseconds - lXLt
stopWatch.Stop()


printf "%f %f %f" aXLt lXLt sXLt

The output I get is: 35.113200 1038.648800 72.062900
Initializing the list takes considerable longer and can be explained that lists are in fact linked lists.

Now let’s try to update the element at in the middle (index 5). For the array you can do this:

a.SetValue(0, 5)

But for both the list and sequence you will discover that there is not straightforward way in doing the same. Only array elements are mutable as I wrote in the first paragraph.
To get a single element and bind to a new value:

            let a5 = Array.get a 5
       let l5 = List.nth l 5
       let s5 = Seq.nth 5 s

For arrays and list it can also be written:

            let a5 = a.GetValue(5)
       let l5 = l.Item(5)

The same collections can also be created as

let a = Array.init 11 (fun n -> n * 10)
let l = List.init 11 (fun n -> n * 10)

let s = Seq.init 11 (fun n -> n * 10)

By now you have probably seen that ‘s’ has a different signature

                val s : seq<int>

It is because the sequence is not actually evaluated when created. The sequence is represented as System.IEnumerable which means that the sequence is lazily evaluated. With lazy evaluation it is possible to create infinite collections, which otherwise would consume an infinite amount of memory 

            let sInfinite = Seq.initInfinite (fun n -> n * 10)
       Seq.nth 10 sInfinite
       Seq.nth 2147483647 sInfinite

The third row will take a while to evaluate, as it’s the largest possible value for a 32 bit int, and might not evaluate to the number you expect.

The average of a integer collection can easily be calculated

           Array.averageBy (fun i -> float i) a
      List.averageBy (fun i -> float i) l
      Seq.averageBy (fun i -> float i) s

Let’s try this on a larger collection and measure the performance

open System.Diagnostics
let stopWatch = Stopwatch.StartNew()
let aa = Array.averageBy (fun i -> float i) a
let aat = stopWatch.Elapsed.TotalMilliseconds
let la = List.averageBy (fun i -> float i) l
let lat= stopWatch.Elapsed.TotalMilliseconds - aat
let sa = Seq.averageBy (fun i -> float i) s
let sat = stopWatch.Elapsed.TotalMilliseconds - lat
stopWatch.Stop()


printf "%f %f %f " aat lat sat

You’ll find that the sequence is the slowest.

Since these collections are similar but have differences there will be situations where one collection needs to be casted to another type. There are two functions to use for casting, e.g. for an array there are both toList and ofList. The former will cast the array to a list and the latter will cast a list to an array. This means that there are two ways of casting an array to a list , i.e. Array.ToList and List.OfArray. The obvious question is what’s the difference?  To find out I decompiled the Array.Module which holds the functions.

    public static FSharpList<T> ToList<T>(T[] array)
    {
      if ((object) array == null)
        throw new ArgumentNullException("array");
      else
        return List.ofArray<T>(array);
    }

    public static T[] OfList<T>(FSharpList<T> list)
    {
      if ((object) array == null)
        throw new ArgumentNullException("list");
      else
        return List.toArray<T>(list);

    }

The function makes a call to a function in the List module.

    public static FSharpList<T> OfArray<T>(T[] array)
    {
      return List.ofArray<T>(array);
    }

    public static T[] ToArray<T>(FSharpList<T> list)
    {
        return List.toArray<T>(list);

    }

Examining this we find that using ToList() in the Array Module and using ofArray() in the List Module is in fact the same thing except for the if condition equals null. Note the difference of the static method with the first capital capital and calls to internal methods with first character lowercase. All conversion is handled by internal methods of List.

        internal static T[] toArray<T>(FSharpList<T> l)
    {
      T[] res = new T[l.Length];
      List.loop<T>(res, 0, l);
      return res;
    }

    internal static FSharpList<T> ofArray<T>(T[] arr)
    {
      int length = arr.Length;
      FSharpList<T> tail = FSharpList<T>.get_Empty();
      int index = length - 1;
      int num = 0;
      if (num <= index)
      {
        do
        {
          tail = FSharpList<T>.Cons(arr[index], tail);
          --index;
        }
        while (index != num - 1);
      }
      return tail;
    }

Its worth looking at this for a minute and understand how it works. Both conversion to and from Array uses an iteration, when converting to an array its straightforward but the other conversion is a bit complex, with a downward counting while loop and list concatenation.

The same symmetry can be expected for Array.ToSeq and Seq.OfArray and I will show that it is the case here.

Decompiled Array module:

      public static IEnumerable<T> ToSeq<T>(T[] array)
    {
            if ((object) array == null)
                throw new ArgumentNullException("array");
            else
                return SeqModule.OfArray<T>(array);
     }

Decompiled Seq Module:

    public static IEnumerable<T> OfArray<T>(T[] source)
    {
            if ((object) source == null)
                throw new ArgumentNullException("source");
            else
return new SeqModule.OfArray <T>(source));

     }

To finish this post I’ll sum up some important traits for each collection and when and how one would choose one over another.

  •  Seq and Array are better than List for parallelism
  •  Expose Seq to public API, use List and Array only internally
  • Use Array if data is rarely added (or added in larger groups), or if size is known at initialization
  • Since a (Linked) List are easy to add to/remove from use List if many add/inserts will be made
  • Use List in recursive processing with the head::tail pattern
  • Use Seq or Array for large collections since List consumes much more memory
  • Use List if you are holding immutable data and any of the other conditions suggest List
  • Use Seq for large collections but don’t expect to use all of the element, or don’t want all elements in memory at the same time
  • Seq is an abstract type and List and Array are automatically sequences. Seq support lazy evaluation. Therefore Seq can be used by default as the concrete type doesn't matter.




1 comment:

  1. This Was An Amazing ! I Haven't Seen This Type of Blog Ever ! Thankyou For Sharing data science training in Hyderabad

    ReplyDelete