Selecting an appropriate partitioning strategy in the Task Parallel Library - Part I

When you supply data to a parallel loop in the Task Parallel Library (TPL), it is automatically broken into partitions, and these partitions are allocated to tasks. If the loop were instead to pick each item individually, the synchronization required to prevent the same item being picked by multiple tasks would outweigh the gains of parallelism.

The role of a partitioner is to break the data source into partitions. There are a number of ways of doing this partitioning, and the way it is done can have an impact on the performance of the parallel loop. The default partitioner offers good performance, but you can design your own partitioning strategy to suit your needs or your data source.

In this blog I'm going to show you how the default partitioner in the Task Parallel Library works and how you can improve the performance of small loop bodies using chunking.

In part II of this series, I'll explain how and when to create a custom partitioner.

The tasks in the loop are the consumers: they take partitions from the partitioner and process them, taking the items one by one and invoking the loop body.

There are two types of partitioner:

1. Dynamic partitioner: It has no advance knowledge of how many partitions are going to be requested, so it divides up the data dynamically. Parallel loops use dynamic partitioning.

2. Static partitioner: Here you simply break up the data into a fixed number of blocks in advance.
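To see the dynamic behaviour in isolation, you can drive a dynamic partitioner by hand rather than through a parallel loop. The following is a minimal sketch (the sample data is made up for illustration): the object returned by GetDynamicPartitions() hands out items on demand, and each enumerator taken from it acts as one consumer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class DynamicPartitionDemo
{
    static void Main()
    {
        IList<int> source = Enumerable.Range(0, 10).ToList();

        // true = support dynamic partitions (required for parallel loops)
        OrderablePartitioner<int> partitioner = Partitioner.Create(source, true);

        // each enumerator taken from this object acts as one consumer
        IEnumerable<int> partitions = partitioner.GetDynamicPartitions();

        int sum = 0;
        foreach (int item in partitions)
        {
            // a single consumer pulls items on demand until the source is exhausted
            sum += item;
        }
        Console.WriteLine(sum); // 45
    }
}
```

With a single consumer, one enumerator drains the whole source; with several enumerators running on different tasks, the partitioner hands out blocks of items to each on demand.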

A dynamic partitioner provides a block of data to the consumer so that the consumer can start processing. Once the consumer has processed all the items in that block, the partitioner provides the next one, and the process continues until all the blocks are used.

The creation of partitions depends on the number of consumers: for a single consumer there will be one partition, every block will be passed to it, and it will process all the items. For multiple consumers, multiple partitions are created and the blocks are shared among them.

Each consumer receives and processes a portion of the data items.

When the default partitioner is used, it starts by creating small blocks, that is, blocks with very few items, because small blocks are a better choice when there are relatively few data items. The reason is that small blocks can be spread across the available cores (CPUs).

The simple rule is that small blocks work well for small data sets while large blocks work well for large data sets. This is why the default partitioner doubles the size of the blocks once it finds that it has given three blocks to a consumer.

The default partitioner used in the Parallel.For() loop works differently: it breaks the complete data source into ranges, and these ranges can then be processed in parallel. This is very similar to the chunking strategy which I'm going to explain in the next section.

Using Chunking:

By increasing the block size as the number of items processed increases, the default partitioner does its best to give good performance, but sometimes the problem is not in the data source.

For example, if the body of the loop is very small and we have small blocks of data, the overhead of invoking the delegate will outweigh the gains of parallelism.

Consider the first parallel loop in the following listing, which has a very small loop body; it simply calculates the square root of the index value and assigns the result to an array.

public class LoopBodySize
{
    delegate void CalculateValue(int value);

    static double[] resultData = new double[10000000];

    static void Main(string[] args)
    {
        // perform the parallel loop
        Parallel.For(0, resultData.Length, (int index) =>
        {
            // compute the result for the current index
            resultData[index] = Math.Sqrt(index);
        });

        // perform the loop again, but make the delegate explicit
        Parallel.For(0, resultData.Length, delegate(int index)
        {
            resultData[index] = Math.Sqrt((double)index);
        });

        // perform the loop once more, but this time using
        // a declared delegate and action
        CalculateValue pdel = new CalculateValue(CalculateResultValue);
        Action<int> paction = new Action<int>(pdel);
        Parallel.For(0, resultData.Length, paction);

        // wait for input before exiting
        Console.WriteLine("Press enter to finish");
        Console.ReadLine();
    }

    static void CalculateResultValue(int indexValue)
    {
        resultData[indexValue] = Math.Sqrt(indexValue);
    }
}

This example shows that whichever way you opt to invoke the function, the overhead of invoking the delegate is huge in comparison to executing the small body of the loop.

The problem we saw in the last example can be resolved by using the partitioning technique called chunking. Instead of using a Parallel.For() loop that invokes the delegate for each index value, we transform the index range into a series of System.Tuple<int, int> instances, where each Tuple represents a chunk or range of index values. These ranges are then processed by a Parallel.ForEach() loop.

See the following listing:

public class Chunking
{
    static void Main(string[] args)
    {
        // create the results array
        double[] resultData = new double[10000000];

        // create a partitioner that will chunk the data
        Partitioner<Tuple<int, int>> chunkPart = Partitioner.Create(0, resultData.Length, 10000);

        // perform the loop in chunks
        Parallel.ForEach(chunkPart, chunkRange =>
        {
            // iterate through all of the values in the chunk range
            for (int i = chunkRange.Item1; i < chunkRange.Item2; i++)
            {
                resultData[i] = Math.Sqrt(i);
            }
        });

        // wait for input before exiting
        Console.WriteLine("Press enter to finish");
        Console.ReadLine();
    }
}

We can perform chunking using the static Create() method of the Partitioner class. This method takes a start index, an end index, and optionally the range of index values that each chunk should represent.

The Partitioner.Create() Method Versions

Create(int, int) / Create(long, long): Chunk a range of values using the default range size.

Create(int, int, int) / Create(long, long, long): Chunk a range of values using a specified range size.
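As a quick illustration of what these overloads produce (the numbers here are made up), you can enumerate the Tuple<int, int> ranges that a range partitioner generates:

```csharp
using System;
using System.Collections.Concurrent;

class RangeDemo
{
    static void Main()
    {
        // 0..100 in chunks of 30 yields the ranges
        // (0,30) (30,60) (60,90) (90,100)
        OrderablePartitioner<Tuple<int, int>> part = Partitioner.Create(0, 100, 30);

        foreach (Tuple<int, int> range in part.GetDynamicPartitions())
        {
            Console.WriteLine("{0} - {1}", range.Item1, range.Item2);
        }
    }
}
```

Note that the final range is smaller than the others because 100 is not an exact multiple of 30; the partitioner does not pad or drop items.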

When we don't specify the size of each chunk, a default is used; currently it is calculated as follows:

Chunk size = (number of items) / (3 × number of available processor cores)

The default may be calculated differently in future releases, so you shouldn't write code that relies on this value.
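For instance, on a hypothetical 4-core machine, the 10,000,000-item array from the earlier listing would get a default chunk size of roughly:

```csharp
using System;

class ChunkSizeDemo
{
    static void Main()
    {
        int items = 10000000;
        int cores = 4; // hypothetical; in practice the TPL sees Environment.ProcessorCount
        int chunkSize = items / (3 * cores);
        Console.WriteLine(chunkSize); // 833333
    }
}
```

Again, this formula is just the current behaviour; treat the result as an estimate, not a contract.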

The Create() method of the Partitioner class returns a Partitioner object. This object can be passed as an argument to the Parallel.ForEach() method along with a loop body that processes instances of Tuple<int, int>: Item1 of the tuple is the inclusive start index of the range and Item2 is the exclusive end index. So instead of the loop body processing individual items, we pass it a range of values, like this:

// perform the loop in chunks
Parallel.ForEach(chunkPart, chunkRange =>
{
    // iterate through all of the values in the chunk range
    for (int i = chunkRange.Item1; i < chunkRange.Item2; i++)
    {
        resultData[i] = Math.Sqrt(i);
    }
});

So we have broken the 10,000,000 index values into chunks of 10,000, reducing the number of times the delegate is invoked to 1,000.

Ordered Default Partitioning Strategy

The Parallel.ForEach() method uses the default partitioning strategy if you don't specify one, or you can provide an instance of the Partitioner or OrderablePartitioner classes, which live in the System.Collections.Concurrent namespace.

OrderablePartitioner is a subclass of Partitioner. In the last example we used an instance of the Partitioner class when we called Parallel.ForEach().

The ordering of items has no relevance when chunking. Sometimes, though, we need to preserve the order of the items; in that case we can use an OrderablePartitioner instead of a Partitioner.

The following example shows how to use an OrderablePartitioner.

Orderable Partitioner

public class OrderedPartitioner
{
    static void Main(string[] args)
    {
        // create the source data
        IList<string> sourceData = new List<string>() { "It", "is", "better", "to",
            "be", "hated", "for", "what", "you", "are", "than", "to", "be", "loved",
            "for", "what", "you", "are", "not" };

        // create an array to hold the results
        string[] resultData = new string[sourceData.Count];

        // create an orderable partitioner
        OrderablePartitioner<string> op = Partitioner.Create(sourceData);

        // perform the parallel loop
        Parallel.ForEach(op, (string item, ParallelLoopState loopState, long index) =>
        {
            // process the item
            if (item == "hated") item = "disliked";

            // use the index to set the result in the array
            resultData[index] = item;
        });

        // print out the contents of the result array
        for (int i = 0; i < resultData.Length; i++)
        {
            Console.WriteLine("Item {0} is {1}", i, resultData[i]);
        }

        // wait for input before exiting
        Console.WriteLine("Press enter to finish");
        Console.ReadLine();
    }
}

Now, what is the difference? The difference is that the loop body is passed an additional argument: the numeric key (index) of the item being processed. With a plain Partitioner we don't have this information, but with an OrderablePartitioner we do.

Partitioner.Create() Method Versions

Create(IEnumerable<T>): Create an OrderablePartitioner using an IEnumerable<T> as the source.

Create(T[], bool): Create an OrderablePartitioner using an array as the source. The bool argument determines whether the partitioner will support dynamic partitions; this must always be true for use in parallel loops.

Create(IList<T>, bool): Create an OrderablePartitioner using an IList<T> as the source. The bool argument determines whether the partitioner will support dynamic partitions; this must always be true for use in parallel loops.
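For example, an array source would use the Create(T[], bool) overload; the sketch below (with made-up data) shows the index argument being used to write each result back to its original position, regardless of which task processes which item:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ArrayPartitionDemo
{
    static void Main()
    {
        int[] source = { 10, 20, 30, 40 };
        int[] doubled = new int[source.Length];

        // true = support dynamic partitions, required by Parallel.ForEach
        OrderablePartitioner<int> op = Partitioner.Create(source, true);

        Parallel.ForEach(op, (item, loopState, index) =>
        {
            // index is the item's position in the original array
            doubled[index] = item * 2;
        });

        Console.WriteLine(string.Join(",", doubled)); // 20,40,60,80
    }
}
```

Passing false for the bool argument would cause Parallel.ForEach() to throw an InvalidOperationException, since parallel loops require dynamic partitioning.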

So, in summary: although the default partitioner is smart enough to decide an appropriate partition size, chunking can still help you in a number of scenarios.

And if you need to preserve the order of the items, an OrderablePartitioner can help you.
