Archive

Optimization

While experimenting with the nuances of vlookup, I noticed an interesting pattern in the way that Excel conducts searches. The searches begin from the bottom of a list and work upwards sequentially. The byproduct of excel using this method is that vlookups in excel become exponentially slower as the lookup table grows. As I brainstormed for ways to alleviate this issue, the common solution to all sorts of computer science problems turned out to do the trick. My experimenting produced a mish-mash of excel formulas that can use a piece-wise approach to drastically improve the performance of large lookups.

The only requirement for method is to have a sortable data-set and lookup table. The idea is to shrink the lookup area as you exhaust your search so that, with an inversely sorted data set, you continuously help excel find a match in the first few rows it checks, and then eliminate rows that would never contain the value in the first place. A more detailed explanation is below for those who are interested.

The Process:

The first step is to setup the data correctly so that Excel can find values as quickly as possible. Sort the Dataset in ascending order and the Lookup table in descending order. (Make sure the sort area covers the entire table to avoid scrambling the dataset)

Create empty columns for the values you are looking up as well as two extra blank columns.

In the first extra blank column write the following match function to get the row number of the object in the lookup table:

=IFERROR(MATCH(<Dataset value>,<lookup-table value array>, 0),1)
E.g. =IFERROR(MATCH(A2,Sheet1!$A$2:$A$100,0),1)

Once that formula is written, we want to repeat it once every few hundred rows. You can either use a macro of use the nifty auto-fill options in Excel. To use the autofill write the formula in the first cell, select that cell and a few hundred blank cells underneath it, then double click the bottom corner of the selected region to auto-fill the entire column. The last step is to add a 1 to the last row of the column.

Once this is complete, we have a reference column for our dynamic lookup. In the second extra blank column we need to populate the row numbers that appear every few hundred rows into all the blanks. This can be done with an if statement which takes the match value if it exists in the first column, or else takes the value from the cell under it (which should have the next match value):

=IF(<returned match> > 0, <returned match>, <cell directly under>)
E.g. =IF(A2> 0, A2,B3)

You should now have two columns, one with a value every few hundred rows which pulls the matching row number (or a blank if it doesn’t exist), and a full list of row numbers.

Now we can create the vlookup and link it to a dynamic reference that shrinks as we search. The INDIRECT function in excel allows you to create a reference out of a string.The idea is that, if we’re searching for the value “5” in a list of numbers from 1-10, once we get to 5, we can skip searching through items 1-4 and start at 5-10 to find a match.

In the column write a normal vlookup:

=VLOOKUP(<lookup value>, <lookup table>, <column to fetch>, FALSE)
E.g. =VLOOKUP(A2,Sheet1!$A$2:$B$100,2,FALSE)

Now replace the array reference with the dynamic reference:

E.g. = VLOOKUP(A2,INDIRECT(“Sheet1!$A$” & <REFERENCE TO COLUMN OF ROW NUMBERS> & “:$B$100),2,FALSE)

Now as we copy the vlookup, the reference for the lookup table will resize according to the row number in the previously populated column and aid excel in shrinking the area it needs to search to find a value.

This hack increases the speed of excel lookups by 10x-100x! (A speed test with a 500,000 row to 500,000 row lookup improved times from 1+ hours to 1 minute 13 seconds)  The same principles can also be applied to an Index Match carry have the same advantages!

Optional Explanation:

Take for example two tables with values 1-10 for which we want to use a vlookup.

The reason this works is that excel  vlookups conduct their searches with a simple row by row search starting from the bottom and moving up. By sorting the columns in opposite directions, the first object you look up will be the last object in the lookup table which means that the lookup will complete instantly. Excel won’t have to iterate through the entire lookup table to find the value. However, merely sorting the datasets won’t alleviate the problem because no matter what, as we iterate through the lookup there will be values that take the worst-case time to find because they are at the top of the lookup list. This contributes to immense slowdown in the process.

This optimization works by shrinking the table as we do the lookup and making sure that the value that excel is searching for is one of the first ones in the lookup table. This example illustrates my logic:

We have two lists A-Z we want to lookup against. We search for the first value, A, in the lookup table from the bottom up. If we sort the lookup table A-Z then we will have to look through 26 values (Z, Y, X, W, … etc.) until we get to A to find it. If we sort it in inverse order we will get the A match on the very first search. Now we move on to B. Because we know that we’re on B now, we no longer need to check A to see if it matches. Therefore if we shrink the lookup table to Z-B, we will find B on the first try as well. This continues with every iteration and ensures that we always find the value quickly in the lookup table.

Although the usage of INDIRECT in the lookup adds another calculation to the normal lookup, it adds to the number of calculations linearly and reduces total calculations exponentially. The larger the data-set the better the improvement.

One piece of advice to note is that, even after developing this method, I’ve refrained from using vlookups whenever possible. What I’ve discovered in my time handling large datasets and doing analyses is fundamental skills such as query building reduce troubles along the way for every task. As I work with larger and larger datasets, the ability to craft a query and use SQL to manipulate the data as much as possible has dramatically reduced my dependence on excel and increased my ability to work with any dataset.

Advertisements