Category Archives: Posts

Polaris: Providing Context Aware Navigation in Spreadsheets

We know that spreadsheets are error-prone. One of the possible causes for this is the user interface of a spreadsheet. In a spreadsheet, you define a model by entering formulas into cells. In these formulas, you can refer to the value of one or more other cells on the same worksheet, a different worksheet or even a different workbook. As soon as you enter the formula, the spreadsheet will hide the formula and present you with the result. The ease with which cells can be linked and the possibility to immediately see the results, make spreadsheet flexible and powerful. However, the exact same characteristics make it also very difficult to understand the relations between cells.

To make this a bit easier, we developed Polaris. It is an add-in for Microsoft Excel that makes the task of understanding and debugging formulas easier. Let’s illustrate this with a small example.

Formula with references to multiple sheets

Consider the formula in cell C5 .

=(Students!P18 * Parameters!C3 + Certificates!J15 * Parameters!C4) / (‘Country KPIs’!C4 + ‘Country KPIs’!F3) * ‘Country KPIs’!I4

When we inspect this formula we see that the formula contains three different references to cells on the worksheet ‘Country KPIs’. To fully understand the calculation we want to check the values for these references. Therefore, we navigate to the worksheet ‘Country KPIs’.

Context lost when switching sheets

Unfortunately, as soon as we activate this sheet, Excel activates the last selected cell on this sheet (which is not the cell we are interested in). And because it activates this cell, it will also show the formula for this cell in the formula bar, meaning that we lose the formula that we were analyzing. Now we have to remember the formula, the three references on the worksheet ‘Country KPIs’ that we wanted to check and we have to locate these cells manually. Not an easy task!

Polaris

This is where Polaris will help. As soon as you select a cell, it displays its formula at the top of the screen. As soon as you navigate to the sheet ‘Country KPIs’ we still will loose the formula in Excel’s formula bar, but Polaris will keep the formula visible for us at the top of the screen. Furthermore, it will automatically navigate to the first reference in the formula on this worksheet. With either a button or a shortcut key (alt + ]) we can toggle between the three references on this sheet. When we have checked the value we can easily navigate back to original cell C5 on the worksheet ‘Funding’.

Want to know more about Polaris?

We have written a short paper, prepared a little video, and of course, you can download the add-in to test it yourself.

Let us know what you think of this tool. Would it work for you?

BumbleBee: an Excel refactoring Add-In

There is a new version of the BumbleBee that contains new refactorings. Get it here for Office 2010 and up (Alternative 1, Alternative 2).
A spreadsheet with examples where the refactorings can be applied on can be downloaded here.

Two years ago, we presented the first version of the BumbleBee Excel refactoring Add-In.  With this version, you could rewrite formulas into other using a specialized formula transformation language. This version adds refactorings which were not possible to create using the transformation language. A full overview of the refactorings and implementation can be found in this thesis. All refactorings are accessible from the “refactor” context menu (which appears when you right click on a cell) this Add-In adds.

Furthermore this version is now based on XLParser, which means that it can now handle virtually all of your formulas and also works if you reference other sheets or files.

Extract Formula allows you to extract an expressions from one or more cells into other cell(s). Example applications are splitting a formula in two, or to extract a magic number in the whole spreadsheet into a unique cell.

Inline Formula is the opposite of Extract Formula and will replace all references to a cell with the formula that cell contains.

Introduce Cell Name is a enhancement of the Excel “Define Name” functionality. You can give a cell or range a name with this functionality, and all references to exactly that cell or range will be replaced by the new name.

Group References is a small refactoring which helps you clean up formulas. In function where argument order does not matter (such as SUM or COUNT) it will merge references into ranges where possible and order them alphabetically. For example =SUM(F1,F2,A1,A2,A3) will become =SUM(A1:A3,F1:F2) after applying this refactoring.

Change + to SUM(IF) will transform additions into SUMs, for example =A1 + A2 + A3 + A4 will become =SUM(A1:A4) (it actually works for * and & too). Furthermore if everything you sum is in a single column and there is a column which contains a unique value for every row you sum, it will convert a + or SUM into a SUMIF.

Code Smells in Spreadsheet Formulas Revisited on an Industrial Dataset

tl;dr


A few years ago, Felienne  adapted code smells to be applicable on spreadsheet formulas. The smell detection algorithm was validated on a small dataset of industrial spreadsheets by interviewing the users of these spreadsheets. Now a more in depth validation of the algorithm is done by analyzing a new set of spreadsheets of which users indicated whether or not they are smelly.

This new dataset gives us the unique possibility to get more insight in how we can distinguish `bad’ spreadsheets from `good’ spreadsheets. We do that in two ways: For both the smelly and non smelly spreadsheets we 1) have calculated the metrics that detect the smells and 2) have calculated metrics with respect to size, level of coupling. The results show that indeed the metrics for the smells decrease in spreadsheets that are not smelly. With respect to size we found to our surprise that the improved spreadsheets were not smaller, but bigger. With regard to coupling  both datasets are similar. It indicates that it is difficult to use metrics with respect to size or degree of coupling to draw conclusions on the complexity of a spreadsheet.


Introduction

We are probably all familiar with the concept of code smells. A few years ago Felienne adapted code smells to use them in spreadsheets. This year we had the opportunity to apply these code smells on a very interesting industrial dataset. But before I explain to you the uniqueness of this dataset, let’s do a quick review of the code smells. In her paper Felienne introduced the following code smells:

Smells

Multiple Operations: Inspired by the code smell Long Method, this smell indicates the length of the formula. It measures the total number of operations that a formula contains. The figure below shows a formula that is suffering from this smell with a total of 15 unique operations within the formula.

Multiple Operations - fifteen unique operations within one formula
Multiple Operations – fifteen unique operations within one formula

Multiple References: Another well known code smell is Many Parameters. The spreadsheet formula equivalent is Multiple References. It counts the number of ranges a formula is referring to. An example of this smell is a formula with 79 references.

Multiple References - Excel formula with 79 references
Multiple References – Excel formula with 79 references

Conditional Complexity: Many nested conditional operations are considered as a threat to code readability. The same is true for spreadsheet formulas. The Conditional Complexity smell measures the number of conditionals contained by a formula. Below you see a formula with 7 nested IFs functions. This was the maximum number of nested IFs that was allowed up to Excel 2003.

Conditional Complexity - 7 levels of nested IFs
Conditional Complexity – 7 levels of nested IFs

Long Calculation Chain: In spreadsheets, it is common that formulas refer to other formulas. Therefore, one could say that a spreadsheet consists of a collection of calculation chains. Tracing a long calculation chain is considered  a tedious task. The smell is measured by the length of the longest path of cells that need to be referenced when computing the value of the formula.

Duplicated Formulas: The equivalent of the Duplicate Code smell in spreadsheets is the Duplicated Formulas smell. It is measured by the number of identical formulas that are located in the same worksheet and having at least one function or operator. Below in row 39 you can see an example of four identical formulas.

Duplicated Formulas
Duplicated Formulas

However, the study in which these code smells were introduced had two limitations. First of all the dataset was very small. It consisted of only 10 spreadsheets and they were all from the same company. Secondly we didn’t know if any of these spreadsheets were suffering from smells. We want to use smells as an indicator for bad or smelly spreadsheets. But it could very well be the case that the 10 spreadsheets in the dataset were very well designed and nevertheless contained smells. What we were missing was a ground truth about the smelliness.

The dataset

Early 2014 we had the opportunity to work together with one of our industrial partners: F1F9. They are the world largest financial model building firm. They develop financial models in Excel for their customers, often based upon the customer’s existing spreadsheet models. They provided us with a very unique dataset of 54 pairs of spreadsheets1. What makes the dataset unique is the fact that we have a customer spreadsheet and a matching spreadsheet that was created by F1F9 based upon the customer spreadsheet. In other words we have pairs of spreadsheets with and without ‘treatment’. Furthermore it is reasonable to assume that the customer spreadsheets contain some problems. If that was not the case, the customer wouldn’t ask F1F9 to rebuild the model. This dataset thus overcomes the two limitations we mentioned earlier. The dataset is much larger. We now have 108 spreadsheets which are coming from different companies and we know that half of them are smelly. The dataset also enables us to do a pairwise comparison.

Were the smells reduced?

The question that immediately popped into our minds was: Do F1F9 sheets contain fewer smells? To answer that question we analyzed the spreadsheets. And indeed we saw that the occurrence of smells in the F1F9 sheets was much lower. This was true for four of the five smells. The exception was the Long Calculation smell. The occurrence of this smell was slightly higher in the F1F9 sheets than in the customer sheets.

Now what did the model builders of F1F9 do different than the customers? First of all they build financial models for a living. But another important difference is that they use a strict method for building spreadsheets. They use the so called FAST Standard. In this standard we find a set of guidelines that helps us to understand why the F1F9 sheets contain less smells.

First of all FAST strongly advices to create short and easy to understand formulas. Of course this helps to decrease the occurrence of the Multiple Operations and Multiple References smells. Furthermore, FAST discourages the use of the IF functions. This reduces the occurrence of the Conditional Complexity Smell. A third guideline within FAST is that formulas should be consistent along row or column axis, meaning that the formula should be crated once and than dragged to the right of the bottom. Also a calculation should only be made once. Both the latter rules help to reduce the smell of Duplicated Formulas.

But what about the Long Calculation Chains? Why do they occur more in the F1F9 sheets. This is a logical result of reducing the smells Multiple References and Multiple Operations. The reduction of these two smells implies breaking long formulas in shorter parts which inevitable leads to longer calculation chains.

What about size and coupling?

So this settles the story with respect to smells. However it does not end the story for the study. In source code analysis, besides smells, other metrics with respect to size and coupling are used to analyze the quality of the code. Therefore we did the same for the spreadsheets. We defined several metrics for size (non-empty cells, # worksheets, # formulas, # unique formulas, and length of formula) and coupling (# external links per spreadsheet, # interworksheet connections, path depth, and # transitive precedents) and calculated these metrics for both the customer and F1F9 dataset.

Intuitively we expected that the customer sheets would be bigger and have a higher degree of coupling (knowing that more lines of code and a high level of coupling between methods have a negative impact on code quality). To our surprise the results proved otherwise. For four of the five metrics the F1F9 sheets were bigger than the customer sheets. With respect to coupling they scored similar, but the F1F9 sheets had significantly more interworksheet connections per spreadsheet.

So again we analyzed the FAST standard to understand these findings. The standard gives guidelines for organizing a spreadsheet model. It makes a distinction between Foundation, Workings, Presentation and Control sheets. Foundation sheets contain all the input values and assumptions of the model. All calculations are done within the Workings sheets and the results of the calculations are presented in the Presentation sheets. The control sheet is mainly used during the development of the model. It normally contains list of pending changes, version control, table of contents, etc. However this guideline increases the number of worksheets and make the spreadsheet bigger. Furthermore according to the FAST standard a consistent column structure should be maintained across all sheets. Within financial models it is not uncommon to work with a time horizon of 30 years, leading in all worksheets to at least 30 columns. And because all formulas should be consistent along rows and columns all these formulas will be repeated. These rules increase the number of empty cells and the number of formulas.

Finally FAST advices to execute all calculations in so called calculation blocks. A calculation block consists of all the ingredients that are necessary for a calculation. Ingredients can also be used in other calculations. In such a case the ingredients are repeated to form a new calculation block. This both increases the number of non-empty cells and interworksheet connections.

Conclusion

So what did we learn from this study:

  • Smells are a good indicator for identifying problematic spreadsheets.
  • By applying the guidelines of the FAST standard users decrease the occurrence of smells (even without knowing the definition of the smells itself)
  • Metrics with respect to size and coupling do not succeed in differentiating between smelly and non smelly spreadsheets.

We presented the results of this study at ICSME ’15 in Bremen on September 29th. A preprint of the paper can be found here.

The slides we used during the presentation are shown below:

Do you have any questions, remarks, insights, etc.? Feel free to share them with us. You can use the contact form below to leave us a comment.


1To protect the confidentiality of the models we only had access to the dataset on F1F9s premises and then only indirectly whereby our software automatically generated and stored only the necessary survey statistics. At no point did we have direct access to the models, nor did our software extract any commercial data from the models.

 

The spreadsheet labeling game: an implementation story

The research target is ambitious: We want to explore the meaning of spreadsheet cells. What is 4.532,23 in cell C12? Is it Sales? January Sales? January Sales of Cheese sandwiches? Which cells contain these labels? We need data, lots of it, to train and validate our algorithms. We have thousands of spreadsheets (from the EUSES and the Enron corpus). How can we enable users to help us assign meaning to arbitrary cells? We need some interface for them to match cells to their labels.

This is how the story of the spreadsheet labeling game began. It would be a webpage, showing a worksheet randomly picked from the corpus, with a randomly picked selected cell, asking users which cells describe or label it, letting them select some cells, and saving the answers. For decoration, cute chains of smiley faces (for marrier users) and usage counters (for the more competitive ones).

The toolbox for the implementation was ready: Javascript + jQuery for the client, Bootstrap for styling, a simple backend server, text files for the output. Pretty standard, right?

But how do we show the worksheets to the user? And how do we collect the users’ cell selections?

The first question was easy to answer: Google Sheets. Upload the all spreadsheets to Google Sheets and use their well-documented API to handle everything. I was one happy programmer, until I searched for an answer to the second question. Apparently, in the Google Sheets events there is Open, Change, Edit, Form submit, and that’s it. No cell click, no cell selection change, no way to collect which cell the user is selecting in the spreadsheet. We had to find another way.

We then briefly considered using the Handsontable data grid component. Instead of embedding spreadsheets to webpages, worksheets would be transformed to data grids with Excel-like appearance. The cell click collection requirement would be covered using Handsontable’s afterSelection events. However, we preferred to show the actual worksheet to the user, without converting it to any other format. This would be our backup solution, which we did not choose in the end.

After lots of searching, it was Microsoft’s OneDrive and Excel Web App that we resorted to. Surprisingly, among the very basic functionality that the Excel Services JavaScript API offers, an activeCellChanged event handler is included. That was it! We loaded a spreadsheet in an OneDrive folder, retrieved its token (“SD72E74B1ABFC5464!1390/517479313637659748/t=0&s=0&v=!AKNivqAfTbPL7nU”) from the Javascript produced when selecting “Embed”, and then started implementing the opening, handling user selections and saving the results.

But how do we retrieve the tokens that the Excel Web App requires for the thousands of spreadsheets in our corpus?

We would have to use the OneDrive API. To use it, one needs to create an application, register it with Microsoft for the OneDrive API and get an access token. Logical, but too much when one just needs the list of tokens for the files in a specific OneDrive folder. Luckily, we discovered that we could exploit the SDK provided for trying out the OneDrive API to get just what we needed. Problem solved!

And then we hit a different wall. The Excel Services JavaScript API is not offering something that we took for granted: It does not support setting the style of the cells. For the spreadsheet labelling game this was absolutely essential. The randomly-picked question cell should be clearly marked, and so should be the cells that the user selects as labels.

So, how do we change the style of cells when the API does not support it?

That took a bit of MacGyverism: we make Excel do it through conditional formatting rules. Before uploading the Excel files to OneDrive, a hidden worksheet is added to each of them, along with a conditional formatting rule “=NOT(ISERROR(FIND(SUBSTITUTE(TEXT(ADDRESS(ROW(),COLUMN()), “”)&”,”, “$”,””),hidden!$A$1)))”, applied to all non-hidden workbook cells, which marks Orange the cell whose address is written in cell hidden!A1, and a similar rule that marks Green the cells written in hidden!A2. This way, when a user selects a cell, we use the Excel Services JavaScript API to write its address to cell hidden!A2, and the cell automatically turns Green. Funny how Excel ended up solving this problem too!

It was a tricky implementation with many ups and downs, but in the end we were happy that the website delivered rich results –up to now, more than 40.000 cells were labeled with more than 95.000 labels.

Want to know more? Check the spreadsheet labeling game out!

New Paper: Business Rule Mining from Spreadsheets

In this post I talk about my new paper, which is going to be published in the proceedings of the 2nd Workshop on Software Engineering Methods in Spreadsheets (SEMS ’15) in Florence, Italy. It is a position paper describing our aims to automatically extract business rules from spreadsheets. [Pre-print in Arxiv.org is here]

What are Business Rules?
In her book Business Rules Applied: Building Better Systems Using the Business Rule Approach, Barbara von Halle writes that a business rule is a “statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behavior of the business”. Thus the term business rule is somewhat an over-encompassing term used to imply all sorts of assertions that govern the operations of a business. A purist’s definition of business rule has a specific syntax associated with it and fixed method of framing the rules, however the essence remains same i.e. such rules are all the assertions that unambiguously states the actions or operations that are necessary to keep a business running. Therefore business rules are what that contain key business knowledge of an organization or a domain.
What we know about businesses and organizations?
We know that a vast majority of businesses and organizations use spreadsheets for their operations. Therefore it is needless to say that spreadsheets are gold mines of crucial business knowledge with plethora of business rules implemented inside them. Inside spreadsheets they are usually hidden behind cryptic formulas and the heavy presence of numerical data make them difficult to be discerned.
What we can do about it?
By extracting the business rules from spreadsheets we can reap the business knowledge embedded inside them and make it available in a better readable format. And we intend to do that in an automated way with a software tool.
What are the potential benefits of doing that?
1) High Level Analysis of Spreadsheets – Extracting business rules enables generation of documentation for spreadsheets at a higher abstraction level than the spreadsheets themselves. This facilitates the following:
a) Comprehension – It becomes easier for end-users, who are typically not programmers, to understand the structure and operation of large and complex spreadsheets helping them efficiently work with or modify such spreadsheets with reduced errors and mistakes.
b) Comparison – Comparing spreadsheets becomes possible in order to estimate whether they implement same or similar functionalities, or even are identical behavior-wise only differing in data values. The latter cannot be done for example by an application that compares spreadsheets in data and formula level.
c) Validation – Organizations using set of well-formed and pre-laid business rules can validate whether the spreadsheets created by their employees accurately implement those rules or if there are errors in the logical level.
2) Understanding of Organizational Business Rationale – Some organization may not have their business strategies well laid out in business rule format; yet vital business knowledge of experts working in the company is hidden in spreadsheets. Extracting this knowledge would help to form a clear picture of how that organization works and its structure.
3) Support for Migration – IT architects need to understand the business logic when migrating functionalities and computations implemented in spreadsheets into conventional software. Furthermore business analysts need to ensure that the IT architects understood it correctly. This can be achieved through knowledge extraction and an automated process would largely help in this regard.
4) Safe Re-use and Replication of Spreadsheets – Often spreadsheets are created on ad-hoc basis by experts in an organization to implement their unique strategies for certain scenarios. Over time such spreadsheets grow in size and complexity and are used by several employees for similar scenarios but with different data sets. Invariably the users are forced to employ the method of copy-paste to replicate the original spreadsheet and customize it according to their needs by manipulating data and formula. However this process has been found to be extremely error-prone. It is probably safer to re-generate spreadsheets from scratch using the blueprint or structure of the original spreadsheet instead of copy-pasting. Automated business rule extraction can facilitate such blueprint generation and thus make replications of spreadsheets safer.

What goal we have set?
We want to devise an algorithm and subsequently an application that can automatically extract business rules from spreadsheets. Once that is done, we want to find answers to the two following research questions:

RQ1: How accurate the automatically extracted business rules will be as compared to those extracted manually by domain experts and spreadsheet users?
RQ2: How efficient is the automatic extraction process compared to manually extracting business rules from spreadsheets?

We aim to do user-studies and controlled experiments in which we will compare the results of automatic and manual extraction of business rules from spreadsheets.
What a typical problem will look like?

Capture3

Typical spreadsheets implement business rules to calculate results. For example in the above figure the cell E19 contains the formula SUM(E13:E18). From this formula our algorithm has to infer the business rule “Total earned revenue = Admissions+…+Other earned revenue”.
Mapping E13:E18 to Admissions…Other earned revenue is straightforward. However there is more to determine as the Total Earned Revenue is divided into columns for Last Year, Current Year, etc. Thus the mapping becomes two dimensional.
Furthermore a parser will reach three blank rows and an auxiliary header row (actuals, budget, etc.) before it reaches the “Year” column header row.
Making things even more challenging, the whole structure is repeated into vertical blocks viz. Earned Revenue, Private Sector Revenue.
When mapping the rule “Total private sector revenue=…” the parser will encounter formulas in the 19th row instead of reaching the column headers! Thus, same formula repeated both vertically (in blocks) and horizontally (in year columns), yet being distinct semantically, is a considerable challenge.
What are the principal difficulties?
Spreadsheets do not impose fixed structural uniformity with regards to layout. This is an inherent flexibility of spreadsheets which at the same time is an advantage and also a disadvantage. This makes the mapping between data and labels vary to a large extent from spreadsheet to spreadsheet. Towards extracting business rules, this will be a very crucial challenge to overcome.

Aragorn User Study

In my last blog post I talked about the Aragorn tool. I am happy to announce that since that post which was quite some time back, Aragorn has evolved quite a bit. It’s much more robust now and has lesser bugs. It has also gone through some functional changes!

 

Aragorn is currently available as Excel 2010 add-in.

 

Results of user-study for evaluation of Aragorn

  1.  Category A QuestionsResponses_All_141124
  2. Category B Questions:
    1. Scenario 1: Assume it is Friday afternoon and you are eagerly waiting for the week-end to start. Your own assigned tasks are pretty much done and so you are quite relaxed. However your team-lead/manager calls you up and gives you an urgent task. One of your colleagues who had been working on a model workbook is absent due to sickness. Your lead was working on it but suddenly has been called up for an urgent management level meeting so she requests you to finish the remaining modifications on the workbook as things can’t wait. So now you are left with a workbook which probably you have never laid your eyes upon and have to make changes in the span of a few hours.
    2. Scenario 2: Assume you are running a heavy schedule on a day full of meetings. You receive a call from the client side requesting that the model workbooks that were to be sent, be sent a week earlier; as due to forecasted market dynamics they really need it earlier than expected. You hurry to check work status from your team-members and find that work is almost ready but some errors have come up in testing. In order to hurry up the whole process you join in the activity of removing the errors from these workbooks which your junior team-members have created.      Results: Responses_All_141122
  1. Windows 7 installer for Aragorn: Download (Read warnings before trying)

A quick summary of what Aragorn provides:

  • It gives you instantaneous warning about dependents of cells you are about modify; even when they are situated in different worksheets.
  • It requires one-time processing of your Excel workbook and for workbooks of size of the order of 12 – 14MB the time required for processing is about half a minute.

Please go through our original post on Aragorn for a better picture about it.



Warning & Disclaimer

Aragorn is a research initiative and is in experimental stage. Please make sure you have adequate back-up copies of your workbooks on which you want to try it out. Trying Aragorn on data or workbook of critical importance to your work is not recommended at present.

Once you activate Aragorn on a workbook, any changes you made before closing will be by default saved even if you do not save it pro-actively.

Also note that if you make some changes, you need to pro-actively save the workbook in order to see the reflection of the changes in the Arasense behavior in the same session. When you save, Arasense will re-process your workbook in order to consider the new changes.

When AraSense is in turned ON state, the ‘undo’ feature does not work. If you turn it OFF it works.

Presenting Aragorn: A Floating Prompt System for Spreadsheets and the AraSense Dependence Tracking Feature

aragorn

So what (who) is Aragorn?

We all know who Aragorn is! The mighty brave king of Men, Isildur’s heir, the skilled warrior, protector of the Fellowship of the Ring and the hobbits. Unfortunately we are not talking about him here. We are in the process of developing an Excel add-in tool, which quite similar to the real Aragorn, will help you protect yourself from orcs, goblins, and such nasty creatures (mistakes, errors, pitfalls) while you work on your spreadsheets.

What made us develop Aragorn?

We were trying to have a closer look at spreadsheet dependence tracing techniques. In my blog post I discussed some of our findings from a review of the research on this topic. We found that there was still enough need for good and reliable dependence tracing techniques which can be adequately and effectively adopted by the industry in a standardized and popular way. In 2013, in her dissertationDr. Felienne Hermans wrote about some responses from spreadsheet users –

Quote

We were particularly motivated by a scenario where you, as a spreadsheet user, have the task of coming up with some calculation or financial estimate after some data crunching in a fairly large complex spreadsheet which you have not created yourself. Imagine, to obtain the required results you need to change some parameter values, modify some cells, move data around here and there. However not knowing well the full design structure of the spreadsheet you definitely do not want your changes to affect the whole model so that the results you obtain turn out to be wrong! On the other hand perhaps it´s not even worth it for you to spend days in trying to understand the spreadsheet model as it may be for a one-off use in your case. This is exactly the type of scenario where a quick, real-time prompting system displaying cell dependency information can make things a lot easier. While this was our primary motivation, we also wanted to make sure that the tool was easy and lucrative enough for you. Therefore we also aimed to ensure that the following key features were there in our tool:

  1. Effortless: We did not want to burden you with additional effort of multiple mouse clicks or peering into the screen trying to distinguish between overlapping blue arrows which all seem to point towards the same cell or no-where.
  2. Real-time: We wanted our system to be a process that is part of using the spreadsheet itself and not a separate activity to be done at a different time.
  3. Instantaneous: We wanted our system to be quick and momentary so that you do not have to wait for the information.
  4. Specific: We did not want to burden you with more information than you really need; it’s not always necessary or worthwhile to understand the whole structure of a spreadsheet when all you need to know is whether a change in one or two cells are going to affect other cells.

So what have we come up with?

The core of Aragorn is the floating prompt system. With these localized text-boxes very close to active cells, we can display crucial information about the selected cell in a transient manner.

5

In addition to this we have developed the feature called AraSense particularly to cater the need of dependence tracking especially for scenarios like I described above. It is our answer to the questions raised by the users.

We all know how the real Aragorn is also an expert ranger with superior tracking skills. He could track the trail of hobbits when they almost gave up hope of finding them alive!

 

download
Aragorn tracking hobbits’ trail.

Similarly the AraSense feature is the tracking power of the Aragorn tool. It can pre-sense dependencies and thus its called AraSENSE. It can do a one-time analysis of an opened Excel workbook, after which it will display a list of dependent cells in the floating prompts for every cell you click on. If you are not trying to edit the cell, normally the prompt will disappear automatically after a delay. If you are trying to edit a cell, the prompt will persist next to it as long as you are editing the cell. Even if the dependent cells are in a different worksheet, they will be listed in the prompt. You can turn the feature on or off based on your preference. No warning is shown for cells that do not have any dependents. Thus when the feature is turned on, if there is no prompt for a cell, then you can safely modify its contents without worrying. And if there is a prompt then you know which are the dependent cells and on which sheet they are. Thus you can be cautious and forewarned about the changes you make. The AraSense feature is effortless, real-time, instantaneous and specific. It fulfills all the aims we had about the system we wanted to give you.

So how it works?

 

So what we would like from you?

We would like to hear your thoughts on Aragorn and AraSense.

If you are a professional who uses spreadsheet at work, then do you think this tool will be helpful for you?

If you are not a spreadsheet user, but interested about end-user computing and works with usability aspects then do you think our tool addresses some key usability aspects?

Whoever you are, if you have read this and found it worthwhile of your time, we would like to thank you and would really like your thoughts and opinions. Encouragements are obviously appreciated but we would also eagerly welcome criticisms and suggestions for improvements.

New paper: Reviewing Research on Spreadsheet Dependence Tracing Techniques

 

Hi all. In my blog post Dependence Tracing: Need for an Investigation I talked about what is dependence tracing in spreadsheets and what is the necessity to review existing techniques to do the same. We did a study of the research that has been done on the topic and I am happy to inform you that the paper documenting the study and its conclusions is going to be published in the proceedings of SEMS’14. In this post I briefly highlight some key aspects of our findings. I recommend my previous post  Dependence Tracing: Need for an Investigation for those interested to have a little background information on the problem of dependence tracing.

We were interested particularly to understand the state of the art of research on dependence tracing. Our principal motivations arose from feedbacks we had obtained in previous user studies and surveys where there was a clear dissatisfaction among the industrial spreadsheet users about available dependence tracing tools and enhancements. To put very frankly it is an unanimously accepted fact that MS Excel is by far the most popular spreadsheet application used in the industry today. The feature that is available for dependence tracing in-built with MS Excel is the blue tracing arrows that points to precedent or dependent cells. This technique evidently is not the most convenient one for all types of scenarios that arise for an end user when she is creating or modifying spreadsheets. It has some known issues like:

  • wastage of effort and time (one mouse click too many),
  • ambiguity (arrows pointing to cells of the same row end up superimposing on each other creating confusion)
  • interference (it does indicate the dependents but it also hampers the user’s visual clarity by creating cluttered arrows all across the worksheet) that makes it inconvenient to both work on the sheet and get idea about dependence structure simultaneously.

TracingDependents

 

Therefore it was our hypothesis under the paper’s context that better dependence techniques or tools developed under scientific research was not adequately making it across to the industry to benefit real-life spreadsheet users. Based on this hypothesis we identified three research questions:

  1. Why the research ideas are not getting across to the industry as popular and user-friendly implementations?
  2. Is there any standardized metrics framework with which such implementations as above can be compared against each other in their effectiveness to real-life spreadsheet users?
  3. Is there any clear opportunity of improvement in the dependence tracing context?

In order to find answers for these questions, our first step was a review of the existing research literature on this topic and this preliminary review is practically what makes up this paper.

We shortlisted five published research papers for our study, each having description and details of some innovative idea for dependence tracing. Our approach was to characterize these innovations based on the following aspects:

  • The basic technique/principle/strategy
  • Tools or prototypes developed if any
  • Comments or details available on testing, performance, and limitations of the tool
  • Current status of the research and its implementation, and its perceived relevance or influence in the industrial scene
Paper Published In Technique Prototype Prototype Comments Industrial Relevance
  1. 1998
Fluid Visualization & Animation Pad++ (now defunct) and Python for UNIX Performance drastically worsens after 400 cells.No user study. No implementation plan given. Some ideological influence can be observed in MS Excel 2007 onwards.
  1. 1999
3D Interactive lift-up operation OpenGL APIs and SLSC for UNIX None available.  None.
    3.  2000 Visual indications superimposed on Excel sheets like arrows, highlights, shading. VBA plug-in for Excel. None available. No standardized definite influence. However similar techniques are used in ad hoc basis by expert spreadsheet users on individually or internally to organizations.
  1. 2003
Different types of graphs and trees depicting spreadsheet dependency structure. Java tool for Excel workbooks. Successfully tested on a corpus of 259 Excel workbooks of undisclosed source.No user acceptance or comprehension study. Authors apparently did not carry forward their work on this project.
  1. 2011
Leveled dataflow diagrams C# toolkit that accepts Excel workbooks for input. Extensive user testing with 27 industrial spreadsheet users in a financial asset management company. Is currently a part of service offered by the company Infotron.nl

 

Thus, from this review we observe that:

  1. All 5 papers provide innovative ideas to simplify dependence tracing with visualization aids.

=>  There is no lack of ideas in this context.

  1. All 5 papers provide either tools or their prototypes to demonstrate the techniques.

=> The ideas are quite conceivable.

  1. 2 out of 5 prototypes are based on defunct or obsolete technology. They have not been propagated to newer current technologies. One that is based on VBA (made in 2000) is supposed to work for all Excel versions but is actually subject to be tested for version incompatibility. Only 2 out of 5 are made with Java or C# and can be considered as modern enough to actually be working at present.

=> There is lack of recent work done based on current state of technology that can be used at present in the industry.

  1. 3 out of 5 were properly tested. One showed very poor performance as 400 cells is nothing compared to the number of cells that is found in spreadsheets used in the industry! No solution or approach was provided about how to scale this up.

=> All ideas that are conceived need to be tested for performance. If that is not done, then it is evident why it will not find itself across to the industry and actually help real-life spreadsheet users.

  1. Only 1 out of 5 was tested for useability and user-acceptance!

=> Innovative ideas are all exciting, but they are meaningless if actual users don’t find them helpful. Without user studies the effectiveness of ideas or research can not be established.

  1. Only 1 out of 5 actually ended up having industrial exposure. All the rest were practically aborted or left off by their creators.

=> There must be serious issues in making these tools actually popular to the real life spreadsheet users; or else the researchers working in the domain are not properly motivated to do anything about the problems.

 

We can now revisit our research questions taking into account the above observations:

  1. Why the research ideas are not getting across to the industry as popular and user-friendly implementations?

Although still not quite clear as to what is the reason behind this, one reason seems to be just plain lack of motivation on part of researchers. Otherwise why not follow up on your innovation and see to it that it gets implemented. A reason behind this reason can also be the pressure in the academia which encourages writing tons of papers on new ideas but does not give equal limelight to follow-ups and industrial implementations!? This may cause researchers to abandon their projects at the conception stage and move on to newer projects.

Another reason seems to be difficulty in performance scaling. Spreadsheets are tricky objects with sizes varying from small to very large. The range of sizes is quite phenomenal. This might be a cause due to which innovative ideas are found difficult to be put into real practical use? This reason is indicated by the fact that most of the projects we covered in the review were not extensively tested.

  1. Is there any standardized metrics framework with which such implementations as above can be compared against each other in their effectiveness to real-life spreadsheet users?

Evidently there is none. And no wonder why only one of the projects was tested extensively with a group of real-life spreadsheet users. This is a very important necessity. To come up with a set of standards and metrics that can define useability or user-friendliness of spreadsheet aid applications. In that way it will be easy to establish the true worth of innovative techniques as laid against backdrop of effectiveness to real-life users.

  1. Is there any clear opportunity of improvement in the dependence tracing context?

Based on our review and the observations obtained, it is doubtless that this question is still a very much wide open one. There can be enough opportunity found to research and come up with new tools that actually make a difference in the spreadsheet industry in form of reliable and user-friendly implementations. And not only ones that are created in ad hoc basis for internal use in organizations but robust standardized ones that can be widely adopted through out the industry.

Spreadsheets for financial modeling, the FAST standard and F1F9

I had very recently visited the offices of a financial modeling company in Delhi. It is a British financial modeling company with their primary modeling team based out of Delhi, India. In this post I would be summarizing some of my findings during my visit. It should grant a perspective of the state of the art on use of spreadsheets for financial modeling purposes.

  • What’s the FAST standard?

The FAST acronym stands for Flexible, Appropriate, Structured and Transparent. It is essentially a standard for financial modelling.

  • Wait, what is it again?

Okay, just like in Java language we have several hoards of rules, guidelines and conventions to follow when we are coding, the FAST standard defines rules and conventions to follow while doing financial modeling. Financial modeling has been traditionally done with and in spreadsheets and thus the FAST standard basically set guidelines and best practices while working with spreadsheets with the purpose of various types of financial modeling. Furthermore as traditionally the most popular and widely used spreadsheet in modern times has been MS Excel, the FAST standard, although universal in intent, has become inevitably correlated with the Excel technology; as in it tries to work around the weaknesses and exploit the strengths of the Excel environment. Nevertheless the guidelines themselves are not Excel specific and it does not for example has Excel keyboard shortcuts or macro based shortcuts as part of it. However the recommendation to use shortcuts instead of mouse is certainly a part of it. So how well or whether FAST standard can be implemented on other environments than excel is an exploratory question.

  • Where can one find it?

The FAST standard is an open standard and is maintained by an universal group at http://www.fast-standard.org/

  • What is F1F9? (Apart from being a weird word, more of a cryptic code rather!)

F1F9 is a financial modeling company originating in UK with a considerable global presence across various regions of the world. Its their office that I had visited.

  • What with the weird name?

F1 and F9 are two keyboard shortcuts in MS Excel: Help & Calculate…. that is the company’s business, to help people calculate. Thus the naming F1F9. (Cheeky eh?)

  • F1F9 and FAST??

The FAST standard was conceived by John Richter and Morten Siersted around a decade back when they were both working in F1F9. It was later released as an open universal standard and has been ever since contributed to by many modelers from across the world. F1F9 has forever pioneered the cause of FAST and still believes in and follows it for its own modeling activities.

So there I was in the Delhi office of F1F9. Much of our work in the Spreadsheet Lab revolves around the fundamental research question of how to apply software engineering methodologies and good practices to the domain of spreadsheet programming to improve it and to resolve known and unknown issues; where the “programmers” are not really programmers but are all types of managers and business executives.

The idea was to learn how F1F9 is using spreadsheet programming to do their work of financial modeling; to identify areas where software engineering methods can be implemented; to perceive issues they might be facing with spreadsheets which they have accepted as part of life but which perhaps we at Spreadsheet Lab can solve.

So it was the FAST standard that I learned as the very first thing with a practical example of how it was being utilized to do a real life financial modeling.

**************************************

Very soon I observed some obvious similarities between how financial modeling is being done in F1F9 and several widely used coding conventions and software engineering best practices. I compare the similarities based on two types, type one being similarities largely due to the FAST standard which is but analogous to best practices and conventions in coding; type two being similarities in the whole spreadsheet engineering processes and conventional software engineering processes and thus not being limited just to the domain of coding.

Note that these are only the similarities I have already identified based only on my current knowledge level of financial modeling, FAST standard and F1F9’s modeling style and therefore this list can surely grow in the future as I get to know more about these things.

What similarities can be identified with software engineering context:

Similarities due to the FAST standard: The FAST standard has several best practices and guidelines recommended in it to follow while doing financial modeling. It is similar to various standards and coding conventions that we follow in the software engineering world to come up with more mature and robust software. Certain parallels that are readily identifiable are the following: (Currently based only on the aspects of the FAST standard I have already learned)

  • Variable Naming Conventions:

In most programming paradigm or language we have well-defined variable naming conventions. Following these conventions are not required by the language itself or the compiler; the programs can function correctly even if they are not followed. However these conventions are religiously followed in order to make code more readable. To make programmers easily understand purpose of a piece of code; so that even someone who has not written the code can understand it. Its purpose is to provide some sort of meta data to the programmer. Something similar has been recommended in the FAST standard. It is not really naming convention but it serves the same purpose; it expresses information about all the variables and data that is used in the spreadsheet.

How it is done?

Unlike naming conventions and using upper case, lower case, camel case and so on in FAST it is done using different types of cell highlight colors, and relative position (columns in Excel) of variables, constants, series, calculations etc. in the spreadsheet, and maintaining this positions uniformly across the spreadsheets. Just by looking at which column the particular data is located or the cell highlight color a modeler can understand the nature and purpose of that data.

  • Addition of Comments

Like commenting is one of the most essential best practices for coding of programs, the same is also in use under the FAST standard. Cell comments are used to demarcate rows or areas of the spreadsheet which is in some way having some ambiguities or is unusual.

  • Use of CONSTANT declaration

In programming logic it is often considered good practice to not use constants directly inside operation expressions as hard coded values. Instead it is recommended to use certain variables for this purpose and fixing their values by declaring them as constants. In various programming languages this is done in various ways.

Similarly the FAST standard recommends not using embedded hardcoded values inside formulas; rather it is suggested to input constant values in certain cells and use those cell references after proper anchoring inside of formulas used across the spreadsheets. Like most programming language naming conventions identify such constant entities from their UPPER CASE names here it is identified from their cell highlight color and column location.

  • Driver and stub modules

In coding practice and in software engineering we have the culture of using driver and stub modules which are practically dummy modules that have been not coded yet, having some hardcoded values or values that are yet to be calculated in the software or program itself. These modules then work in conjunction with the modules of the program that have been actually completed in order to test or demonstrate the overall functionality of the program. Subsequently they are also coded and completed at a later stage. Similar thing is done in financial modeling in the FAST standard.

How?

Use of placeholders are serving the same purpose like drivers or stubs. Placeholders are those values which are not originally supplied and are temporarily taken as assumptions. Based on the assumptions the rest of the model is completed and placeholders can then eventually be replaced with real values obtained at a later stage. Using placeholders enable the modelers to continue with modeling work and not wait for some values that have been not yet supplied.

  • Coding line length limit

In coding, many standards, guidelines and editors recommend or enforce something called the coding line length cap. This limits the coders from writing a line of code which more than 100 or 120 characters long because after this length readability of the code is seriously messed up. Similar thing is recommended in the FAST standard namely the “Thumb Rule” for the cell formulas. This rule states that any formula should not be so complicated or long that it is more than the length of your thumb. When requirement is there for a longer formula surely in some way the task can be divided and done by two or more shorter formulas.

  • Call by Reference and Call by Value

In programming we clearly distinguish between call by reference and call by value in context of method calls. Methods are smaller modules of a single program where some tasks or calculations are done. When methods are invoked, data is passed to them by the main program through inputs. These inputs can be variable references or pointers which means when the method will work with them the original variables  may become modified. The inputs can also be temporary variables with values copied from the original main program variable, in which case the methods can return new value after processing but cannot directly modify the original main program variables. Both has its own merits but normally it is considered safer to do call by value i.e. using copied values than original variables themselves.

Similarly in financial modeling, under the FAST standard, it is encouraged to use range names and not direct cell references when working with them in VBA code inside Macros. Direct cell reference can become altered when rows or columns are added to the sheet; range names however stick to the data in the cell and shifts along with the cell if pushed by new rows or columns. Therefore using range names inside Macros ensure that they always get the correct data passed to them or the code need not be repeatedly changed whenever rows or columns are added.

  • Use of checker tools:

Like in software engineering various types of checker tools and inbuilt checks in editors are used to ensure that best practices and conventions are maintained in coding, similarly automated checks for FAST standard compliance of the models is being done using various types of software checker tools.

Similarities due to model development cycle or process:

  • Configuration Control:

In software engineering build maintenance and configuration control play a very important part. Although there these operations are large and elaborate, here in F1F9 an elementary form of it is observed in the way version is controlled and back up is archived of the Excel workbooks containing the models in a manual fashion.

********************************

My first point of entry into the world of research work in spreadsheet use/engineering was the work of Felienne Hermans of the Software Engineering Research Group at TU Delft. From my primary explorations of her work I arrived at the understanding of how vulnerable the spreadsheet programming paradigm is to horrendous levels of mistakes, errors and goof ups.  Not even sure whether it can be called “programming” at all, because its mere use of spreadsheets governed by no rules or structures and moreover the “users” don’t know sh#% about programming. Yet the volume of this usage is humongous and so is the horror stories related to spreadsheet errors. Companies have lost billions of dollars, academics have lost reputation and credibility, CEO s have lost their jobs just due to spreadsheet errors. The Dirty Dozen book by F1F9 has quite a few examples of cases like this. This brings out how important it is to do something about it; to make some changes and bring some improvements so that the heavy dependence of the industry on spreadsheets can be sustained sans the inevitable hazardous risks related to spreadsheets use. The question that arises immediately is what is the reason behind such a risk?

 

Why are there so many problems?

The formalized answer to this question is perhaps beyond the scope of this article but in a simple way from what I understood, it all has to do with not following rules or good practices. From what I observe is that companies are functioning in three areas when it comes to working with spreadsheets.

Venn1

By “rules” in the above diagram I mean rules, best practices, techniques and process flows. In other words we have to understand between spreadsheet use and spreadsheet engineering. Spreadsheets can be used in the way artisans make pottery or handicraft or a genius artist paints a masterpiece (though  meaning of which people find difficult to understand). On the other hand spreadsheets can be used in the way automobile is mass manufactured in the assembly line and rolled out on the streets. Anybody who knows to drive a Toyota will also be able to drive a Merc, BMW, Jeep, Renault, Ford or whatever make it is of the car. It is important to distinguish between engineering and handicraft or artistry. If I try to extend the Venn diagram to the analogy of cars it would be something like this…

Venn2

A weird car design!
Figure A

I hope I have been able to send the point across by now. I hope now it is a bit easier to understand as Why there will be so many accidents in the road? Or Why the spreadsheet use today is vulnerable to erroneous effects? Majority of companies using spreadsheets today are operating in the reddish shaded part of the Venn diagrams in regards to how they use spreadsheets. Only a minority is operating in the bluish shaded portion; F1F9 and any other organization following the FAST standard being inside that area.

So what is the solution?

Again the formalized answer to such a question may be one or more PhD theses themselves and hence out of scope of this article. However to approach in a simple way now I will move away from the automobile industry analogy to the domain of medical industry. What is disease and health? A disease is when a malfunction has already victimized a human body and must be treated or cured. On the other hand health is the human body’s inherent capability to resist against potential diseases n’est-ce pas? The healthier a person is the less prone he or she is to various types of diseases. Increasing or maintaining health is therefore a sort of prevention from possibilities of disease.

The issue of spreadsheets can be approached in the manner analogical to above. Spreadsheet masses that are already existing in the world need to be cured of problems, got rid of issues, re-worked, re-engineered, or simply used with additional augmentations like visualizations, analyzers or such tools just like wheel chairs, artificial legs, prosthetics and so on! And on the other hand efforts must be applied in order to create healthier spreadsheets (luckily, unlike human beings where we have no control over the genetic prototype as of yet, here we have control of their design); to ensure that spreadsheets are created following set of logical, reproducible, feasible set of rules, carefully planned processes and under the guidance of a set of best practices.

So what’s happening in F1F9?

From my stint of study in the F1F9 office at Delhi I have reached a conclusion. A conclusion that summarizes in brief whatever is going on there based on what all I learnt. What is happening is just that F1F9 is consistently and meticulously producing healthy babies. Oops sorry, what I meant is healthy spreadsheets! They have chosen the path of prevention and they strive to avoid running into situations where they might need curesThey are not mere spreadsheets users but they are spreadsheet engineers. Their principal niche is financial modeling but this primary activity they are doing with spreadsheets and they are using spreadsheets in a really commendable manner. And how they are doing this? By following the FAST standard template of course. The FAST standard in the domain of financial modeling based on spreadsheets use is almost equivalent a secret genetic template in the hands of God to create healthier human beings! (This is merely my personal opinion at present point of time and may sound a bit exaggerated but yet the exaggeration has good substance behind it, this I can guarantee). If you carefully consider the best practices laid down by the FAST standard that is rigidly followed at F1F9 and compare it to the myriad of problems highlighted in Felienne Hermans research work, you are bound to feel (and eventually logically conclude) that such problems are very unlikely to occur in spreadsheets created in F1F9 or anywhere else where FAST standard is being followed! The cases studied under the research were mostly replete with everything that the FAST standard categorically asks to avoid and devoid of everything the FAST standard tries to religiously enforce. So in one word F1F9 had definitely taken one step forward in the process of spreadsheets evolution and the FAST standard is a golden key leading towards a bright future for the domain.