
Code Smells in Spreadsheet Formulas Revisited on an Industrial Dataset

tl;dr


A few years ago, Felienne adapted code smells to make them applicable to spreadsheet formulas. The smell detection algorithm was validated on a small dataset of industrial spreadsheets by interviewing the users of these spreadsheets. Now a more in-depth validation of the algorithm has been done by analyzing a new set of spreadsheets for which users indicated whether or not they are smelly.

This new dataset gives us the unique opportunity to gain more insight into how we can distinguish 'bad' spreadsheets from 'good' spreadsheets. We do that in two ways: for both the smelly and non-smelly spreadsheets, we 1) calculated the metrics that detect the smells and 2) calculated metrics with respect to size and level of coupling. The results show that the metrics for the smells indeed decrease in spreadsheets that are not smelly. With respect to size, we found to our surprise that the improved spreadsheets were not smaller but bigger. With regard to coupling, both datasets are similar. This indicates that it is difficult to use metrics for size or degree of coupling to draw conclusions about the complexity of a spreadsheet.


Introduction

We are probably all familiar with the concept of code smells. A few years ago, Felienne adapted code smells for use in spreadsheets. This year we had the opportunity to apply these code smells to a very interesting industrial dataset. But before I explain the uniqueness of this dataset, let's do a quick review of the code smells. In her paper, Felienne introduced the following code smells:

Smells

Multiple Operations: Inspired by the code smell Long Method, this smell indicates the length of the formula. It measures the total number of operations that a formula contains. The figure below shows a formula suffering from this smell, with a total of 15 unique operations.

Multiple Operations – fifteen unique operations within one formula

Multiple References: Another well-known code smell is Many Parameters. The spreadsheet formula equivalent is Multiple References. It counts the number of ranges a formula refers to. An example of this smell is a formula with 79 references.

Multiple References – Excel formula with 79 references
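
To get a feel for these two metrics, below is a minimal Python sketch that approximates them on a formula string. The regular expressions are a simplification of our own; the actual smell detector parses formulas properly rather than pattern-matching them.

import re

# Function calls such as SUM( or IF( ; a rough proxy for "operations".
FUNCTION = re.compile(r'\b([A-Z][A-Z0-9]*)\s*\(')
# Cell or range references such as B7, A1:A5, $C$2:$C$9.
REFERENCE = re.compile(r'\$?[A-Z]{1,3}\$?\d+(?::\$?[A-Z]{1,3}\$?\d+)?')

def multiple_operations(formula: str) -> int:
    """Multiple Operations: number of unique functions used in the formula."""
    return len(set(FUNCTION.findall(formula.upper())))

def multiple_references(formula: str) -> int:
    """Multiple References: number of ranges the formula refers to."""
    return len(REFERENCE.findall(formula.upper()))

print(multiple_operations('=IF(SUM(A1:A5)>MAX(B1:B5),AVERAGE(C1:C5),0)'))  # 4
print(multiple_references('=SUM(A1:A5)+B7+SUM(C1:C9)'))                    # 3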

Conditional Complexity: Many nested conditional operations are considered a threat to code readability. The same is true for spreadsheet formulas. The Conditional Complexity smell measures the number of conditionals contained in a formula. Below you see a formula with 7 nested IF functions; this was the maximum number of nested IFs allowed up to Excel 2003.

Conditional Complexity – 7 levels of nested IFs
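
Below is a minimal sketch of how the nesting depth of IFs could be measured on a formula string; it tracks which opening parentheses belong to an IF. Again, this is our own illustration, not the detector's actual implementation.

def conditional_complexity(formula: str) -> int:
    """Maximum nesting depth of IF( calls in the formula."""
    f = formula.upper()
    depth = max_depth = 0
    stack = []  # True for a parenthesis opened by IF(, False otherwise
    i = 0
    while i < len(f):
        # Guard against names that merely end in IF, e.g. COUNTIF(
        if f.startswith('IF(', i) and (i == 0 or not f[i - 1].isalnum()):
            stack.append(True)
            depth += 1
            max_depth = max(max_depth, depth)
            i += 3
        elif f[i] == '(':
            stack.append(False)
            i += 1
        elif f[i] == ')':
            if stack and stack.pop():
                depth -= 1
            i += 1
        else:
            i += 1
    return max_depth

print(conditional_complexity('=IF(A1>0,IF(B1>0,IF(C1>0,1,2),3),4)'))  # 3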

Long Calculation Chain: In spreadsheets, it is common for formulas to refer to other formulas. One could therefore say that a spreadsheet consists of a collection of calculation chains. Tracing a long calculation chain is considered a tedious task. The smell is measured by the length of the longest path of cells that need to be referenced when computing the value of the formula.
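
Because a workbook's reference graph is acyclic (Excel rejects circular references by default), the longest chain can be computed with a memoized depth-first traversal. A minimal sketch, assuming the formulas have already been parsed into a hypothetical map from each cell to its direct precedents:

from functools import lru_cache

# Hypothetical parsed dependency graph: cell -> cells it directly references.
precedents = {
    'D1': ['C1'],
    'C1': ['B1', 'B2'],
    'B1': ['A1'],
    'B2': [],  # input cells reference nothing
    'A1': [],
}

@lru_cache(maxsize=None)
def chain_length(cell: str) -> int:
    """Length of the longest reference path starting at `cell`."""
    refs = precedents.get(cell, [])
    return 1 + max(chain_length(r) for r in refs) if refs else 0

print(chain_length('D1'))  # 3: D1 -> C1 -> B1 -> A1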

Duplicated Formulas: The equivalent of the Duplicate Code smell in spreadsheets is the Duplicated Formulas smell. It is measured by the number of identical formulas that are located in the same worksheet and contain at least one function or operator. Below, in row 39, you can see an example of four identical formulas.

Duplicated Formulas
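
A minimal sketch of counting this smell, under the assumption that a worksheet's formulas are given in relative R1C1 notation, so that a formula dragged across a row compares equal to its copies:

from collections import Counter

def duplicated_formulas(formulas: list) -> int:
    """Number of identical formulas with at least one function or operator."""
    with_logic = [f for f in formulas if any(ch in f for ch in '+-*/(')]
    counts = Counter(with_logic)
    return sum(n for n in counts.values() if n > 1)

row39 = ['=SUM(R[-6]C:R[-1]C)'] * 4 + ['=5', '=R[-1]C*2']
print(duplicated_formulas(row39))  # 4 identical SUM formulas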

However, the study in which these code smells were introduced had two limitations. First of all, the dataset was very small: it consisted of only 10 spreadsheets, and they were all from the same company. Secondly, we didn't know if any of these spreadsheets were suffering from smells. We want to use smells as an indicator of bad or smelly spreadsheets, but it could very well be the case that the 10 spreadsheets in the dataset were very well designed and nevertheless contained smells. What we were missing was a ground truth about their smelliness.

The dataset

In early 2014 we had the opportunity to work together with one of our industrial partners: F1F9. They are the world's largest financial model building firm. They develop financial models in Excel for their customers, often based upon the customer's existing spreadsheet models. They provided us with a very unique dataset of 54 pairs of spreadsheets¹. What makes the dataset unique is the fact that we have a customer spreadsheet and a matching spreadsheet that F1F9 created based upon it. In other words, we have pairs of spreadsheets with and without 'treatment'. Furthermore, it is reasonable to assume that the customer spreadsheets contain some problems; if that were not the case, the customer wouldn't ask F1F9 to rebuild the model. This dataset thus overcomes the two limitations we mentioned earlier. The dataset is much larger: we now have 108 spreadsheets coming from different companies, and we know that half of them are smelly. The dataset also enables us to do a pairwise comparison.

Were the smells reduced?

The question that immediately popped into our minds was: do the F1F9 sheets contain fewer smells? To answer that question we analyzed the spreadsheets, and indeed we saw that the occurrence of smells in the F1F9 sheets was much lower. This was true for four of the five smells. The exception was the Long Calculation Chain smell: its occurrence was slightly higher in the F1F9 sheets than in the customer sheets.

Now what did the model builders of F1F9 do differently from the customers? First of all, they build financial models for a living. But another important difference is that they use a strict method for building spreadsheets: the so-called FAST Standard. In this standard we find a set of guidelines that helps us understand why the F1F9 sheets contain fewer smells.

First of all, FAST strongly advises creating short, easy-to-understand formulas. This of course helps to decrease the occurrence of the Multiple Operations and Multiple References smells. Furthermore, FAST discourages the use of the IF function, which reduces the occurrence of the Conditional Complexity smell. A third guideline within FAST is that formulas should be consistent along the row or column axis, meaning that a formula should be created once and then dragged to the right or to the bottom. Also, a calculation should only be made once. Both of the latter rules help to reduce the Duplicated Formulas smell.

But what about the Long Calculation Chains? Why do they occur more often in the F1F9 sheets? This is a logical result of reducing the Multiple References and Multiple Operations smells: reducing these two smells means breaking long formulas into shorter parts, which inevitably leads to longer calculation chains.

What about size and coupling?

So this settles the story with respect to smells, but it does not end the story for the study. In source code analysis, besides smells, other metrics with respect to size and coupling are used to analyze the quality of the code. We therefore did the same for the spreadsheets. We defined several metrics for size (# non-empty cells, # worksheets, # formulas, # unique formulas, and length of formula) and coupling (# external links per spreadsheet, # interworksheet connections, path depth, and # transitive precedents) and calculated these metrics for both the customer and F1F9 datasets.
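
As an illustration, here is a short sketch of how the size metrics could be computed with the openpyxl Python library. The file name is a placeholder, and the coupling metrics are omitted, as they require building the full dependency graph first.

import openpyxl

wb = openpyxl.load_workbook('model.xlsx')  # default load keeps formula strings

non_empty = formula_count = 0
unique_formulas = set()
for ws in wb.worksheets:
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            non_empty += 1
            if isinstance(cell.value, str) and cell.value.startswith('='):
                formula_count += 1
                unique_formulas.add(cell.value)

print('# worksheets:     ', len(wb.worksheets))
print('# non-empty cells:', non_empty)
print('# formulas:       ', formula_count)
print('# unique formulas:', len(unique_formulas))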

Intuitively, we expected that the customer sheets would be bigger and have a higher degree of coupling (knowing that more lines of code and a high level of coupling between methods have a negative impact on code quality). To our surprise, the results proved otherwise. For four of the five size metrics, the F1F9 sheets were bigger than the customer sheets. With respect to coupling they scored similarly, but the F1F9 sheets had significantly more interworksheet connections per spreadsheet.

So again we analyzed the FAST standard to understand these findings. The standard gives guidelines for organizing a spreadsheet model. It distinguishes between Foundation, Workings, Presentation and Control sheets. Foundation sheets contain all the input values and assumptions of the model. All calculations are done within the Workings sheets, and the results of the calculations are presented in the Presentation sheets. The Control sheet is mainly used during the development of the model; it normally contains lists of pending changes, version control, a table of contents, etc. This guideline, however, increases the number of worksheets and makes the spreadsheet bigger. Furthermore, according to the FAST standard, a consistent column structure should be maintained across all sheets. Within financial models it is not uncommon to work with a time horizon of 30 years, leading to at least 30 columns in every worksheet. And because all formulas should be consistent along rows and columns, all these formulas are repeated. These rules increase the number of non-empty cells and the number of formulas.

Finally, FAST advises executing all calculations in so-called calculation blocks. A calculation block consists of all the ingredients that are necessary for a calculation. Ingredients can also be used in other calculations; in such a case, the ingredients are repeated to form a new calculation block. This increases both the number of non-empty cells and the number of interworksheet connections.

Conclusion

So what did we learn from this study?

  • Smells are a good indicator for identifying problematic spreadsheets.
  • By applying the guidelines of the FAST standard, users decrease the occurrence of smells (even without knowing the definitions of the smells themselves).
  • Metrics with respect to size and coupling do not succeed in differentiating between smelly and non-smelly spreadsheets.

We presented the results of this study at ICSME ’15 in Bremen on September 29th. A preprint of the paper can be found here.

The slides we used during the presentation are shown below:

Do you have any questions, remarks, insights, etc.? Feel free to share them with us. You can use the contact form below to leave us a comment.


¹ To protect the confidentiality of the models, we only had access to the dataset on F1F9's premises, and then only indirectly, whereby our software automatically generated and stored only the necessary survey statistics. At no point did we have direct access to the models, nor did our software extract any commercial data from the models.

 

New Paper: Business Rule Mining from Spreadsheets

In this post I talk about my new paper, which will be published in the proceedings of the 2nd Workshop on Software Engineering Methods in Spreadsheets (SEMS '15) in Florence, Italy. It is a position paper describing our aim to automatically extract business rules from spreadsheets. [Pre-print on Arxiv.org is here]

What are Business Rules?
In her book Business Rules Applied: Building Better Systems Using the Business Rule Approach, Barbara von Halle writes that a business rule is a "statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behavior of the business". The term business rule is thus an over-encompassing term covering all sorts of assertions that govern the operations of a business. A purist's definition of a business rule has a specific syntax associated with it and a fixed method of framing the rules; however, the essence remains the same, i.e. such rules are all the assertions that unambiguously state the actions or operations necessary to keep a business running. Business rules therefore contain the key business knowledge of an organization or a domain.
What do we know about businesses and organizations?
We know that a vast majority of businesses and organizations use spreadsheets for their operations. It is therefore needless to say that spreadsheets are gold mines of crucial business knowledge, with a plethora of business rules implemented inside them. Inside spreadsheets, these rules are usually hidden behind cryptic formulas, and the heavy presence of numerical data makes them difficult to discern.
What can we do about it?
By extracting the business rules from spreadsheets, we can reap the business knowledge embedded inside them and make it available in a more readable format. And we intend to do that in an automated way, with a software tool.
What are the potential benefits of doing that?
1) High-Level Analysis of Spreadsheets – Extracting business rules enables the generation of documentation for spreadsheets at a higher abstraction level than the spreadsheets themselves. This facilitates the following:
a) Comprehension – It becomes easier for end users, who are typically not programmers, to understand the structure and operation of large and complex spreadsheets, helping them work efficiently with, or modify, such spreadsheets with fewer errors and mistakes.
b) Comparison – It becomes possible to compare spreadsheets in order to estimate whether they implement the same or similar functionality, or are even identical in behavior and differ only in data values. The latter cannot be determined, for example, by an application that compares spreadsheets at the data and formula level.
c) Validation – Organizations using a set of well-formed, pre-laid business rules can validate whether the spreadsheets created by their employees accurately implement those rules, or whether there are errors at the logical level.
2) Understanding of Organizational Business Rationale – Some organizations may not have their business strategies well laid out in business rule format, yet the vital business knowledge of experts working in the company is hidden in spreadsheets. Extracting this knowledge would help form a clear picture of how that organization works and how it is structured.
3) Support for Migration – IT architects need to understand the business logic when migrating functionality and computations implemented in spreadsheets into conventional software. Furthermore, business analysts need to ensure that the IT architects understood it correctly. This can be achieved through knowledge extraction, and an automated process would help greatly in this regard.
4) Safe Re-use and Replication of Spreadsheets – Spreadsheets are often created on an ad hoc basis by experts in an organization to implement their unique strategies for certain scenarios. Over time such spreadsheets grow in size and complexity and are used by several employees for similar scenarios but with different data sets. Invariably, the users are forced to copy-paste to replicate the original spreadsheet and customize it to their needs by manipulating data and formulas. However, this process has been found to be extremely error-prone. It is probably safer to re-generate spreadsheets from scratch using the blueprint or structure of the original spreadsheet instead of copy-pasting. Automated business rule extraction can facilitate such blueprint generation and thus make replication of spreadsheets safer.

What goal have we set?
We want to devise an algorithm, and subsequently an application, that can automatically extract business rules from spreadsheets. Once that is done, we want to find answers to the following two research questions:

RQ1: How accurate are automatically extracted business rules compared to those extracted manually by domain experts and spreadsheet users?
RQ2: How efficient is the automatic extraction process compared to manually extracting business rules from spreadsheets?

We aim to do user studies and controlled experiments in which we will compare the results of automatic and manual extraction of business rules from spreadsheets.
What does a typical problem look like?

[Figure: example spreadsheet with revenue blocks and year columns; cell E19 contains the formula SUM(E13:E18)]

Typical spreadsheets implement business rules to calculate results. For example, in the above figure the cell E19 contains the formula SUM(E13:E18). From this formula our algorithm has to infer the business rule "Total earned revenue = Admissions + … + Other earned revenue".
Mapping E13:E18 to Admissions…Other earned revenue is straightforward. However, there is more to determine, as the Total Earned Revenue is divided into columns for Last Year, Current Year, etc. The mapping thus becomes two-dimensional.
Furthermore, a parser will reach three blank rows and an auxiliary header row (actuals, budget, etc.) before it reaches the "Year" column header row.
Making things even more challenging, the whole structure is repeated in vertical blocks, viz. Earned Revenue and Private Sector Revenue.
When mapping the rule "Total private sector revenue = …", the parser will encounter formulas in the 19th row instead of reaching the column headers! Thus the same formula, repeated both vertically (in blocks) and horizontally (in year columns) yet semantically distinct, poses a considerable challenge.
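
To illustrate the easy half of the problem, below is a small self-contained Python sketch that recovers the row-label part of the mapping for the SUM formula above. The grid contents and the label column are hypothetical stand-ins for the figure; the hard parts just discussed (two-dimensional headers, blank rows, repeated blocks) are exactly what this naive version does not handle.

# (row, column) -> cell value; column 2 holds the row labels,
# column 5 is one of the year columns from the figure.
grid = {
    (13, 2): 'Admissions',           (13, 5): 11000,
    (14, 2): 'Subscriptions',        (14, 5): 5200,
    (18, 2): 'Other earned revenue', (18, 5): 800,
    (19, 2): 'Total earned revenue', (19, 5): '=SUM(E13:E18)',
}

def rule_for_sum(target_row, first_row, last_row, label_col=2):
    """Turn a SUM over rows first_row..last_row into a readable business rule."""
    total_label = grid.get((target_row, label_col), '?')
    parts = [grid[(r, label_col)] for r in range(first_row, last_row + 1)
             if (r, label_col) in grid]
    return f'{total_label} = ' + ' + '.join(parts)

print(rule_for_sum(19, 13, 18))
# Total earned revenue = Admissions + Subscriptions + Other earned revenue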
What are the principal difficulties?
Spreadsheets do not impose structural uniformity with regard to layout. This inherent flexibility of spreadsheets is at the same time an advantage and a disadvantage: it makes the mapping between data and labels vary to a large extent from spreadsheet to spreadsheet. For extracting business rules, this will be a crucial challenge to overcome.

Aragorn User Study

In my last blog post I talked about the Aragorn tool. I am happy to announce that since that post, which was quite some time back, Aragorn has evolved quite a bit. It is much more robust now and has fewer bugs. It has also gone through some functional changes!

 

Aragorn is currently available as an Excel 2010 add-in.

 

Results of the user study evaluating Aragorn

  1. Category A Questions: Responses_All_141124
  2. Category B Questions:
    1. Scenario 1: Assume it is Friday afternoon and you are eagerly waiting for the weekend to start. Your own assigned tasks are pretty much done, so you are quite relaxed. However, your team lead/manager calls you up and gives you an urgent task. One of your colleagues, who had been working on a model workbook, is absent due to sickness. Your lead was working on it but has suddenly been called into an urgent management-level meeting, so she asks you to finish the remaining modifications on the workbook, as things can't wait. So now you are left with a workbook you have probably never laid eyes on before, and you have to make changes within the span of a few hours.
    2. Scenario 2: Assume you are running a heavy schedule on a day full of meetings. You receive a call from the client side requesting that the model workbooks that were to be sent be sent a week early, because due to forecasted market dynamics they really need them earlier than expected. You hurry to check the work status with your team members and find that the work is almost ready, but some errors have come up in testing. To speed up the whole process, you join in the activity of removing the errors from these workbooks, which your junior team members have created. Results: Responses_All_141122
  3. Windows 7 installer for Aragorn: Download (read the warnings below before trying)

A quick summary of what Aragorn provides:

  • It gives you an instantaneous warning about dependents of cells you are about to modify, even when they are situated in different worksheets.
  • It requires one-time processing of your Excel workbook; for workbooks on the order of 12–14 MB, processing takes about half a minute.
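
The second point reflects a common design trade-off, which the following sketch illustrates (a simplification of our own, not Aragorn's actual implementation): pay the parsing cost once by inverting all formula references into a dependents index, so that every later lookup is a plain dictionary access.

import re
from collections import defaultdict

# Matches an optional sheet prefix plus a cell, e.g. Input!A1 or B2.
CELL = re.compile(r"(?:'?([A-Za-z0-9 _]+)'?!)?\$?([A-Z]{1,3})\$?(\d+)")

def build_dependents_index(formulas):
    """formulas: (sheet, cell) -> formula text. Returns cell -> set of dependents."""
    index = defaultdict(set)
    for (sheet, cell), formula in formulas.items():
        for ref_sheet, col, row in CELL.findall(formula):
            ref = (ref_sheet or sheet, f'{col}{row}')  # same sheet if no prefix
            index[ref].add((sheet, cell))
    return index

formulas = {
    ('Costs', 'B2'): '=Input!A1*2',
    ('Totals', 'C1'): '=Input!A1+Costs!B2',
}
index = build_dependents_index(formulas)
print(index[('Input', 'A1')])  # cells to warn about before modifying Input!A1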

Please go through our original post on Aragorn for a better picture of it.



Warning & Disclaimer

Aragorn is a research initiative and is at an experimental stage. Please make sure you have adequate backup copies of the workbooks on which you want to try it out. Trying Aragorn on data or workbooks of critical importance to your work is not recommended at present.

Once you activate Aragorn on a workbook, any changes you make will by default be saved when the workbook closes, even if you do not save them proactively.

Also note that if you make changes, you need to proactively save the workbook in order for them to be reflected in AraSense's behavior within the same session. When you save, AraSense re-processes your workbook to take the new changes into account.

While AraSense is turned ON, the 'undo' feature does not work; turning it OFF restores it.

New paper: Reviewing Research on Spreadsheet Dependence Tracing Techniques

 

Hi all. In my blog post Dependence Tracing: Need for an Investigation, I talked about what dependence tracing in spreadsheets is and why existing techniques need to be reviewed. We did a study of the research that has been done on the topic, and I am happy to inform you that the paper documenting the study and its conclusions will be published in the proceedings of SEMS '14. In this post I briefly highlight some key aspects of our findings. I recommend my previous post Dependence Tracing: Need for an Investigation to those interested in a little background information on the problem of dependence tracing.

We were particularly interested in understanding the state of the art of research on dependence tracing. Our principal motivation arose from feedback we had obtained in previous user studies and surveys, where there was clear dissatisfaction among industrial spreadsheet users with the available dependence tracing tools and enhancements. To put it frankly, it is a unanimously accepted fact that MS Excel is by far the most popular spreadsheet application used in the industry today. The dependence tracing feature built into MS Excel is the blue tracing arrows that point to precedent or dependent cells. This technique is evidently not the most convenient one for all scenarios that arise for an end user creating or modifying spreadsheets. It has some known issues, like:

  • wastage of effort and time (one mouse click too many),
  • ambiguity (arrows pointing to cells in the same row end up superimposed on one another, creating confusion), and
  • interference (it does indicate the dependents, but it also hampers the user's visual clarity by cluttering the worksheet with arrows), which makes it inconvenient to work on the sheet and grasp the dependence structure at the same time.

[Figure: Excel's built-in Trace Dependents arrows]

 

Therefore, our hypothesis in the context of the paper was that the better dependence tracing techniques and tools developed in scientific research were not adequately making it across to the industry to benefit real-life spreadsheet users. Based on this hypothesis we identified three research questions:

  1. Why are the research ideas not getting across to the industry as popular, user-friendly implementations?
  2. Is there a standardized metrics framework with which such implementations can be compared against each other in terms of their effectiveness for real-life spreadsheet users?
  3. Is there a clear opportunity for improvement in the dependence tracing context?

In order to find answers to these questions, our first step was a review of the existing research literature on the topic, and this preliminary review is essentially what makes up the paper.

We shortlisted five published research papers for our study, each describing the details of an innovative idea for dependence tracing. Our approach was to characterize these innovations based on the following aspects:

  • The basic technique/principle/strategy
  • Tools or prototypes developed if any
  • Comments or details available on testing, performance, and limitations of the tool
  • Current status of the research and its implementation, and its perceived relevance or influence in the industrial scene
The five papers, identified here by year of publication, can be summarized as follows:

  1. 1998 – Technique: fluid visualization and animation. Prototype: Pad++ (now defunct) and Python for UNIX. Comments: performance drastically worsens beyond 400 cells; no user study; no implementation plan given. Industrial relevance: some ideological influence can be observed in MS Excel 2007 onwards.
  2. 1999 – Technique: 3D interactive lift-up operation. Prototype: OpenGL APIs and SLSC for UNIX. Comments: none available. Industrial relevance: none.
  3. 2000 – Technique: visual indications superimposed on Excel sheets, such as arrows, highlights and shading. Prototype: VBA plug-in for Excel. Comments: none available. Industrial relevance: no standardized, definite influence; however, similar techniques are used on an ad hoc basis by expert spreadsheet users, individually or internally within organizations.
  4. 2003 – Technique: different types of graphs and trees depicting the spreadsheet dependency structure. Prototype: Java tool for Excel workbooks. Comments: successfully tested on a corpus of 259 Excel workbooks of undisclosed source; no user acceptance or comprehension study. Industrial relevance: the authors apparently did not carry forward their work on this project.
  5. 2011 – Technique: leveled dataflow diagrams. Prototype: C# toolkit that accepts Excel workbooks as input. Comments: extensive user testing with 27 industrial spreadsheet users at a financial asset management company. Industrial relevance: currently part of a service offered by the company Infotron.nl.

 

Thus, from this review we observe that:

  1. All 5 papers provide innovative ideas to simplify dependence tracing with visualization aids.

=>  There is no lack of ideas in this context.

  2. All 5 papers provide either tools or prototypes to demonstrate the techniques.

=> The ideas are quite conceivable.

  3. 2 out of 5 prototypes are based on defunct or obsolete technology and have not been ported to newer, current technologies. The one based on VBA (made in 2000) is supposed to work for all Excel versions but remains to be tested for version incompatibilities. Only 2 out of 5 are made with Java or C# and can be considered modern enough to actually work at present.

=> There is lack of recent work done based on current state of technology that can be used at present in the industry.

  4. Only 3 out of 5 were properly tested. One showed very poor performance: 400 cells is nothing compared to the number of cells found in spreadsheets used in the industry! No solution or approach was provided for scaling this up.

=> All ideas that are conceived need to be tested for performance. If that is not done, it is evident why they will not make it across to the industry and actually help real-life spreadsheet users.

  5. Only 1 out of 5 was tested for usability and user acceptance!

=> Innovative ideas are all exciting, but they are meaningless if actual users don't find them helpful. Without user studies, the effectiveness of ideas or research cannot be established.

  6. Only 1 out of 5 actually ended up having industrial exposure. All the rest were practically aborted or abandoned by their creators.

=> There must be serious issues in making these tools popular with real-life spreadsheet users; or else the researchers working in the domain are not sufficiently motivated to do anything about those issues.

 

We can now revisit our research questions taking into account the above observations:

  1. Why are the research ideas not getting across to the industry as popular, user-friendly implementations?

Although the reason behind this is still not entirely clear, one explanation seems to be a plain lack of motivation on the part of researchers: otherwise, why not follow up on your innovation and see to it that it gets implemented? An underlying cause may be the pressure in academia, which rewards writing many papers on new ideas but does not give equal limelight to follow-ups and industrial implementations. This may cause researchers to abandon their projects at the conception stage and move on to newer projects.

Another reason seems to be the difficulty of performance scaling. Spreadsheets are tricky objects, with sizes varying from small to very large; the range of sizes is quite phenomenal. This might be why innovative ideas prove difficult to put into real practical use, a reason suggested by the fact that most of the projects we covered in the review were not extensively tested.

  2. Is there a standardized metrics framework with which such implementations can be compared against each other in terms of their effectiveness for real-life spreadsheet users?

Evidently there is none, and it is no wonder that only one of the projects was tested extensively with a group of real-life spreadsheet users. Coming up with a set of standards and metrics that define the usability or user-friendliness of spreadsheet aid applications is an important necessity. That would make it easier to establish the true worth of innovative techniques against the backdrop of their effectiveness for real-life users.

  3. Is there a clear opportunity for improvement in the dependence tracing context?

Based on our review and the observations obtained, this question clearly remains wide open. There is ample opportunity to research and build new tools that actually make a difference in the spreadsheet industry in the form of reliable, user-friendly implementations: not only tools created on an ad hoc basis for internal use within organizations, but robust, standardized ones that can be widely adopted throughout the industry.