Tag Archives: Sohon

New Paper: Business Rule Mining from Spreadsheets

In this post I talk about my new paper, which is going to be published in the proceedings of the 2nd Workshop on Software Engineering Methods in Spreadsheets (SEMS ’15) in Florence, Italy. It is a position paper describing our aims to automatically extract business rules from spreadsheets. [Pre-print in Arxiv.org is here]

What are Business Rules?
In her book Business Rules Applied: Building Better Systems Using the Business Rule Approach, Barbara von Halle writes that a business rule is a “statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behavior of the business”. Thus the term business rule is somewhat an over-encompassing term used to imply all sorts of assertions that govern the operations of a business. A purist’s definition of business rule has a specific syntax associated with it and fixed method of framing the rules, however the essence remains same i.e. such rules are all the assertions that unambiguously states the actions or operations that are necessary to keep a business running. Therefore business rules are what that contain key business knowledge of an organization or a domain.
What we know about businesses and organizations?
We know that a vast majority of businesses and organizations use spreadsheets for their operations. Therefore it is needless to say that spreadsheets are gold mines of crucial business knowledge with plethora of business rules implemented inside them. Inside spreadsheets they are usually hidden behind cryptic formulas and the heavy presence of numerical data make them difficult to be discerned.
What we can do about it?
By extracting the business rules from spreadsheets we can reap the business knowledge embedded inside them and make it available in a better readable format. And we intend to do that in an automated way with a software tool.
What are the potential benefits of doing that?
1) High Level Analysis of Spreadsheets – Extracting business rules enables generation of documentation for spreadsheets at a higher abstraction level than the spreadsheets themselves. This facilitates the following:
a) Comprehension – It becomes easier for end-users, who are typically not programmers, to understand the structure and operation of large and complex spreadsheets helping them efficiently work with or modify such spreadsheets with reduced errors and mistakes.
b) Comparison – Comparing spreadsheets becomes possible in order to estimate whether they implement same or similar functionalities, or even are identical behavior-wise only differing in data values. The latter cannot be done for example by an application that compares spreadsheets in data and formula level.
c) Validation – Organizations using set of well-formed and pre-laid business rules can validate whether the spreadsheets created by their employees accurately implement those rules or if there are errors in the logical level.
2) Understanding of Organizational Business Rationale – Some organization may not have their business strategies well laid out in business rule format; yet vital business knowledge of experts working in the company is hidden in spreadsheets. Extracting this knowledge would help to form a clear picture of how that organization works and its structure.
3) Support for Migration – IT architects need to understand the business logic when migrating functionalities and computations implemented in spreadsheets into conventional software. Furthermore business analysts need to ensure that the IT architects understood it correctly. This can be achieved through knowledge extraction and an automated process would largely help in this regard.
4) Safe Re-use and Replication of Spreadsheets – Often spreadsheets are created on ad-hoc basis by experts in an organization to implement their unique strategies for certain scenarios. Over time such spreadsheets grow in size and complexity and are used by several employees for similar scenarios but with different data sets. Invariably the users are forced to employ the method of copy-paste to replicate the original spreadsheet and customize it according to their needs by manipulating data and formula. However this process has been found to be extremely error-prone. It is probably safer to re-generate spreadsheets from scratch using the blueprint or structure of the original spreadsheet instead of copy-pasting. Automated business rule extraction can facilitate such blueprint generation and thus make replications of spreadsheets safer.

What goal we have set?
We want to devise an algorithm and subsequently an application that can automatically extract business rules from spreadsheets. Once that is done, we want to find answers to the two following research questions:

RQ1: How accurate the automatically extracted business rules will be as compared to those extracted manually by domain experts and spreadsheet users?
RQ2: How efficient is the automatic extraction process compared to manually extracting business rules from spreadsheets?

We aim to do user-studies and controlled experiments in which we will compare the results of automatic and manual extraction of business rules from spreadsheets.
What a typical problem will look like?


Typical spreadsheets implement business rules to calculate results. For example in the above figure the cell E19 contains the formula SUM(E13:E18). From this formula our algorithm has to infer the business rule “Total earned revenue = Admissions+…+Other earned revenue”.
Mapping E13:E18 to Admissions…Other earned revenue is straightforward. However there is more to determine as the Total Earned Revenue is divided into columns for Last Year, Current Year, etc. Thus the mapping becomes two dimensional.
Furthermore a parser will reach three blank rows and an auxiliary header row (actuals, budget, etc.) before it reaches the “Year” column header row.
Making things even more challenging, the whole structure is repeated into vertical blocks viz. Earned Revenue, Private Sector Revenue.
When mapping the rule “Total private sector revenue=…” the parser will encounter formulas in the 19th row instead of reaching the column headers! Thus, same formula repeated both vertically (in blocks) and horizontally (in year columns), yet being distinct semantically, is a considerable challenge.
What are the principal difficulties?
Spreadsheets do not impose fixed structural uniformity with regards to layout. This is an inherent flexibility of spreadsheets which at the same time is an advantage and also a disadvantage. This makes the mapping between data and labels vary to a large extent from spreadsheet to spreadsheet. Towards extracting business rules, this will be a very crucial challenge to overcome.

Aragorn User Study

In my last blog post I talked about the Aragorn tool. I am happy to announce that since that post which was quite some time back, Aragorn has evolved quite a bit. It’s much more robust now and has lesser bugs. It has also gone through some functional changes!


Aragorn is currently available as Excel 2010 add-in.


Results of user-study for evaluation of Aragorn

  1.  Category A QuestionsResponses_All_141124
  2. Category B Questions:
    1. Scenario 1: Assume it is Friday afternoon and you are eagerly waiting for the week-end to start. Your own assigned tasks are pretty much done and so you are quite relaxed. However your team-lead/manager calls you up and gives you an urgent task. One of your colleagues who had been working on a model workbook is absent due to sickness. Your lead was working on it but suddenly has been called up for an urgent management level meeting so she requests you to finish the remaining modifications on the workbook as things can’t wait. So now you are left with a workbook which probably you have never laid your eyes upon and have to make changes in the span of a few hours.
    2. Scenario 2: Assume you are running a heavy schedule on a day full of meetings. You receive a call from the client side requesting that the model workbooks that were to be sent, be sent a week earlier; as due to forecasted market dynamics they really need it earlier than expected. You hurry to check work status from your team-members and find that work is almost ready but some errors have come up in testing. In order to hurry up the whole process you join in the activity of removing the errors from these workbooks which your junior team-members have created.      Results: Responses_All_141122
  1. Windows 7 installer for Aragorn: Download (Read warnings before trying)

A quick summary of what Aragorn provides:

  • It gives you instantaneous warning about dependents of cells you are about modify; even when they are situated in different worksheets.
  • It requires one-time processing of your Excel workbook and for workbooks of size of the order of 12 – 14MB the time required for processing is about half a minute.

Please go through our original post on Aragorn for a better picture about it.

Warning & Disclaimer

Aragorn is a research initiative and is in experimental stage. Please make sure you have adequate back-up copies of your workbooks on which you want to try it out. Trying Aragorn on data or workbook of critical importance to your work is not recommended at present.

Once you activate Aragorn on a workbook, any changes you made before closing will be by default saved even if you do not save it pro-actively.

Also note that if you make some changes, you need to pro-actively save the workbook in order to see the reflection of the changes in the Arasense behavior in the same session. When you save, Arasense will re-process your workbook in order to consider the new changes.

When AraSense is in turned ON state, the ‘undo’ feature does not work. If you turn it OFF it works.

Presenting Aragorn: A Floating Prompt System for Spreadsheets and the AraSense Dependence Tracking Feature


So what (who) is Aragorn?

We all know who Aragorn is! The mighty brave king of Men, Isildur’s heir, the skilled warrior, protector of the Fellowship of the Ring and the hobbits. Unfortunately we are not talking about him here. We are in the process of developing an Excel add-in tool, which quite similar to the real Aragorn, will help you protect yourself from orcs, goblins, and such nasty creatures (mistakes, errors, pitfalls) while you work on your spreadsheets.

What made us develop Aragorn?

We were trying to have a closer look at spreadsheet dependence tracing techniques. In my blog post I discussed some of our findings from a review of the research on this topic. We found that there was still enough need for good and reliable dependence tracing techniques which can be adequately and effectively adopted by the industry in a standardized and popular way. In 2013, in her dissertationDr. Felienne Hermans wrote about some responses from spreadsheet users –


We were particularly motivated by a scenario where you, as a spreadsheet user, have the task of coming up with some calculation or financial estimate after some data crunching in a fairly large complex spreadsheet which you have not created yourself. Imagine, to obtain the required results you need to change some parameter values, modify some cells, move data around here and there. However not knowing well the full design structure of the spreadsheet you definitely do not want your changes to affect the whole model so that the results you obtain turn out to be wrong! On the other hand perhaps it´s not even worth it for you to spend days in trying to understand the spreadsheet model as it may be for a one-off use in your case. This is exactly the type of scenario where a quick, real-time prompting system displaying cell dependency information can make things a lot easier. While this was our primary motivation, we also wanted to make sure that the tool was easy and lucrative enough for you. Therefore we also aimed to ensure that the following key features were there in our tool:

  1. Effortless: We did not want to burden you with additional effort of multiple mouse clicks or peering into the screen trying to distinguish between overlapping blue arrows which all seem to point towards the same cell or no-where.
  2. Real-time: We wanted our system to be a process that is part of using the spreadsheet itself and not a separate activity to be done at a different time.
  3. Instantaneous: We wanted our system to be quick and momentary so that you do not have to wait for the information.
  4. Specific: We did not want to burden you with more information than you really need; it’s not always necessary or worthwhile to understand the whole structure of a spreadsheet when all you need to know is whether a change in one or two cells are going to affect other cells.

So what have we come up with?

The core of Aragorn is the floating prompt system. With these localized text-boxes very close to active cells, we can display crucial information about the selected cell in a transient manner.


In addition to this we have developed the feature called AraSense particularly to cater the need of dependence tracking especially for scenarios like I described above. It is our answer to the questions raised by the users.

We all know how the real Aragorn is also an expert ranger with superior tracking skills. He could track the trail of hobbits when they almost gave up hope of finding them alive!


Aragorn tracking hobbits’ trail.

Similarly the AraSense feature is the tracking power of the Aragorn tool. It can pre-sense dependencies and thus its called AraSENSE. It can do a one-time analysis of an opened Excel workbook, after which it will display a list of dependent cells in the floating prompts for every cell you click on. If you are not trying to edit the cell, normally the prompt will disappear automatically after a delay. If you are trying to edit a cell, the prompt will persist next to it as long as you are editing the cell. Even if the dependent cells are in a different worksheet, they will be listed in the prompt. You can turn the feature on or off based on your preference. No warning is shown for cells that do not have any dependents. Thus when the feature is turned on, if there is no prompt for a cell, then you can safely modify its contents without worrying. And if there is a prompt then you know which are the dependent cells and on which sheet they are. Thus you can be cautious and forewarned about the changes you make. The AraSense feature is effortless, real-time, instantaneous and specific. It fulfills all the aims we had about the system we wanted to give you.

So how it works?


So what we would like from you?

We would like to hear your thoughts on Aragorn and AraSense.

If you are a professional who uses spreadsheet at work, then do you think this tool will be helpful for you?

If you are not a spreadsheet user, but interested about end-user computing and works with usability aspects then do you think our tool addresses some key usability aspects?

Whoever you are, if you have read this and found it worthwhile of your time, we would like to thank you and would really like your thoughts and opinions. Encouragements are obviously appreciated but we would also eagerly welcome criticisms and suggestions for improvements.

New paper: Reviewing Research on Spreadsheet Dependence Tracing Techniques


Hi all. In my blog post Dependence Tracing: Need for an Investigation I talked about what is dependence tracing in spreadsheets and what is the necessity to review existing techniques to do the same. We did a study of the research that has been done on the topic and I am happy to inform you that the paper documenting the study and its conclusions is going to be published in the proceedings of SEMS’14. In this post I briefly highlight some key aspects of our findings. I recommend my previous post  Dependence Tracing: Need for an Investigation for those interested to have a little background information on the problem of dependence tracing.

We were interested particularly to understand the state of the art of research on dependence tracing. Our principal motivations arose from feedbacks we had obtained in previous user studies and surveys where there was a clear dissatisfaction among the industrial spreadsheet users about available dependence tracing tools and enhancements. To put very frankly it is an unanimously accepted fact that MS Excel is by far the most popular spreadsheet application used in the industry today. The feature that is available for dependence tracing in-built with MS Excel is the blue tracing arrows that points to precedent or dependent cells. This technique evidently is not the most convenient one for all types of scenarios that arise for an end user when she is creating or modifying spreadsheets. It has some known issues like:

  • wastage of effort and time (one mouse click too many),
  • ambiguity (arrows pointing to cells of the same row end up superimposing on each other creating confusion)
  • interference (it does indicate the dependents but it also hampers the user’s visual clarity by creating cluttered arrows all across the worksheet) that makes it inconvenient to both work on the sheet and get idea about dependence structure simultaneously.



Therefore it was our hypothesis under the paper’s context that better dependence techniques or tools developed under scientific research was not adequately making it across to the industry to benefit real-life spreadsheet users. Based on this hypothesis we identified three research questions:

  1. Why the research ideas are not getting across to the industry as popular and user-friendly implementations?
  2. Is there any standardized metrics framework with which such implementations as above can be compared against each other in their effectiveness to real-life spreadsheet users?
  3. Is there any clear opportunity of improvement in the dependence tracing context?

In order to find answers for these questions, our first step was a review of the existing research literature on this topic and this preliminary review is practically what makes up this paper.

We shortlisted five published research papers for our study, each having description and details of some innovative idea for dependence tracing. Our approach was to characterize these innovations based on the following aspects:

  • The basic technique/principle/strategy
  • Tools or prototypes developed if any
  • Comments or details available on testing, performance, and limitations of the tool
  • Current status of the research and its implementation, and its perceived relevance or influence in the industrial scene
Paper Published In Technique Prototype Prototype Comments Industrial Relevance
  1. 1998
Fluid Visualization & Animation Pad++ (now defunct) and Python for UNIX Performance drastically worsens after 400 cells.No user study. No implementation plan given. Some ideological influence can be observed in MS Excel 2007 onwards.
  1. 1999
3D Interactive lift-up operation OpenGL APIs and SLSC for UNIX None available.  None.
    3.  2000 Visual indications superimposed on Excel sheets like arrows, highlights, shading. VBA plug-in for Excel. None available. No standardized definite influence. However similar techniques are used in ad hoc basis by expert spreadsheet users on individually or internally to organizations.
  1. 2003
Different types of graphs and trees depicting spreadsheet dependency structure. Java tool for Excel workbooks. Successfully tested on a corpus of 259 Excel workbooks of undisclosed source.No user acceptance or comprehension study. Authors apparently did not carry forward their work on this project.
  1. 2011
Leveled dataflow diagrams C# toolkit that accepts Excel workbooks for input. Extensive user testing with 27 industrial spreadsheet users in a financial asset management company. Is currently a part of service offered by the company Infotron.nl


Thus, from this review we observe that:

  1. All 5 papers provide innovative ideas to simplify dependence tracing with visualization aids.

=>  There is no lack of ideas in this context.

  1. All 5 papers provide either tools or their prototypes to demonstrate the techniques.

=> The ideas are quite conceivable.

  1. 2 out of 5 prototypes are based on defunct or obsolete technology. They have not been propagated to newer current technologies. One that is based on VBA (made in 2000) is supposed to work for all Excel versions but is actually subject to be tested for version incompatibility. Only 2 out of 5 are made with Java or C# and can be considered as modern enough to actually be working at present.

=> There is lack of recent work done based on current state of technology that can be used at present in the industry.

  1. 3 out of 5 were properly tested. One showed very poor performance as 400 cells is nothing compared to the number of cells that is found in spreadsheets used in the industry! No solution or approach was provided about how to scale this up.

=> All ideas that are conceived need to be tested for performance. If that is not done, then it is evident why it will not find itself across to the industry and actually help real-life spreadsheet users.

  1. Only 1 out of 5 was tested for useability and user-acceptance!

=> Innovative ideas are all exciting, but they are meaningless if actual users don’t find them helpful. Without user studies the effectiveness of ideas or research can not be established.

  1. Only 1 out of 5 actually ended up having industrial exposure. All the rest were practically aborted or left off by their creators.

=> There must be serious issues in making these tools actually popular to the real life spreadsheet users; or else the researchers working in the domain are not properly motivated to do anything about the problems.


We can now revisit our research questions taking into account the above observations:

  1. Why the research ideas are not getting across to the industry as popular and user-friendly implementations?

Although still not quite clear as to what is the reason behind this, one reason seems to be just plain lack of motivation on part of researchers. Otherwise why not follow up on your innovation and see to it that it gets implemented. A reason behind this reason can also be the pressure in the academia which encourages writing tons of papers on new ideas but does not give equal limelight to follow-ups and industrial implementations!? This may cause researchers to abandon their projects at the conception stage and move on to newer projects.

Another reason seems to be difficulty in performance scaling. Spreadsheets are tricky objects with sizes varying from small to very large. The range of sizes is quite phenomenal. This might be a cause due to which innovative ideas are found difficult to be put into real practical use? This reason is indicated by the fact that most of the projects we covered in the review were not extensively tested.

  1. Is there any standardized metrics framework with which such implementations as above can be compared against each other in their effectiveness to real-life spreadsheet users?

Evidently there is none. And no wonder why only one of the projects was tested extensively with a group of real-life spreadsheet users. This is a very important necessity. To come up with a set of standards and metrics that can define useability or user-friendliness of spreadsheet aid applications. In that way it will be easy to establish the true worth of innovative techniques as laid against backdrop of effectiveness to real-life users.

  1. Is there any clear opportunity of improvement in the dependence tracing context?

Based on our review and the observations obtained, it is doubtless that this question is still a very much wide open one. There can be enough opportunity found to research and come up with new tools that actually make a difference in the spreadsheet industry in form of reliable and user-friendly implementations. And not only ones that are created in ad hoc basis for internal use in organizations but robust standardized ones that can be widely adopted through out the industry.