In this post I talk about my new paper, which is going to be published in the proceedings of the 2nd Workshop on Software Engineering Methods in Spreadsheets (SEMS ’15) in Florence, Italy. It is a position paper describing our aims to automatically extract business rules from spreadsheets. [Pre-print in Arxiv.org is here]
What are Business Rules?
In her book Business Rules Applied: Building Better Systems Using the Business Rule Approach, Barbara von Halle writes that a business rule is a “statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behavior of the business”. Thus the term business rule is somewhat an over-encompassing term used to imply all sorts of assertions that govern the operations of a business. A purist’s definition of business rule has a specific syntax associated with it and fixed method of framing the rules, however the essence remains same i.e. such rules are all the assertions that unambiguously states the actions or operations that are necessary to keep a business running. Therefore business rules are what that contain key business knowledge of an organization or a domain.
What we know about businesses and organizations?
We know that a vast majority of businesses and organizations use spreadsheets for their operations. Therefore it is needless to say that spreadsheets are gold mines of crucial business knowledge with plethora of business rules implemented inside them. Inside spreadsheets they are usually hidden behind cryptic formulas and the heavy presence of numerical data make them difficult to be discerned.
What we can do about it?
By extracting the business rules from spreadsheets we can reap the business knowledge embedded inside them and make it available in a better readable format. And we intend to do that in an automated way with a software tool.
What are the potential benefits of doing that?
1) High Level Analysis of Spreadsheets – Extracting business rules enables generation of documentation for spreadsheets at a higher abstraction level than the spreadsheets themselves. This facilitates the following:
a) Comprehension – It becomes easier for end-users, who are typically not programmers, to understand the structure and operation of large and complex spreadsheets helping them efficiently work with or modify such spreadsheets with reduced errors and mistakes.
b) Comparison – Comparing spreadsheets becomes possible in order to estimate whether they implement same or similar functionalities, or even are identical behavior-wise only differing in data values. The latter cannot be done for example by an application that compares spreadsheets in data and formula level.
c) Validation – Organizations using set of well-formed and pre-laid business rules can validate whether the spreadsheets created by their employees accurately implement those rules or if there are errors in the logical level.
2) Understanding of Organizational Business Rationale – Some organization may not have their business strategies well laid out in business rule format; yet vital business knowledge of experts working in the company is hidden in spreadsheets. Extracting this knowledge would help to form a clear picture of how that organization works and its structure.
3) Support for Migration – IT architects need to understand the business logic when migrating functionalities and computations implemented in spreadsheets into conventional software. Furthermore business analysts need to ensure that the IT architects understood it correctly. This can be achieved through knowledge extraction and an automated process would largely help in this regard.
4) Safe Re-use and Replication of Spreadsheets – Often spreadsheets are created on ad-hoc basis by experts in an organization to implement their unique strategies for certain scenarios. Over time such spreadsheets grow in size and complexity and are used by several employees for similar scenarios but with different data sets. Invariably the users are forced to employ the method of copy-paste to replicate the original spreadsheet and customize it according to their needs by manipulating data and formula. However this process has been found to be extremely error-prone. It is probably safer to re-generate spreadsheets from scratch using the blueprint or structure of the original spreadsheet instead of copy-pasting. Automated business rule extraction can facilitate such blueprint generation and thus make replications of spreadsheets safer.
What goal we have set?
We want to devise an algorithm and subsequently an application that can automatically extract business rules from spreadsheets. Once that is done, we want to find answers to the two following research questions:
RQ1: How accurate the automatically extracted business rules will be as compared to those extracted manually by domain experts and spreadsheet users?
RQ2: How efficient is the automatic extraction process compared to manually extracting business rules from spreadsheets?
We aim to do user-studies and controlled experiments in which we will compare the results of automatic and manual extraction of business rules from spreadsheets.
What a typical problem will look like?
Typical spreadsheets implement business rules to calculate results. For example in the above figure the cell E19 contains the formula SUM(E13:E18). From this formula our algorithm has to infer the business rule “Total earned revenue = Admissions+…+Other earned revenue”.
Mapping E13:E18 to Admissions…Other earned revenue is straightforward. However there is more to determine as the Total Earned Revenue is divided into columns for Last Year, Current Year, etc. Thus the mapping becomes two dimensional.
Furthermore a parser will reach three blank rows and an auxiliary header row (actuals, budget, etc.) before it reaches the “Year” column header row.
Making things even more challenging, the whole structure is repeated into vertical blocks viz. Earned Revenue, Private Sector Revenue.
When mapping the rule “Total private sector revenue=…” the parser will encounter formulas in the 19th row instead of reaching the column headers! Thus, same formula repeated both vertically (in blocks) and horizontally (in year columns), yet being distinct semantically, is a considerable challenge.
What are the principal difficulties?
Spreadsheets do not impose fixed structural uniformity with regards to layout. This is an inherent flexibility of spreadsheets which at the same time is an advantage and also a disadvantage. This makes the mapping between data and labels vary to a large extent from spreadsheet to spreadsheet. Towards extracting business rules, this will be a very crucial challenge to overcome.