A New Way to Compare and Reorganize Data

In real-world daily business routines, it is common that data that comes from different sources is of the same structure. Sometimes each set of data is independent and there isn’t any overlapping, like the sales data each branch office exports from their own database. Other times data overlaps heavily. In a common complete business process, it is most probably that all systems and sections input data based on their store of data. To compare the overlapped data and find and correct the possible corrupt data, we need the “auto reconciliation” operation.

Generally such a reconciliation system will traverse records of a certain set of data by loop and match them with the records one by one in another set of data. The coding logic is highly repeatable. But as reconciliation processes use different key words, and because of the change of the query goal, we often need to rewrite a script, leading to cost-inefficient development and maintenance.

esProc makes reconciliation processes intuitive and convenient by offering truly set-oriented operations. To match and reorganize data, two functions – sort() and merge() – will suffice.

 

To illustrate this, we use a simple case of merging two datasets of sales data.

 

The following two files – old.csv and new.csv – contain the predictive sales data and the actual sales data respectively. Both consist of username, selling date, sales value and sales count. The business goal is to find and analyze the newly-added data, the deleted data and the updateddata by using the userName and date as the key words, which are also called logical primary keys.

 

 

Old.csv

New.csv

 

1

2

3

4

5

6

7

8

9

userName,date,saleValue,saleCount

Rachel,2015-03-01,4500,9

Rachel,2015-03-03,8700,4

Tom,2015-03-02,3000,8

Tom,2015-03-03,5000,7

Tom,2015-03-04,6000,12

John,2015-03-02,4000,3

John,2015-03-02,4300,9

John,2015-03-04,4800,4

userName,date,saleValue,saleCount

Rachel,2015-03-01,4500,9

Rachel,2015-03-02,5000,5

Ashley,2015-03-01,6000,5

Rachel,2015-03-03,11700,4

Tom,2015-03-03,5000,7

Tom,2015-03-04,6000,12

John,2015-03-02,4000,3

John,2015-03-02,4300,9

John,2015-03-04,4800,4

The second and third lines in new.csv are the newly-added records, which are probably the newly-closed deals. The fourth line of records has been modified, which possibly means a change of the sale price. The third line in old.csv is the deleted record, which is maybe a cancelled order.

 

Here we omit the code written in the conventional logic for matching, and take a look at how esProc deals with this:

A

B

1

=file(“d:\\old.csv”).import@t(;”,”)

=file(“d:\\new.csv”).import@t(;”,”)

2

=A1.sort(userName,date)

=B1.sort(userName,date)

3

=new=[B2,A2].merge@d(userName,date)

 

4

=delete=[A2,B2].merge@d(userName,date)

 

5

=diff=[B2,A2].merge@d(userName,date,saleValue,saleCount)

6

=update=[diff,new].merge@d(userName,date)

result update

 

Yes, that’s it. No loop traversal. No optimization of query algorithm. And if the matching key words are changed, we just need to change a few lines. Now let’s go over the code:

       A1,B1: Import the files respectively using the comma as the separator to generate two raw datasets. esProc supports importing data from files of other formats or from the data table in a database.

       A2,B2: Use the sort() function to sort each data set by the key words, or the logical primary keys, to generate a new data set. The new sets will be of use in the subsequent steps in the merge() function.

       A3: Find the newly-added records, which are those in which both the key words userName and date are in set B2 but not in set A2. This is the difference operation of the set operations, which is specified by the function option @d. We also have @u for union operation and @i for intersection operation. We get a new set named “new”:

      

       A4: Find the deleted records, which are those in which both the key words userName and date are in set A2 but not in set B2. Note that the order of A2 and B2 in the bracket before the merge() function is different. We get a new set named “delete”:

       

       A5: Like A3, this also performs a difference operation between B2 and A2, with all fields used as the key words. The result is all the changed records, including updated ones and newly-added ones. We get a new set “diff”:

       A6: Remove the newly-added records, which are the “new” set, from the “diff” set to get the updated records. Below is the “update” set:

 

With the introduction of set operations, the newly-added, the deleted and the updated acquire an intuitive meaning. That is the differences of the old and the new sets. These differences can be conveniently represented through corresponding set operations.

 

The final result can be viewed from the IDE where the computation is performed or output to a file using the file handling function, or returned to a Java program or a reporting tool via JDBC. In the above script, B6 shows a simple method of returning the reconciliation result as result set to another system module. Below is the example of the result set used in a Java program.

 

       // establish a connection via esProc JDBC

       Class.forName(“com.esproc.jdbc.InternalDriver”);

       con= DriverManager.getConnection(“jdbc:esproc:local://”);

       // call esProc script, whose name is test and that can accept parameters

       st =(com.esproc.jdbc.InternalCStatement)con.prepareCall(“call test()”);

       st.execute();//execute esProc stored procedure

       ResultSet set = st.getResultSet();//get the result

The way esProc compares data touches the essential meaning of “data differences”, which are actually the differences of sets. In fact, data differences widely exist in each system and between systems, ranging from the huge account data in bank systems and network operator systems to the duplication detection and edition comparison in personal file systems. But after generating the to-be-compared data sets and defining the key words for matching, we can reorganize data by taking good advantage of the set operations.

http://www.datasciencecentral.com/xn/detail/6448529:BlogPost:711488