Data Analytics: Acquiring the Data Faster White Paper

Significantly reduce the time spent acquiring data for analytics projects

Management Overview

Analytics has undergone something of a revolution in recent years. Companies such as Google, Facebook and Amazon have shown the power of analytics simply by using the data they collect in the course of their everyday operations to improve how they do business and the customer experience they offer.

Advances in technology have also delivered cost-effective ways to process massive amounts of data, or ‘Big Data’, for analytics purposes, while there is now a myriad of ways to illustrate what the data means, leading to well-informed decisions. These advances have primarily been built around the new data generated by the Internet and social media platforms. However, organizations established before the Internet, such as banks, insurance companies and government bodies, have been collecting data for the last 50 years of their operation. Imagine what could be learned from that data.

Many are now suggesting that “data is the new oil”; if that is the case, the data these organizations have been collecting for 50 years is the premium-grade oil. It represents hard facts, as it generally comes from the systems of record upon which these organizations are run. While some analytics has been done on it in the past, this generally consisted of overnight batch reports delivered to someone’s desk for review. The world has moved on, and this data must become an integral part of today’s analytics landscape.

Analytics

The following is Wikipedia’s definition of analytics:

“Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.”

There are today literally hundreds of open source and proprietary technologies available to do the majority of the above and to present the results in many different ways. As with all technologies, some are more suitable than others for particular purposes. Data visualization is often a matter of personal preference, so the ability to use multiple technologies with the same data ensures that individuals can view it using the tools and visual styles they prefer.

The problem is that the first part of this definition is the difficult part. There is a general consensus that a significant majority of the time and cost of an analytics project is consumed in discovering and gaining access to the data in the first place. This can represent up to 80% of a typical project’s time and budget, leaving only 20% to achieve the real business objective of the project: the resultant analytics.

It is possible to reduce this 80% significantly, enabling analytics projects to be delivered far more quickly or allowing more time for a more thorough analysis of the data for the business.

Why is acquiring the data so hard?

The Googles and Facebooks of this world were ‘born in the cloud’ and have made the concept of big data analytics look easy because all of the data they collect is held in an open, standard fashion. This makes it straightforward to acquire the data from different sources using standard techniques and to process it easily, as it is, by definition, already normalized.

Take the simplest case of an organization with a reasonably uniform software environment running only Microsoft Windows: even here there are usually multiple databases in use. While there are ways to get at each of these databases, they differ, and each potentially requires its own client-side software on every machine that uses it. In addition, for security reasons, few organizations are willing to open ODBC/JDBC connections to their data in this way:

uniform data installation diagram
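Even in this ‘simple’ uniform case, the client-side differences are real. As a rough sketch (in Python, with hypothetical hostnames, credentials and table names), the following shows how two databases on the same Windows estate already require two different driver installations and two different connection conventions before any analysis can begin:

    # Minimal sketch: each database needs its own client driver and connection style.
    # Hostnames, credentials and table names below are hypothetical.
    import pyodbc      # needs the Microsoft ODBC driver installed on every client
    import cx_Oracle   # needs the Oracle client libraries installed on every client

    # SQL Server: ODBC connection string syntax
    mssql = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=winsrv01;DATABASE=claims;UID=analyst;PWD=secret"
    )

    # Oracle: a completely different driver and connection convention
    oracle = cx_Oracle.connect("analyst", "secret", "orasrv01/policies")

    # Queries and result handling must also be coded per database before any
    # analysis can start.
    for cursor, sql in ((mssql.cursor(), "SELECT TOP 5 * FROM claims"),
                        (oracle.cursor(), "SELECT * FROM policies WHERE ROWNUM <= 5")):
        cursor.execute(sql)
        print(cursor.fetchall())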

In traditional organizations, the data is likely to reside on a multitude of different platforms, in different code pages and in different databases. Access to the data required for analytics is therefore even more varied, depending on the platform, code page and technologies in use:

heterogeneous data installation diagram

More often than not, there is also a requirement to pull information from multiple locations across firewalls, and no organization will allow open JDBC/ODBC connections through a firewall.

So while it is possible to get at this data, it takes time (up to 80% of a project) for the following reasons:

  • Different sets of software must be configured for each database on each platform.
  • Different client driver software must be installed on each client to use the data.
  • The data comes from different sources and must be normalized so that it is comparable at the point of analysis.
  • If data is required from outside the current network, generally an offline batch process will be required to extract and move the data to the point of consumption leading to timing and other complications.

So can we not learn from the new world about how best to provide data for analytics?

New standards and their impact

The Internet has changed the world since its introduction, and the body that has overseen much of this change, the World Wide Web Consortium (W3C), along with others, has delivered further related standards that are key:

  • XML offers the ability to represent data from multiple platforms and technologies in a single standard way.
  • JSON is an alternative to XML that does a similar job but is preferred by some application developers.
  • REST- and SOAP-based services offer the ability to access and process data from virtually any processing technology, whether open source or proprietary.
  • SSL, along with LDAP access rules, ensures that data is only visible to authorised individuals or groups.
  • Standard encryption algorithms ensure that data may be fully encrypted on the wire or prior to being written to disk.
  • Linked Open Data offers the potential to connect data from multiple disparate sources automatically to form a full picture, assisting in the discovery process prior to analysing the data.

Using these new standards offers the ability to access data more quickly and cleanly than ever before.
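As a simple illustration of the first two standards above, the short Python sketch below (the record fields and values are hypothetical) renders the same customer record as both JSON and XML; any tool that understands either representation can consume it, regardless of the platform or database the data originally came from:

    # Minimal sketch: one normalized record rendered as both JSON and XML.
    # The field names and values are hypothetical.
    import json
    import xml.etree.ElementTree as ET

    record = {"customerId": "C1042", "name": "J. Murphy", "balance": "1520.75"}

    # JSON representation
    print(json.dumps({"customer": record}, indent=2))

    # XML representation of the same record
    customer = ET.Element("customer")
    for field, value in record.items():
        ET.SubElement(customer, field).text = value
    print(ET.tostring(customer, encoding="unicode"))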

Using standards to access the data

The key to unlocking this data for analytics processing is to make it available in a similar way to that used by the Googles of this world: through standard data representations and standard access mechanisms. In general, this means normalizing the data into an XML or JSON representation and making it available using REST or SOAP messaging protocols.

Normalization in this sense means the creation of XML or JSON feeds from the various data sources that are directly consumable by open technologies, regardless of the host data’s code page (e.g. ASCII or EBCDIC), platform (e.g. Windows, z/OS or Solaris) or data store (e.g. MS SQL Server, DB2, VSAM or Oracle).

normalization of data diagram
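To make this concrete, the following is a minimal sketch, in Python and assuming a hypothetical fixed-width record layout, of the kind of transformation such a feed performs: a raw EBCDIC record from a mainframe dataset is decoded, its fields are named and the result is emitted as JSON that open technologies can consume directly:

    # Minimal sketch of normalization: a fixed-width EBCDIC record becomes a JSON document.
    # The record layout (field names, offsets) is hypothetical.
    import json

    # Raw bytes as they might arrive from a z/OS dataset (EBCDIC code page IBM037).
    raw = "C1042     J. Murphy           0152075".encode("cp037")

    def normalize(record: bytes) -> str:
        text = record.decode("cp037")              # EBCDIC -> Unicode
        return json.dumps({
            "customerId": text[0:10].strip(),      # columns 1-10
            "name":       text[10:30].strip(),     # columns 11-30
            "balance":    int(text[30:37]) / 100,  # amount in cents, columns 31-37
        })

    print(normalize(raw))  # {"customerId": "C1042", "name": "J. Murphy", "balance": 1520.75}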

The representations are not restricted to XML and JSON (which are used here for illustration) but can also be delivered as CSV or RDF files, or other formats as required. Using these representations, the feeds are usable by all the common analytics technologies:

usage by multiple platforms diagram
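As an example of how directly such feeds can be consumed, the sketch below (Python, with a hypothetical feed URL and credentials) pulls a JSON feed over REST into a pandas DataFrame ready for analysis; the same HTTP call could equally be made from R, a BI tool or a big data pipeline:

    # Minimal sketch: consuming a normalized JSON feed from an analytics environment.
    # The feed URL and credentials are hypothetical.
    import pandas as pd
    import requests

    response = requests.get(
        "https://feeds.example.com/customers",   # REST endpoint exposing the normalized feed
        auth=("analyst", "secret"),              # standard HTTP authentication over SSL
        timeout=30,
    )
    response.raise_for_status()

    # The records go straight into a DataFrame for analysis or visualization.
    df = pd.DataFrame(response.json())
    print(df.describe())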

Ostia’s implementation of this architecture

While this model could potentially be built by hand, Ostia’s Portus platform provides a simple, configuration-based approach to enabling these feeds and making them available to all of those technologies:

Portus provided infrastructure diagram

This solution offers the following benefits:

  • Configuration-based solution for creating standards-based feeds using REST and SOAP and delivering data as XML, JSON, CSV etc.
  • Reusable feeds no matter where the data resides.
  • Fully secured feeds using standard security policies to ensure that only authorised personnel can access the data.
  • A service-based approach ensures that feeds may be accessed across firewalls where this is authorised.
  • A standards-based approach means that any tool can be used to access the feeds and create visualizations around the data. People can create the visualizations that suit them!
  • Quicker and simpler to implement than a hand-built solution.
  • Easier to maintain into the future.
  • Easier to reuse for future projects.

How can Ostia help?

Ostia has many years of experience with core IT systems and understands their strengths and weaknesses. We also have extensive experience with traditional integration stacks and the models that have built up around them, and we now offer the ability to provide adaptive integration solutions in the timeframe required by agile projects. Ostia is able to do this because of the Portus data integration technology we have developed over the last 10 years. It has taken 10 years for Ostia to get to the front of today’s technology; we can help you stay there.