Data Archiving: New Standards White Paper

Management Overview

The term ‘Big Data’ was coined relatively recently, implying that managing a large amount of data is something new. This could not be further from the truth. Traditional IT-centric organizations such as banks, insurers and government bodies have been dealing with very large data sets for many years. These organizations also face an added complication: in the vast majority of cases, data governance legislation mandates that much of this information be retained for a specified number of years, depending on the type of data.

In the past, this has been managed by using proprietary technologies to archive and maintain these data sets, which has locked organizations into those technologies. The costs associated with this model increase every year as more and more data must be archived in this way. In addition, the proprietary nature of these solutions makes it hard to integrate them with the open, cost-effective technologies that are widely available today and that can help to simplify the process.

New technologies and standards that are now available can significantly reduce the cost of archiving transactional data while making it fully accessible in real time and through a variety of open source solutions.

Data Archiving

Some organizations have an active policy of archiving data and have been doing this for many years. In practice, however, every organization archives data in some form, often without thinking through the consequences. In some cases, archiving is achieved by storing the data in a data warehouse created for some other purpose and thus having an ‘archived copy’ by default, while other organizations simply let their databases continue to grow and thus ‘archive’ the data in their transactional databases.

In the first case, where organizations actively archive data, they already know the benefits and archive specifically to gain them. Where data warehouses are used as ‘archives’, many of the benefits are missed to some degree. The last case, where databases are simply allowed to grow, misses all of the benefits of data archiving and is not sustainable. In the last two cases at least, archiving data in this way costs far more than it needs to.

Attributes of Data to be archived

Consider the attributes of data that is generally archived:

  • It is a record of something that has happened and therefore the record itself will never change in the future.
  • It is unlikely to be needed for day-to-day operations and in most cases is kept because it must be available if someone asks to see it in the future.
  • It does not have to be available in real time.

Benefits

Transactional systems must be robust, resilient and scalable, and therefore rely on bulletproof database and hardware technologies. These systems must be available 24/7, and their costs understandably reflect that. Explicit policies to archive data out of these systems therefore benefit an organization in several ways:

  • Maintaining read-only data in an active transactional database is a very expensive way to store data that could be archived.
  • This read-only data increases index sizes and can degrade the performance of queries that use those indexes.
  • This read-only data is backed up every time the database is backed up, leading to longer backup times, larger backups and longer batch windows, at a time when batch windows are shrinking and many organizations wish to eliminate them completely.
  • The archived data can potentially be used for analytic processing without impacting the transactional systems or any data warehousing activity.

Additional Issues

The physical archiving of data is perhaps only half the battle. There are further issues that must be dealt with around the data being archived:

  • Locating information from the archived data can itself be a challenge due to the length of time that may have elapsed since data was archived and the amount of archived data.
  • The application that made sense of the data may have been retired, leaving only the raw data, which can be difficult to interpret a number of years after it was archived.
  • The structure of the data can also change over time, for example through the addition of new columns. Showing different versions of the archived data sensibly is problematic even if the application that displays the data is still available, as it is likely to understand only the newest structure.

New technologies and their impact

The cost of storage has fallen steeply over the past number of years, yet many archiving solutions, either by design or in the way they have been implemented, depend on expensive proprietary hardware and software. Often, these platforms are designed to be as robust, resilient and scalable as the transactional data platform, but such service levels are not required for data that is being archived.

As the archived data is read-only, there is no need for best-of-breed technologies to manage it. Low-cost commodity hardware and freely available software offer a cost-effective platform on which to store the archived data. Once written to the platform, the data can immediately be backed up so that, if the hardware fails for some reason, it can easily be restored from the backup copy.
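The write-once, back-up-immediately approach described above can be illustrated with a minimal Python sketch. It assumes archived objects are plain files on local disk; the function names (`sha256_of`, `archive_and_backup`, `restore`) and the directory layout are hypothetical and are not taken from any product described in this paper.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_and_backup(payload: bytes, name: str,
                       archive_dir: Path, backup_dir: Path) -> str:
    """Write a payload to the archive, copy it to the backup, verify both copies."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    backup_dir.mkdir(parents=True, exist_ok=True)
    archived = archive_dir / name
    archived.write_bytes(payload)
    backup = backup_dir / name
    shutil.copy2(archived, backup)
    checksum = sha256_of(archived)
    if sha256_of(backup) != checksum:
        raise OSError(f"backup of {name} does not match the archived copy")
    return checksum


def restore(name: str, archive_dir: Path, backup_dir: Path) -> Path:
    """Recover an archived file from its backup copy, e.g. after a hardware failure."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    restored = archive_dir / name
    shutil.copy2(backup_dir / name, restored)
    return restored
```

Because the data never changes after it is written, a single checksum recorded at archive time is enough to confirm later that a restored copy is identical to the original.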

There are further benefits for organizations wishing to use public or private clouds to store the data. This again is a cost-effective way to store the data, as cloud-based servers can be created, left idle most of the time and paid for only when they need to be started up for use. Where data residency is an issue, a private cloud-based solution can address this concern. In addition, many of the larger cloud providers offer guarantees on data residency as part of their service level agreements (SLAs).

These new technologies offer the ability to significantly reduce the cost of actually storing the archived data.

New standards and their impact

The Internet has changed the world since its introduction, and the bodies that have overseen this change, the World Wide Web Consortium (W3C) and others, have delivered further related standards which can be used in archiving solutions:

  • XML offers the ability to represent data from multiple platforms and technologies in a single, standard way.
  • XSLT makes it easy to create representations of the data that will still make sense in the future, whether or not the original application has been retired.
  • REST- and SOAP-based services offer the ability to access and process the data from virtually any processing technology, whether open source or proprietary.
  • SSL, along with LDAP access rules, ensures that data is visible only to authorised individuals or groups.
  • Standard encryption algorithms allow data to be fully encrypted on the wire or prior to being written to disk.
  • Web crawling engines such as that used by Google can search through and catalogue all of the data in the archive, enabling private ‘Google like’ searches that make data easier to locate when it must be retrieved.
  • Linked Open Data offers the potential to connect data from multiple disparate sources automatically to form a full picture.

Using these new standards offers the flexibility to deal with archived data in ways never envisaged with proprietary implementations.

Reference Architecture to achieve this

The first element in the implementation of such an architecture is the normalization of the data feeds from the various back office databases. Normalization in this sense means the creation of XML feeds that are directly consumable by the archiving engine regardless of the host data code page (e.g. ASCII or EBCDIC), platform (e.g. Windows, z/OS or Solaris) or database (e.g. MS SQL Server, DB2, VSAM or Oracle). Creating an XML feed also allows a stand-alone style sheet (XSLT) to be created that represents the data in a meaningful way.

[Diagram: data normalization]

This ensures that the archiving engine can take a standards-based approach, using these XML feeds to define archiving policies on the data regardless of its original source.
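As a rough illustration of what normalization means here, the following Python sketch turns the same fixed-width record, encoded once in EBCDIC and once in ASCII, into identical XML. The record layout, field names and source labels are invented for the example; `cp037` is one of the EBCDIC code pages shipped with Python's standard codecs. This is a sketch under those assumptions, not a description of how any particular product builds its feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical fixed-width record layout: (field name, start offset, end offset).
LAYOUT = [("account", 0, 8), ("amount", 8, 16), ("date", 16, 24)]


def record_to_xml(raw: bytes, codec: str, source: str) -> ET.Element:
    """Decode one fixed-width record and return it as a <record> element."""
    text = raw.decode(codec)  # the code page is handled here, once
    record = ET.Element("record", {"source": source})
    for name, start, end in LAYOUT:
        ET.SubElement(record, name).text = text[start:end].strip()
    return record


def build_feed(records) -> str:
    """Build one XML feed from (raw bytes, codec, source label) tuples."""
    feed = ET.Element("feed")
    for raw, codec, source in records:
        feed.append(record_to_xml(raw, codec, source))
    return ET.tostring(feed, encoding="unicode")


line = "00012345  100.5020140131"
feed_xml = build_feed([
    (line.encode("cp037"), "cp037", "zos-vsam"),      # EBCDIC host record
    (line.encode("ascii"), "ascii", "windows-sql"),   # ASCII host record
])
```

The two input byte strings differ, but the resulting `<record>` elements carry the same field values, which is what lets downstream components ignore the original platform.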

[Diagram: defining archiving policies]

Once policies have been defined, the implementation engine can archive the data to commodity hardware and software platforms, either on premises or in a private or public cloud, depending on data governance requirements. This is a cost-effective mechanism for storing this read-only data. The active style sheet for the data is archived with the data and is thus always compatible with the data alongside which it was archived.
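To make the idea of a policy concrete, here is a minimal Python sketch of one possible shape for it. The `ArchivePolicy` fields, the `<payment>` record type and the style sheet file name are all hypothetical; the point is only that a policy selects records from the XML feed by age and that each archived entry carries the style sheet that was active when it was archived.

```python
from dataclasses import dataclass
from datetime import date, timedelta
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class ArchivePolicy:
    record_type: str         # element name in the feed the policy applies to
    archive_after_days: int  # age at which a record leaves the transactional store
    retain_years: int        # how long the archived copy must be kept
    stylesheet: str          # style sheet active at archive time


def is_due(policy: ArchivePolicy, record_date: date, today: date) -> bool:
    """A record is due once it is at least archive_after_days old."""
    return today - record_date >= timedelta(days=policy.archive_after_days)


def select_for_archive(policy: ArchivePolicy, feed: ET.Element, today: date) -> list:
    """Return archive entries for every record in the feed that the policy covers."""
    entries = []
    for record in feed.findall(policy.record_type):
        # Assumes each record has a <date> child in ISO format (YYYY-MM-DD).
        record_date = date.fromisoformat(record.findtext("date"))
        if is_due(policy, record_date, today):
            entries.append({
                "xml": ET.tostring(record, encoding="unicode"),
                "stylesheet": policy.stylesheet,  # stored with the data
                "archived_on": today.isoformat(),
                "retain_until_year": today.year + policy.retain_years,
            })
    return entries
```

Storing the style sheet reference in every entry means that a later change to the record structure, with a new style sheet version, does not affect how older entries are displayed.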

[Diagram: archiving based on policies]

Once the data is archived, and again thanks to the standards in place, all of the data in the archive is available as REST-accessible XML objects. This allows an internal Google or other crawler to build an index over the archive, enabling ‘Google like’ searches so that data can be found easily. Once found, the data can be displayed using the style sheet archived with it.
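The indexing step can be pictured with a small Python sketch of an inverted index over archived XML objects. This is a deliberately simplified stand-in for what a real crawler or search engine does; the URLs and record contents are invented, and the archive is modelled as an in-memory dictionary rather than a REST service.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokens(text: str) -> set:
    """Split text into lower-case alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(archive: dict) -> dict:
    """Map each token to the set of archive URLs whose XML text contains it."""
    index = defaultdict(set)
    for url, xml_text in archive.items():
        root = ET.fromstring(xml_text)
        for token in tokens(" ".join(root.itertext())):
            index[token].add(url)
    return index


def search(index: dict, query: str) -> list:
    """Return the URLs that contain every token in the query, sorted."""
    terms = tokens(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical archive contents keyed by REST-style URL.
archive = {
    "/archive/payments/1": "<payment><payee>Acme Ltd</payee><amount>100.50</amount></payment>",
    "/archive/payments/2": "<payment><payee>Acme Holdings</payee><amount>75.00</amount></payment>",
}
index = build_index(archive)
matches = search(index, "Acme Holdings")
```

Because every archived object is XML reachable by URL, the index needs no knowledge of the original database or application; it works purely on the text content of each object.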

[Diagram: locating data using Google techniques]

Ostia’s implementation of this architecture

While this model could be built by hand, Ostia’s Portus platform provides a simple, configuration-based approach to enabling these feeds. In conjunction with Ostia’s policy definition and archiving components, it offers a configuration-based (i.e. no coding required), end-to-end implementation of this architecture.

[Diagram: infrastructure provided by Portus]

This solution offers the following benefits:

  • Configuration based solution for archiving your transactional data.
  • Common solution regardless of platform or database technology.
  • All data archived to a single common target system in a standard, common format.
  • The representation of the data is archived with the data, so the data remains viewable regardless of the version of its structure and even if the original application that interpreted it has been retired.
  • Archived data stored in cost effective commodity hardware and software components.
  • Standards based approach means that any tool can be used to access the archive to locate data or to create visualizations around the data.

How can Ostia help?

Ostia have many years of experience with core IT systems and understand their strengths and weaknesses. We also have extensive experience with traditional integration stacks and the models that have built up around them, and can now provide adaptive integration solutions in the timeframe required for agile projects. We are able to do this because of the Portus data integration technology we have developed over the last 10 years. It has taken 10 years for Ostia to get to the front of today’s technology. We can help you stay there.