The term ‘Big Data’ was coined relatively recently, implying that managing large volumes of data is something new; this couldn’t be further from the truth. Traditional IT centric organizations such as banks, insurers and government bodies have been dealing with very large data sets for many years. In fact, these organizations face an added complication: in the vast majority of cases, data governance legislation mandates that much of this information be retained for a specified number of years, depending on the data.
In the past this has been managed using proprietary technologies to archive and maintain these data sets, which has had the effect of locking organizations into those technologies. The costs associated with this model increase every year as more and more data must be archived in this way. In addition, the proprietary nature of these solutions makes it hard to integrate them with the open, cost effective technologies that are widely available today and that could help simplify the process.
New technologies and standards can significantly reduce the cost of archiving transactional data while keeping it fully accessible, in real time, through a variety of open source solutions.
Some organizations have an active archiving policy and have been following it for many years. In practice, however, every organization is archiving data, whether or not it has thought through the consequences. In some cases, archiving is achieved by storing the data in a data warehouse created for some other purpose, which ‘by default’ yields an ‘archived copy’; others simply let their databases grow and thus ‘archive’ the data in their transactional databases.
In the first case, organizations that actively archive already know the benefits and archive specifically to gain them. Where data warehouses are used as ‘archives’, many of those benefits are missed to some degree, while allowing databases to grow indefinitely misses all of the benefits of data archiving and is simply not sustainable. In the last two cases especially, archiving data this way costs far more than it needs to.
Attributes of Data to Be Archived
Consider the attributes of data that is generally archived:
Transactional systems must be robust, resilient and scalable, and so they run on bulletproof database and hardware technologies. These systems must be available 24/7, and the costs associated with them understandably reflect that. Explicit policies to archive data out of these systems therefore benefit an organization in several ways:
Physically archiving the data is perhaps only half the battle. Further issues around the archived data must also be dealt with:
The cost of storage has been dropping exponentially in recent years, yet many archiving solutions, either by design or in the way they have been implemented, depend on expensive proprietary hardware and software. Often these platforms are built to be as robust, resilient and scalable as the transactional platform itself, but such service levels are not required for archived data.
Because archived data is read only, there is no need for best-of-breed technologies to manage it. Low cost commodity hardware and open source software provide a cost effective platform on which to store it. Once written to the platform, the data can immediately be backed up, so that if the hardware fails for some reason, it can easily be restored from the backup.
There are further benefits for organizations wishing to store the data in public or private clouds. This again is cost effective, as cloud based servers can be created and left idle most of the time, paid for only when they need to be fired up for use. Where data residency is a concern, a private cloud based solution can address it; in addition, many of the larger cloud providers offer data residency guarantees as part of their service level agreements (SLAs).
These new technologies offer the ability to significantly reduce the cost of actually storing the archived data.
The Internet has changed the world since its introduction, and the World Wide Web Consortium (W3C) and other standards bodies have since delivered further standards that can be used in archiving solutions:
Using these standards offers the flexibility to deal with archived data in ways never envisaged with proprietary implementations.
The first element in implementing such an architecture is the normalization of the data feeds from the various back office databases. Normalization here means creating XML feeds that are directly consumable by the archiving engine regardless of the host code page (e.g. ASCII or EBCDIC), platform (e.g. Windows, z/OS or Solaris) or database (e.g. MS SQL Server, DB2, VSAM or Oracle). Creating an XML feed also allows a stand-alone style sheet (XSLT) to be created to represent the data in a meaningful way.
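To illustrate this normalization step, the sketch below (plain Python, with hypothetical record and code page names, and not Portus itself) converts a raw host row into a Unicode XML fragment, so that an EBCDIC mainframe row and an ASCII row produce an identical feed:

```python
import xml.etree.ElementTree as ET

def normalize_record(record: dict, encoding: str = "ascii") -> str:
    """Convert one source row into a normalized XML fragment.

    `record` maps column names to raw byte values as read from the host
    database; `encoding` names the host code page (e.g. "cp037" for
    EBCDIC, "ascii" for most distributed platforms).
    """
    row = ET.Element("record")
    for column, raw in record.items():
        field = ET.SubElement(row, "field", name=column)
        # Decode from the host code page so the feed is always Unicode.
        field.text = raw.decode(encoding) if isinstance(raw, bytes) else str(raw)
    return ET.tostring(row, encoding="unicode")

# An EBCDIC (code page 037) row and an ASCII row normalize identically.
ebcdic_row = {"ACCT": "12345".encode("cp037"), "NAME": "SMITH".encode("cp037")}
ascii_row = {"ACCT": b"12345", "NAME": b"SMITH"}
assert normalize_record(ebcdic_row, "cp037") == normalize_record(ascii_row, "ascii")
```

From this point on, the archiving engine never needs to know which host, code page or database the feed came from.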
This ensures that the archiving engine can use a standards based approach to define archiving policies on these XML feeds, regardless of the original source of the data.
Once policies have been defined, the implementation engine can archive the data to commodity hardware and software platforms, either on premise or in a private or public cloud, depending on data governance requirements. This is a cost effective mechanism for storing read only data. The active style sheet is archived alongside the data and so always remains compatible with it.
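One minimal way to keep an archived object permanently paired with its presentation is the standard `xml-stylesheet` processing instruction. The sketch below assumes archive objects are written as flat files with the XSLT stored alongside them; the function and file names are illustrative, not part of Portus:

```python
import tempfile
from pathlib import Path

def archive_object(archive_dir: Path, object_id: str,
                   xml_body: str, xslt_name: str) -> Path:
    """Write one archive object, prefixed with the standard xml-stylesheet
    processing instruction so the object always names the style sheet it
    was archived with (the XSLT file lives in the same directory)."""
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<?xml-stylesheet type="text/xsl" href="{xslt_name}"?>\n'
        f"{xml_body}\n"
    )
    path = archive_dir / f"{object_id}.xml"
    path.write_text(document, encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    # The style sheet is archived once, next to the objects it renders.
    (archive / "record.xslt").write_text("<!-- XSLT omitted for brevity -->")
    obj = archive_object(archive, "tx-001", "<record/>", "record.xslt")
    assert 'href="record.xslt"' in obj.read_text()
```

Because the reference travels inside the object itself, a browser or viewer opening the XML years later still finds the exact style sheet it was archived with.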
Once archived, again thanks to the standards in place, all of the data in the archive is available as REST accessible XML objects. An internal Google appliance or other crawler can build an index over these objects, enabling Google-like searches that make data in the archive easy to find. Once found, the data can be displayed using the style sheet archived with it.
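The crawl-and-search step can be sketched as a toy inverted index over the archive's REST object URLs; the URLs and record layout here are hypothetical, and a real crawler would fetch the objects over HTTP rather than from a dictionary:

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(objects: dict) -> dict:
    """Build an inverted index: each word appearing in an XML object's
    text content maps to the set of object URLs that contain it."""
    index = defaultdict(set)
    for url, xml_text in objects.items():
        root = ET.fromstring(xml_text)
        # Concatenate all text content, ignoring markup structure.
        text = " ".join(root.itertext())
        for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
            index[word].add(url)
    return index

# Two hypothetical archive objects, keyed by their REST URLs.
archive = {
    "/archive/2003/tx-001": "<record><field name='NAME'>SMITH</field></record>",
    "/archive/2003/tx-002": "<record><field name='NAME'>JONES</field></record>",
}
assert build_index(archive)["smith"] == {"/archive/2003/tx-001"}
```

A search then reduces to a dictionary lookup that returns the URLs of the matching objects, each of which renders itself through its archived style sheet.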
While this model could potentially be built by hand, Ostia’s Portus platform provides a simple, configuration based approach to enabling these feeds and, together with Ostia’s policy definition and archiving components, offers a configuration based (i.e. no coding required) end to end implementation of this architecture.
This solution offers the following benefits:
Ostia have many years of experience with core IT systems and understand their strengths and weaknesses. We also have extensive experience with traditional integration stacks and the models that have grown up around them, and can now provide adaptive integration solutions in the timeframes that agile projects demand. Ostia can do this because of the Portus data integration technology we have developed over the last 10 years. It has taken those 10 years to get to the front of today’s technology; we can help you stay there.