Documenting the Modern Day Data Warehouse

Engaging Data Explains:


When you’re operating a modern-day data warehouse, documentation is simply part of the job. It is rarely the easiest or most logistically straightforward part of the process, but it is one of the most important: documentation is invaluable to the continued development, expansion, and enhancement of a data warehouse. It’s therefore worth understanding everything that adequate documentation entails, so that your data warehouse processes run smoothly.

Understanding your Audience

One of the first things to understand is who you are compiling the documentation for. Support, developers, data visualisation experts, and business users could all be possible recipients. Before you answer this question, you really need to fully understand the way that your organisation operates, and open the lines of communication with the appropriate departments.

A two-way dialogue will be productive in this ongoing process. This communication will then help ensure that you keep the documents in line with the design. This is vitally important, as any conflict between the two can render the whole process less constructive than it should be.

And it’s especially vital considering how fast documentation moves nowadays. Everything has gone online and is Wiki-based. Whether it’s Confluence, SharePoint, or Teams, businesses are producing all sorts of Wiki documents with the intention of sharing important information. These shareable documents are updated with increasing regularity, so it is important to get your strategy in place before you begin.

Different approaches to data warehouse design can also affect how long a document stays live before it needs updating. If you are lucky enough to make weekly changes to your data warehouse, you will be making incremental changes to the documentation just as often. Development teams can end up spending hours updating the documentation rather than doing what they are good at: developing data solutions! Naturally, minimising this where possible is always preferable.

Self-Service Business Intelligence

Documentation is also crucial in self-service business intelligence. Integrating private and local data into existing reports, analyses or data models requires accurate documentation. This data can be drawn from Excel documents, flat files, or a variety of external sources.

By creating self-service functionality, business users can quickly integrate data into what can often be vital reports. Local data can even be used to extend the information delivered by data warehousing, which will limit the workload that is inevitably incumbent on data management. The pressure on business intelligence can be quite intense, so anything that lessens the load is certainly to be welcomed.

Another important benefit of documentation is that it reduces the number of questions typically directed at the IT and data warehousing teams. Anyone who works in IT knows only too well how much pressure internal and external enquiries can heap upon them. Again, anything that reduces this is welcome.

The data warehouse team also has huge responsibility within any organisation. They are required to produce a vast amount of information for front-end business users, and getting documentation right can certainly assist with this process.

Importance of Transparency

One aspect of documentation that can sometimes be overlooked is transparency. This applies at every level of an organisation: sharing everything related to the documents is absolutely vital. Once this level of transparency is in place, people who understand the data deeply can improve the documentation, or suggest changes to the Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT) processes, if this is deemed necessary.

Conversely, it’s also important to understand that not all technology is suitable for documentation. As much as businesses and organisations would love this process to be completely holistic, this is not always possible.

Thus, packages such as Power BI, QlikView and Qlik Sense, and even Microsoft’s trusty Excel, are not necessarily ready to be documented. These software packages can use data, but often cannot produce a document set that explains how the data is being used, and for what purpose. Recently, Power BI has taken steps to help with data lineage, but this remains better suited to IT teams than to business users.

Attempting to document data across multiple technologies is tricky, but Wikis can provide IT teams with the ability to collate all of this information into a central hub of knowledge, making access much more logistically convenient.

Conclusion

Ultimately, IT departments, data warehousing teams, and report developers should all be encouraged to produce documentation that contributes to the overall aims of their organisations. Anything excessively technical is not good enough for modern business requirements, especially considering the importance of communication, and of ensuring that everyone within an organisation is acquainted with as much vital data as possible.

Modern-day technology makes this goal a reality, and this means that it is increasingly an expectation of end-users. Failing to prepare properly in this area could indeed mean preparing to fail, as organisations will simply have failed to meet the compelling desires of the market. It is therefore vital for documentation to be dealt with diligently.

Getting this piece right will go a long way towards helping with data governance!


If you would like to know more about how Engaging Data help companies to automate documentation, please contact us on the below.



Big Data and DataVault

Engaging Data Explains:


Knowing how and where to find the needle more easily, and where in the specific haystack it resides

Big Data has been a hot potato topic for more than a few years now, and this phenomenon will play a central role in the future of commerce. Collecting, collating and comprehending Big Data will no longer be a matter of commercial interest; it will instead increasingly become a commercial imperative.

It should come as no surprise then that investment in technologies related to Big Data is already becoming almost ubiquitous. A report from NewVantage Partners, which collected executive perspectives from 60 Fortune 1000 companies, found that 97% of them invest in Big Data and AI initiatives. NewVantage also discovered that the vast majority of this investment (84%) was focused on deploying advanced analytics capabilities to enable business decision making.

Big Understatement

And when we use the term ‘Big Data’, it’s reasonable to conclude that ‘big’ is an understatement! For example, in 2018, Internet users generated approximately 2.5 quintillion bytes of data every day. That’s roughly 912 quintillion bytes every year! And 90% of this data has been generated in just the last five years. The growth curve is exponential.
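The yearly figure quoted above follows directly from the daily one. As a quick sanity check, here is a minimal sketch in Python, assuming a quintillion means 10^18 and a 365-day year:

```python
# Daily volume quoted in the text: 2.5 quintillion bytes (1 quintillion = 10**18).
BYTES_PER_DAY = 25 * 10**17
DAYS_PER_YEAR = 365  # assumption: a non-leap year

bytes_per_year = BYTES_PER_DAY * DAYS_PER_YEAR
quintillions_per_year = bytes_per_year / 10**18

print(f"{quintillions_per_year:.1f} quintillion bytes per year")  # prints 912.5
```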

Thus, it’s one thing to recognise the importance of Big Data, and quite another to be prepared for it. We’re talking about a veritable avalanche of information, and in many cases utterly unstructured information. Indeed, Forbes noted in 2019 that 95% of businesses cite the need to manage unstructured data as a problem for their business, which, given the sheer scale of Big Data, is hardly surprising. Making the most of Big Data is not so much searching for a needle in a haystack; it is more like looking for a needle in a universe composed entirely of haystacks.

This reality means that implementing the best business intelligence solutions will become essential. Dealing with the sheer volume of Big Data will demand this. And data warehousing is one element of this process that will be critically important. The analytical qualities delivered by this aspect of the overall Big Data management process will prove critical in the success of the efforts of companies to benefit from the information explosion.

Data Vault 2.0

That’s where Data Vault comes in. Data Vault 2.0 comprises a raft of sophisticated architecture and techniques that enable businesses to store both current and historical data in a single, easily accessible location, and to create analytics based on this information. Data Vault is effectively a design methodology for large-scale data warehouse platforms, ensuring that Big Data is dealt with more quickly, more efficiently, and more effectively.

Data Vault offers several advantages over competing approaches. The first is that it’s possible to convert any system to Data Vault definitions. Existing objects can be translated to Data Vault entities, and every single item will have a corresponding match in the new Data Vault architecture. Every main definition can then be mapped to a hub, and every relationship between them to a link. This makes the whole operation more flexible and user-friendly.
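To make the hub and link idea a little more concrete, here is a minimal sketch in Python rather than a definitive implementation. The record layout, the `to_vault` helper and the sample order are hypothetical. Hashing business keys is a common Data Vault 2.0 convention, and MD5 is used here purely as an illustrative assumption. Descriptive attributes, which would normally sit in satellites, are left out to keep the example short.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic key from one or more business keys."""
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Hub:
    """One row per unique business key, e.g. a customer number."""
    hub_key: str
    business_key: str
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class Link:
    """One row per relationship between two or more hubs."""
    link_key: str
    hub_keys: Tuple[str, ...]
    load_date: datetime
    record_source: str


def to_vault(order: dict, source: str, loaded: datetime):
    """Translate a (hypothetical) source order row into two hubs and one link."""
    customer = Hub(hash_key(order["customer_no"]), order["customer_no"], loaded, source)
    order_hub = Hub(hash_key(order["order_no"]), order["order_no"], loaded, source)
    placed = Link(
        hash_key(order["customer_no"], order["order_no"]),
        (customer.hub_key, order_hub.hub_key),
        loaded,
        source,
    )
    return customer, order_hub, placed


# Example usage with made-up values.
example = to_vault(
    {"customer_no": "C001", "order_no": "SO-42"},
    source="crm",
    loaded=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
for entity in example:
    print(entity)
```

Because each hub is keyed only on its business key, a second source system that knows the same customer number would land on the same hub row, which is what makes linking sources together straightforward.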

Another significant advantage of Data Vault is its enhancement of agility. This is particularly important, as the ability of network software and hardware to automatically control and configure itself makes it easier to deal with the almost unfathomable scope of Big Data.

Smaller Pieces

Data Vault makes it possible to divide a system into smaller pieces, with each individual component available for separate design and development. This means every constituent part of the system can have its own definitions and relationships and that these can be combined at a later date by related mapping. This makes it possible to develop a project steadily yet still see instant results. It also makes managing change requests much more straightforward.

Another asset of the Data Vault approach is that it applies to numerous different systems. Separate sources can be transformed into Data Vault entities without any laborious procedures being involved. This is particularly advantageous in the contemporary climate, as almost every enterprise system relies on several different data types from various data sources.

The Data Vault modelling technique is thus adaptable to all types of sources, with a minimum of fuss. This makes it much more feasible to link different data sources together, making analysis more joined-up and holistic. It is well-known that being the entity that is the most adaptable to change is vital across a wide variety of niches, and this applies in the rapidly evolving data analysis environment.

But possibly the most compelling reason to choose Data Vault is that our offering provides companies with a method of standardisation. With Data Vault implemented, companies can standardise their entire data warehouse (DWH) system. This standardisation enables members of the company to understand the system more easily, which is undoubtedly advantageous considering the innate complexity of this field.

Meeting the Needs

It is commonplace for complex and sophisticated solutions to be delivered to business users, which nevertheless fail to understand and adapt to the company’s actual requirements in that area. Everyone wants to show off their fancy piece of kit, but often developers aren’t as keen to listen! This can manifest for a variety of reasons. Still, the important thing to note is that Data Vault is designed to meet the requirements of the business, rather than requiring a business to reorganise itself to comply with the needs of the package.

This is important at a time when the dynamic complexity associated with data is escalating. Enterprise data warehouse systems must provide accurate business intelligence and support a variety of requirements. This has become a critical reality in a business marketplace in which the sheer volume of data being generated is overwhelming.

Data Vault solves these problems with a design methodology that is ideal for large scale data warehouse platforms. With an approach that enables incremental delivery and a structure that supports regular evolution over time, Data Vault delivers a standard for data warehousing that elevates the whole industry.

Data Masking

Engaging Data Explains:


The Data Masking Challenge 

One of our clients had an interesting data masking requirement: how to mask Production data to comply with GDPR and IT security policies. The data needed to be human-readable, enabling the development and testing teams to create a data feed for a new Client Portal system. However, the core system could not mask the data, only scramble or obfuscate it. The core system was extremely complex, having been built & expanded over 10 years. It was difficult to understand the system & how its data was stored because documentation didn’t exist!

Furthermore, architectural constraints meant there was not enough storage space to hold a second (in-line) database with masked production data.

Is This A Common Problem?

The more companies we speak to, the more complex or complicated situations we find. In our experience, a pattern is emerging in the common problems or requirements:

  • Old Tech – Ageing trading platforms/core systems or sources of data often don’t have the functionality to mask data. Those that do, or that have extensions/plug-ins to mask the data, often take a long time to process or do not have the flexibility to fit every scenario.
  • Quick turnaround – Near real-time data is nice to have, but not always a real requirement.
  • Specific/varied masking – Different types of masking are needed: obfuscated, scrambled, encrypted, or human-readable & randomised.
  • Storage – Limitations on storage or infrastructure make it difficult to store an entire copy of production.
  • Cost – Large database providers offer alternative tools with the same effect, but these command a very large price tag.
  • Time – Developers can hand-crank specific solutions, which take a reasonable amount of time to develop but much longer to test to ensure they are working as expected.
  • Doing the right thing – Most clients want to do the right thing to meet regulatory requirements but see this as a complicated housekeeping chore; they recognise the risk but choose to ignore it.

Engaging Data Discovery

We had a lot of options to solve this problem, but selected Redgate Data Masker and here is why:

  • After a review of the underlying data structure, we concluded it was too difficult, costly & time-intensive to transfer the data into the Test environment and apply masking rules there.
  • We discovered that it would take 32 to 48 hours to copy the “majority” of the data from Production to UAT environments. This would copy most but not all of the data, creating the potential for leaving things behind. It would also take a further 8 hours to run the system’s own obfuscation processes.
  • Masking, not obscuring. We needed to create human-readable values, e.g. Mr. Smith converts to Mr. Jones. This was not available from the trading platform’s masking function.
  • Defined values. We needed to create predictable values, such as a telephone number in a set format or a date of birth.
  • There was a lack of documentation regarding the location of personally identifiable data. This could result in the process missing part of the system if we processed the whole database.
  • We had a requirement to build in a verification process, comparing the masked data against the source. This report would answer the question – “have we missed masking any records?”
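To illustrate the difference between masking and obscuring, the sketch below shows one way human-readable and fixed-format values could be produced. It is not how Redgate Data Masker works internally; the function names, the small surname pool and the choice to keep the year of birth are all assumptions made for the example. The phone format uses the 07700 900xxx range that Ofcom sets aside for drama use in the UK, so masked numbers cannot belong to real people.

```python
import hashlib
from datetime import date

# Hypothetical replacement pool; a real rule set would use a much larger list.
SURNAMES = ["Jones", "Taylor", "Brown", "Wilson", "Evans", "Walker"]


def _bucket(value: str, size: int) -> int:
    """Map a value to a stable index so the same input always masks the same way."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def mask_surname(surname: str) -> str:
    """Swap a real surname for a readable substitute, never returning the original."""
    cleaned = surname.strip().lower()
    index = _bucket(cleaned, len(SURNAMES))
    candidate = SURNAMES[index]
    if candidate.lower() == cleaned:
        # The pool happened to contain the real name: step to the next entry.
        candidate = SURNAMES[(index + 1) % len(SURNAMES)]
    return candidate


def mask_phone(phone: str) -> str:
    """Return a number in a set format within the 07700 900xxx drama range."""
    return f"07700 900{_bucket(phone, 1000):03d}"


def mask_date_of_birth(dob: date) -> date:
    """Keep the year (so ages stay realistic) but fix the day and month."""
    return date(dob.year, 1, 1)


print(mask_surname("Smith"), mask_phone("020 7946 0000"), mask_date_of_birth(date(1980, 6, 15)))
```

Deriving the substitute from a hash keeps values consistent wherever the same person appears, which matters when data is joined across tables. An unsalted hash over a small pool is only suitable for illustration; a production rule set would add a secret salt or use a dedicated tool.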

We created a simple plan to extract the data, load it into a SQL database and then mask it. Taking only the required data made more efficient use of storage and reduced processing time. This would allow the Client’s development team to export the masked data and transfer it into the Client Portal.

Choosing The Right Tool

Identifying the data was a difficult manual process because of the core system’s table/column naming convention. Engaging Data’s consultant used the WhereScape 3D product, which documented the structure of the system into a metadata layer. The consultant worked with the business teams to update the metadata layer & highlight fields that contained personally identifiable data. In addition, we added business definitions. Using an agile approach, we agreed the type of data masking required for each column, along with how data was joined and stored/reused in different tables. Helpfully, WhereScape 3D provided all the known diagrams and suggested relationships, helping to reduce the investigation time.

At the end of this exercise, WhereScape 3D produced detailed documents of the core system’s data structure as well as analysis of the data cardinality/profiles. It uncovered some interesting points about the system, including some parts that held personally identifiable data that the client had not known existed.
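The metadata layer described above can be pictured as a simple catalogue of columns, each carrying a business definition, a flag for personally identifiable data and the agreed masking rule. The sketch below is a hypothetical illustration in Python, not WhereScape 3D’s actual format; the table and column names are invented to mimic an opaque naming convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnMetadata:
    table: str
    column: str
    business_definition: str
    contains_pii: bool = False
    masking_rule: Optional[str] = None  # e.g. "surname", "phone", "date_of_birth"


# Invented entries: cryptic physical names paired with agreed business meanings.
CATALOGUE = [
    ColumnMetadata("T014", "F07", "Client surname", True, "surname"),
    ColumnMetadata("T014", "F11", "Client mobile number", True, "phone"),
    ColumnMetadata("T014", "F02", "Account open date"),
    ColumnMetadata("T022", "F05", "Adviser notes", True, None),
]


def pii_columns(catalogue: List[ColumnMetadata]) -> List[ColumnMetadata]:
    """Columns flagged as holding personally identifiable data."""
    return [c for c in catalogue if c.contains_pii]


def unassigned_pii(catalogue: List[ColumnMetadata]) -> List[ColumnMetadata]:
    """Flagged columns with no agreed masking rule yet: these still need a decision."""
    return [c for c in catalogue if c.contains_pii and c.masking_rule is None]


for col in unassigned_pii(CATALOGUE):
    print(f"No masking rule agreed for {col.table}.{col.column} ({col.business_definition})")
```

Keeping the rules alongside the definitions means the extraction, masking and verification steps can all be driven from the same list rather than from hand-maintained scripts.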

Putting The Data Masking Solution Together

Using the information within the metadata, WhereScape Red imported the physical structure of the system and automated the extraction of data into a SQL database on a scheduled basis. We started off daily, but later increased this to every hour.

Now that the data was at rest in the SQL database, our consultant used Redgate’s Data Masker to convert the personally identifiable data to a masked data set, based on the agreed rules held within the metadata. Once the rules had been designed, the WhereScape Red scheduler automated the masking so that it started as soon as the loading had completed.

Data processing, including masking and loading into the target database, initially took place within 4 hours. This was not too onerous and very timely compared to other options. More importantly, we later reduced processing time by a further hour.

Did The Data Masking Work?

Using WhereScape Red, the Engaging Data consultant was able to build a comparison process that utilised the metadata (using only those fields marked as containing personally identifiable data) and compared the values before and after the process.

The process ends with an automatic email of the data masking comparison report. This report contains a summary of field error analysis as well as the number of field errors per record. The latter was used to fail the process & prevent the data from being transferred to the target database. Automating this enabled the Client to feel confident that the process was working correctly.
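The comparison step can be sketched roughly as follows. This is an illustrative Python outline, not the process built in WhereScape Red: the row layout, the `id` key and the function names are assumptions. It treats any flagged field whose value is unchanged after masking as a field error, counts errors per record and per field, and raises an exception to stop the transfer.

```python
def compare_masked(source_rows, masked_rows, pii_fields, key="id"):
    """Return {record key: [PII fields whose value is unchanged after masking]}."""
    masked_by_key = {row[key]: row for row in masked_rows}
    errors = {}
    for src in source_rows:
        masked = masked_by_key.get(src[key])
        if masked is None:
            continue  # record was not loaded, so there is nothing to compare
        unchanged = [
            f for f in pii_fields
            if src.get(f) is not None and src.get(f) == masked.get(f)
        ]
        if unchanged:
            errors[src[key]] = unchanged
    return errors


def summarise(errors):
    """Count field errors per field across all records, for the report summary."""
    summary = {}
    for fields in errors.values():
        for f in fields:
            summary[f] = summary.get(f, 0) + 1
    return summary


def verify_or_fail(source_rows, masked_rows, pii_fields, key="id"):
    """Raise if any record still holds an unmasked value, blocking the transfer."""
    errors = compare_masked(source_rows, masked_rows, pii_fields, key)
    if errors:
        raise RuntimeError(f"Masking check failed: {summarise(errors)}")
    return True


# Example usage with made-up rows: record 2 still has its real surname.
source = [{"id": 1, "surname": "Smith"}, {"id": 2, "surname": "Patel"}]
masked = [{"id": 1, "surname": "Jones"}, {"id": 2, "surname": "Patel"}]
print(compare_masked(source, masked, ["surname"]))
```

A check like this can raise false alarms when a masked value legitimately coincides with the original (a date of birth that was already 1 January, say), so a real report would need a way to review or whitelist such cases.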

In Conclusion

All sorts of tools can be used to mask data. We find the best of them will automate the process, allowing you to decide how to mask, when to mask & how frequently to do it.


If you would like to learn more about Redgate’s Data Masker, WhereScape Red or how we can help with your data project, please feel free to contact office@engagingdata.co.uk