“You see, but you do not observe.” Sherlock Holmes, A Scandal in Bohemia.
Knowledge, not mere data, is the real deal. Data is abundant in organizations, but not always in the structure and form that can be easily processed, analyzed and leveraged. As a result, companies end up ignoring, intentionally or otherwise, most of the data that they create and accumulate over time. It is this ignored data that can enable them to understand complex business phenomena, improve decision-making, and help them lead in the age of AI.
What is Dark Data?
Dark Data refers to the entire gamut of data assets that organizations create, compile or collect as part of their day-to-day business activities but do not leverage for decision making, organization development, revenue generation, or other critical purposes. This is primarily because organizations regard, rightly or wrongly, most of this data as irrelevant; cannot access or use it due to limitations in processing or technology; or are simply not aware of it in the first place. Most industry and research reports posit that this Dark Data could be as high as 90% of the total data assets of a typical organization.
Dark Data covers a wide range of data assets, such as application and network logs, audio and video data, documents and presentations, emails and messages, financial statements, images, minutes of meetings, sensor and telemetry data, and archived data from business processes (e.g., ex-employee information, old compliance or legal data, obsolete product information, etc.). This data is often non-indexed, sometimes incomplete, and mostly unstructured, such as the information that is buried within texts, tables and figures.
Critical information is often hidden within Dark Data in the form of patterns: simple and complex, frequent and infrequent, discrete and sequential. Discovering and decoding these patterns helps companies address a wide range of critical questions, such as:
- Where are the inefficiencies and leakages in our Supply Chain operations?
- How can we achieve more accurate customer journey mapping?
- Can new revenue streams be created from our existing business operations?
- What do our employees feel about our employment policies?
Why is Dark Data becoming increasingly significant?
Firstly, data is regarded as the fuel or electricity or digital currency that drives business success in the era of Artificial Intelligence. Companies that access, control, manage and leverage their data assets better, both quantitatively and qualitatively, will end up leading their respective competitive landscapes. Consequently, it is important for organizations to tap into their huge volumes of Dark Data assets, and arm themselves with deeper insights and better decision-making capabilities than their competitors.
Secondly, data privacy and security regulations are gradually getting more stringent. The mismanagement of Dark Data may lead to regulatory issues, non-compliance problems, and financial or legal liabilities. There is also the risk of leakage of sensitive and confidential company information (such as business plans or intellectual property data), which would reduce the company’s competitive edge in the marketplace.
A deep understanding of both ‘known and unknown’ patterns in hidden data assets is critical to unlocking real value from data that resides within an organization. It can lead to improved risk management, critical insight generation, enhanced product development or service delivery, greater cost savings, and several other benefits. As the data assets of organizations grow exponentially over time, so will the quantum of Dark Data. Consequently, it will become imperative for companies to tap into their data assets for any form of decision-making or business operations management.
Companies have been investing heavily in analytics and data science, including creating positions of Chief Analytics Officers and Chief Data Officers. However, the fact remains that most companies are still behind the curve in being able to extract optimal business value from their data assets. Perhaps the most important reason for this is the inability to mine 80% to 90% of the total data assets – the Dark Data.
Creating ‘real’ value in the Data Economy is not easy. Companies need to go beyond Analytics and traditional Data Science, and adopt new age practices and strategies.
Challenges in Dark Data Engineering
The most common method of capturing and storing Dark Data today is through Enterprise Data Lakes. However, many companies rely on basic data engineering practices that often end up being sub-optimal. Furthermore, there are technical limitations while dealing with the problem of Dark Data, particularly because it is an extremely heterogeneous entity.
Here are the main challenges in Dark Data Engineering:
- Dark Data Cleaning is a challenge, both conceptually and technically. Firstly, there are technical limitations for enforcing quality constraints on this type of data. Secondly, there are different opinions on what should be in-scope/out-of-scope in this cleaning.
- Dark Data Ingestion and Extraction are slow processes that involve significant human effort in almost every domain. Limited innovation has been witnessed in these two areas.
- Dark Data Integration is arguably the most critical challenge. Integrating a myriad of data forms that are ingested through multiple sources is a highly complex problem that has no ready-made solution. Furthermore, this problem needs advanced engineering skills for both solution designing and execution.
- Metadata Management is a challenging area in Big Data, and the problem gets even bigger in the case of Dark Data. As a result, certain functionalities that are vital for Dark Data Engineering, such as topic-based data discovery and domain-specific metadata querying, cannot be efficiently developed or implemented.
- Data Lakes generally do not support advanced features like sophisticated indexing or on-demand query response. As a result, there are limitations in leveraging Data Lakes for Dark Data Engineering. Furthermore, poorly architected Data Lakes end up more like Data Swamps, thus yielding only a fraction of the intended benefits.
- Embedded Data Engineering, generally used for IoT applications, has its own technical limitations in terms of bandwidth management and parallel processing.
The above is only a high-level and partial list. Each Dark Data project is unique, and throws up multiple dimensions of the above engineering challenges.
Introduction to Pattern Mining
Pattern Mining refers to the process of decoding meaningful and structured knowledge (patterns) that are hidden within raw data. Traditionally, researchers and data scientists have largely focused on Frequent Pattern Mining, where the goal is to understand associations and relationships that frequently appear in datasets. Most of the classical pattern mining techniques and algorithms were developed in the late 1990s and 2000s.
This section highlights the main pattern mining techniques and their associated algorithms. Note that pattern mining techniques do not work in isolation, but are often deployed in conjunction with Machine Learning approaches such as Supervised and Unsupervised Learning.
Association Rule Mining is perhaps the most prominent pattern mining technique, and forms the basis of most other forms of pattern mining. The best example of its application is Market Basket Analysis, where customer buying behaviour is studied by analyzing itemsets that are frequently bought together. The main algorithms are Apriori, Eclat and FP-Growth (Frequent Pattern Growth).
Sequential Pattern Mining (SPM), particularly of the frequent kind, is one of the most widely applied pattern mining methods. Its objective is to mine data that occurs as a sequence, such as customer shopping sequences, disease treatment data, DNA sequences, geological data of natural disasters, and others. The main algorithms are as follows:
- Apriori-based methods: GSP (Generalized Sequential Patterns), AprioriAll
- Vertical Format-based: SPADE (Sequential PAttern Discovery using Equivalence classes), SPAM (Sequential PAttern Mining)
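As a rough illustration of the level-wise idea behind Apriori, the sketch below counts frequent itemsets and then computes the confidence of a single rule. The basket data, the function names and the support threshold are hypothetical and chosen only for illustration; it is a minimal sketch, not a production implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions.

    Level-wise (Apriori-style) search: a k-itemset is counted only if
    every one of its (k-1)-subsets was already found to be frequent.
    """
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        single = frozenset([item])
        c = count(single)
        if c >= min_count:
            current[single] = c

    result = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for cand in candidates:
            c = count(cand)
            if c >= min_count:
                current[cand] = c
        result.update(current)
        k += 1
    return result


def confidence(counts, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, from itemset counts."""
    a = frozenset(antecedent)
    return counts[a | frozenset(consequent)] / counts[a]


# Hypothetical market baskets (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

counts = frequent_itemsets(baskets, min_count=3)
rule_confidence = confidence(counts, {"diapers"}, {"beer"})
print(len(counts), rule_confidence)
```

With this toy data, three of the four baskets that contain diapers also contain beer, so the rule "diapers -> beer" has a confidence of 3/4.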
- Pattern Growth-based: FreeSpan, PrefixSpan
- Constraint-Based Methods: GTC (Graph for Time Constraints), SPIRIT (Sequential pattern mining with regular expression constraints)
- Closed Sequential Pattern Mining: CloSpan, ClaSP
- Sequential Pattern Mining in Data Streams: SS-BE (Stream Sequence miner using Bounded Error), SS-MB (Stream Sequence miner using Memory Bounds)
- Other methods: IncSpan (Incremental Mining of Sequential Patterns), Mining Closed Repetitive Gapped Subsequences
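To make the pattern-growth idea more concrete, here is a simplified sketch in the spirit of PrefixSpan. It assumes that every element of a sequence is a single item (the published algorithm also handles itemsets as elements), and the customer journeys, function name and threshold are hypothetical.

```python
from collections import defaultdict


def prefixspan(sequences, min_count):
    """Return {pattern: count} for frequent sequential patterns (as tuples).

    Simplifying assumption: each sequence element is a single item.
    A pattern is supported by a sequence if it occurs as a subsequence
    (order preserved, gaps allowed).
    """
    sequences = [list(s) for s in sequences]
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: the
        # suffix of each sequence that remains after matching the prefix.
        support = defaultdict(int)
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                support[item] += 1

        for item, count in sorted(support.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project on the first occurrence of the item in each suffix.
            new_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue
                new_projected.append((sid, pos + 1))
            mine(pattern, new_projected)

    mine((), [(sid, 0) for sid in range(len(sequences))])
    return results


# Hypothetical customer journeys (illustrative data only).
journeys = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart"],
    ["search", "product", "checkout"],
    ["home", "search", "cart", "checkout"],
]

patterns = prefixspan(journeys, min_count=3)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

With a threshold of three journeys, the only frequent two-step patterns in this toy data are "home, then cart" and "search, then checkout". Each recursion works on progressively shorter suffixes, which is what lets pattern-growth methods avoid generating and testing large candidate sets.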
Multi-dimensional Pattern Mining involves the study of patterns that are associated with multiple dimensions. Some of the important algorithms are UNISEQ (Uniform Sequential), SeqDIM, DIMSeq and Songram.
Infrequent Pattern Mining refers to the mining of patterns that occur infrequently but can provide useful information. The main algorithms include CARM (Confabulation-Inspired Association Rule Mining), MIIM (Minimum Infrequent Itemset Mining), PandNAR (Positive and Negative Association Rule), Pattern-Growth Paradigm and Residual Trees, and RAR (Rare Association Rule). Negative Pattern Mining and Rare Pattern Mining are variants of this technique.
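One common way to frame infrequent pattern mining is to look for minimal infrequent itemsets: itemsets that fall below the support threshold although every proper subset meets it. The sketch below is a small level-wise search under that definition, not an implementation of any of the named algorithms; the data, names and threshold are hypothetical.

```python
from itertools import combinations


def minimal_infrequent_itemsets(transactions, min_count):
    """Return {itemset: count} for itemsets that are infrequent
    (count < min_count) while all of their proper subsets are frequent."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    rare = {}
    level = set()  # frequent itemsets of the current size

    # Size 1: an infrequent single item is minimal by definition.
    for item in sorted({i for t in transactions for i in t}):
        single = frozenset([item])
        c = count(single)
        if c >= min_count:
            level.add(single)
        else:
            rare[single] = c

    k = 2
    while level:
        frequent_items = sorted({i for s in level for i in s})
        next_level = set()
        for combo in combinations(frequent_items, k):
            # Only consider candidates whose (k-1)-subsets are all frequent;
            # by downward closure every smaller subset is then frequent too.
            if not all(frozenset(sub) in level for sub in combinations(combo, k - 1)):
                continue
            cand = frozenset(combo)
            c = count(cand)
            if c >= min_count:
                next_level.add(cand)
            else:
                rare[cand] = c
        level = next_level
        k += 1
    return rare


# Hypothetical market baskets (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

rare = minimal_infrequent_itemsets(baskets, min_count=3)
for itemset, c in rare.items():
    print(sorted(itemset), c)
```

In this toy data, bread, milk and diapers are each frequent, and so is every pair of them, yet the three together appear in only two baskets. That combination is therefore reported as a minimal infrequent itemset.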
Other mining techniques include:
- Graph Mining/Subgraph Mining: Frequent Graph Mining (FGM), Path Mining, Frequent Subgraph Mining (FSG), gSpan (Graph-based Substructure Pattern Mining).
- Subtree Pattern Mining: TreeMiner, FREQT, Gaston.
- High-Dimensional Data and Colossal Pattern Mining: Pattern Fusion.
- Compressed or Approximate Pattern Mining: Top-k Most Frequent Closed Patterns, Pattern Clustering.
- Collaborative Pattern Mining
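The compressed-pattern idea in the list above can be illustrated with closed itemsets: a frequent itemset is closed if no proper superset has the same support, so the closed sets summarize the full result without losing support information. The sketch below filters a table of frequent itemsets down to its closed members and then picks the top k by support. The support table and function names are hypothetical.

```python
def closed_itemsets(supports):
    """Keep only itemsets that have no proper superset with equal support."""
    return {
        itemset: count
        for itemset, count in supports.items()
        if not any(
            itemset < other and count == other_count
            for other, other_count in supports.items()
        )
    }


def top_k(supports, k):
    """The k itemsets with the highest support, as (itemset, count) pairs."""
    ranked = sorted(supports.items(), key=lambda pair: (-pair[1], sorted(pair[0])))
    return ranked[:k]


# Hypothetical frequent itemsets with their support counts, as a frequent
# itemset miner might produce them (minimum support of 2).
supports = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"c"}): 2,
    frozenset({"a", "b"}): 3,
    frozenset({"a", "c"}): 2,
}

closed = closed_itemsets(supports)
best = top_k(closed, 2)
print(closed)
print(best)
```

Here {b} is dropped because {a, b} has the same support, and {c} is dropped because {a, c} does. Three closed itemsets remain to describe all five frequent ones.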
Recent Innovations in Pattern Mining
This decade has witnessed tremendous advances in pattern mining. Some of these notable developments are highlighted below.
- Deep Learning-based innovations are gradually making it to the world of pattern mining, especially for high-dimensional data like images and videos, and for spatio-temporal pattern inference.
- Existing areas like Graph Mining, Compressed Pattern Mining, High Dimensional Data Mining, and Infrequent/Rare Pattern Mining continue to advance with the development of new techniques and algorithms.
- Distributed Sequential Pattern Mining is emerging as a major area of innovation, largely driven by the advances in parallel processing and distributed computing.
- High-Utility Pattern Mining, where the focus is on discovering patterns with high utility or importance in different kinds of data, is one of the major emerging areas.
- Maximal Sequential Pattern Mining is another emerging area where compact representations of patterns are generated for a better understanding of the results.
- Semantic Annotation of Frequent Patterns, where the goal is to obtain semantic information about the patterns in order to understand them better, is also emerging.
- Other innovations include new algorithms like CloFAST for Closed SPM; ProSecCo for Progressive SPM; Skopus for Top-k SPM; and CMAP (Co-occurrence MAP) based innovations, such as CM-SPAM, CM-SPADE and CM-ClaSP.
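High-Utility Pattern Mining, mentioned in the list above, differs from frequency-based mining in that each item carries a weight such as profit. The sketch below is a deliberately naive, brute-force illustration of the definition on toy data. It enumerates every itemset, so it would not scale; real algorithms rely on upper bounds to prune the search. The transactions, unit profits and threshold are hypothetical.

```python
from itertools import combinations


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Return {itemset: utility} for itemsets whose total utility is at
    least min_utility.

    The utility of an itemset in one transaction is the sum of
    quantity * unit profit over its items, counted only when the
    transaction contains every item of the itemset.
    """
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            total = sum(
                sum(t[item] * unit_profit[item] for item in combo)
                for t in transactions
                if all(item in t for item in combo)
            )
            if total >= min_utility:
                result[frozenset(combo)] = total
    return result


# Hypothetical data: each transaction maps an item to the quantity bought.
unit_profit = {"a": 5, "b": 2, "c": 1, "d": 10}
transactions = [
    {"a": 1, "b": 2},
    {"b": 4, "c": 3},
    {"a": 2, "c": 1, "d": 1},
    {"b": 1, "d": 2},
]

hui = high_utility_itemsets(transactions, unit_profit, min_utility=20)
for itemset, utility in hui.items():
    print(sorted(itemset), utility)
```

Note that item "a" alone has a utility of 15 and misses the threshold, while the itemset {"a", "d"} reaches 20. Utility is not anti-monotone the way support is, which is one reason this area needs its own pruning strategies.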
Challenges in Pattern Mining
Despite recent advances and innovations in pattern mining, significant challenges still remain. Some of these are technical/structural, and some are architectural/procedural in nature. Here is a list of the main challenges.
- The distinction between frequent and infrequent/rare patterns in Dark Data is often ambiguous. It is highly dependent on domain and context, and this ambiguity often leads to the misapplication of pattern mining techniques, resulting in poor analysis and inaccurate results.
- Infrequent/Rare Pattern Mining techniques are still not as evolved as their frequent counterparts. This is especially true when singular datasets have high variability or multiple data distributions.
- Large-scale pattern mining, especially in high-performance environments, is a major challenge. Efforts are in progress to develop Massively Parallel Processing (MPP) systems for pattern mining, but a lot more work is needed on this front.
- Pattern mining of complex data types, such as spatio-temporal sequence data, stream data or uncertain data, has serious limitations.
- Deep Learning-based pattern mining is still in its formative stages. Several neural network architectures have been developed for sequential mining, but these have yet to mature for production use.
- Real-time Dark Data pattern mining capabilities do not generally exist. Data Lakes, despite claims made by companies, are only intermediate repositories today, and do not serve as primary sources of analysis. The Dark Data gets processed first, and is used for pattern mining only at a later stage.
- The technical limitations of emerging approaches, such as online progressive mining or incremental/decremental mining, are yet to be addressed. As more innovative mining approaches are developed, these limitations continue to grow.
- Other challenges include data privacy, the lack of advanced visualization, and exception handling.
Closing Comments
Is pattern mining from Dark Data difficult to achieve? Yes, at least at the moment.
So, should it be an immediate focus area for organizations? Absolutely, if companies aim to be Disruptors (and not ‘The Disrupted’). The earlier they start the exercise of tapping into their Dark Data, the greater their progress will be over the next few years.
OK, how should companies initiate this process then? The first step should be to plan the company’s overall data strategy, followed by the development of a two-to-three-year data science (or analytics) roadmap, and subsequently the execution of this roadmap in a phased manner. Advanced levels of Data Engineering and Machine Learning competencies are needed. Short-term failures are inevitable. The focus should be on gradual progress over time rather than a big-bang approach.
Value-creation from Dark Data pattern mining is a slow process despite enormous benefits. Hence, companies need to start this journey now to remain ahead of their peers.
Addendum:
Experienced bird-watchers (or birders) are known to excel in pattern mining. These birders train their brains to easily differentiate among the dozens of complex sounds (calls) that each species of bird produces, at different stages of their lives, in different places, and at different times. Their approach to pattern mining (for bird-watching) is to treat it as a form of science, i.e., constant experimentation and testing. For instance, they focus a lot on understanding pattern differences, and not just on pattern similarities. More details are available in The Sibley Guide to Birds by David Sibley, one of the best-known books for ornithologists.
Businesses often end up with sub-optimal results from their Dark Data pattern mining programs. Apart from the challenges and limitations stated earlier in this paper, a major reason for sub-optimal results is that many businesses still treat pattern mining more as an art than as a science. Bird-watching techniques offer an excellent example for designing best-in-class pattern mining approaches.