Finding Structure In An Unstructured World (Of Data)
June 29, 2018
The practice of medicine builds on observations and findings from past cases and research. A scientific process of testing novel discoveries, translating them into new standards of care when outcomes are positive, and continuously monitoring current standards of care to drive further improvement and discovery forms the basis of a powerful, evidence-based learning system that ensures we keep advancing how healthcare is delivered.
To do this, however, we need access to data that can be analyzed appropriately to generate meaningful observations. Findings from limited clinical trial groups do not always translate as expected to broader populations, so the more avenues we have for sharing our findings and testing our hypotheses against larger datasets, the more confident we can be in their reliability, safety, and applicability to the wider population. There has been tremendous growth not only in the number of information sources that support daily operational workflows at institutions, but also in the amounts of data these systems generate. This creates value as well as challenges.
Compounding the challenge, data from multiple sources generally have varying data models and ontologies and often comprise both structured and unstructured data. Unstructured data refers to data without a pre-defined data model, which therefore requires handling quite different from that of structured data, the typical output of relational databases. Examples of unstructured data include notes, documents, and images. In fact, it is estimated that an astounding 80% of clinical information is stored as unstructured data. While it holds a wealth of information, unstructured data has traditionally been challenging to analyze.
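To make the distinction concrete, here is a minimal sketch (in Python, with made-up values and field names) contrasting the same blood-pressure reading stored as a structured record, which can be queried directly, with the same information buried in a free-text note, which cannot.

```python
# Structured: a pre-defined model, so individual fields can be queried directly.
structured_record = {
    "patient_id": "12345",     # hypothetical identifier
    "systolic_bp": 142,        # mmHg
    "diastolic_bp": 91,        # mmHg
    "recorded": "2018-06-29",
}
print(structured_record["systolic_bp"])  # -> 142, no parsing required

# Unstructured: the same clinical fact embedded in free text. There is no
# field to select, so the value must be extracted before it can be analyzed.
clinical_note = (
    "Patient seen today for follow-up. Blood pressure elevated at 142/91; "
    "will recheck in two weeks."
)
```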
So, on the one hand, there is an explosion of data and more data than ever before is available to our healthcare community; on the other hand, we face the daunting task of devising efficient methods for making meaningful use of all this information. The challenge is shifting from “how do we collect appropriate data to help us make scientific progress” to “how do we process and analyze all the data that we now have to help us make scientific progress”. And we must do this in a way that is timely, reliable, and repeatable.
To analyze these massive amounts of data of disparate types and sources, institutions often build data warehouses, a long and expensive undertaking, only to have to adapt the warehouse structure as their internal systems, or external regulations and standards, evolve. In addition, the data stores at one institution are typically not aligned with those at other institutions, limiting the value that could be obtained from a network of data stores and the far greater insight a much richer data-pool would generate. But emerging technologies are being applied by forward-thinking institutions to make working with this kind of diverse data-pool practical. Some factors contributing to this momentum are:
1) Better data storage options: Institutions today have far more options for storing both structured and unstructured data in ways that support efficient processing and analysis across the combined data-pool. Cloud platforms have made this an easier and more economical option for institutions of all types and sizes.
2) More efficient opportunities for integration: A federated approach is often an efficient way to leverage data across multiple sources while allowing the individual systems to remain optimized for their respective functions. Opportunities for system integration have increased across the board, helped in part by the more widespread availability of Application Programming Interfaces (APIs) and the continuing evolution of global data exchange standards.
3) Improved data processing technologies: Newer processing technologies for Big Data, such as Optical Character Recognition (OCR), Natural Language Processing (NLP), and machine learning, make it easier to rapidly process large amounts of data while also integrating patterns and information from unstructured data into decision-making pathways and other analytical solutions (a rough sketch of this idea follows the list).
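As a rough illustration of point 3, the sketch below uses simple pattern matching as a stand-in for a full NLP pipeline (the note text, function name, and output fields are all hypothetical) to pull a structured blood-pressure reading out of a free-text note like the one shown earlier, so it can join the rest of the data-pool.

```python
import re

# Hypothetical free-text note; in practice this would come from an EHR
# export or a document store rather than being hard-coded.
clinical_note = (
    "Patient seen today for follow-up. Blood pressure elevated at 142/91; "
    "will recheck in two weeks and consider adjusting lisinopril."
)

def extract_blood_pressure(note):
    """Return (systolic, diastolic) if a reading like '142/91' is present.

    A production pipeline would use a clinical NLP toolkit and handle units,
    negation, and multiple readings; this regex is only a stand-in.
    """
    match = re.search(r"\b(\d{2,3})\s*/\s*(\d{2,3})\b", note)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None

reading = extract_blood_pressure(clinical_note)
if reading:
    systolic, diastolic = reading
    # Once extracted, the values can sit alongside structured data for analysis.
    print({"systolic_bp": systolic, "diastolic_bp": diastolic})
```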
Despite the advances being made in this area, bringing structure to a diverse data-pool in a manner that creates value remains a task that can easily become a highly complex project. The key is to approach it in a stepwise manner: identify the specific problems you are looking to tackle in the near future, then identify the systems that will serve as the sources of that data, and integrate additional systems and datasets gradually. As you create a federated network of information and apply processing technologies to the data-pool, illuminating patterns and trends begin to emerge, yielding valuable insights that can guide you into the future.