Data science employs scientific methods to extract information from data.
Data science is multidisciplinary. Setting up the processes and systems used to collect data generally falls to computer scientists, who employ algorithms developed by mathematicians to gain insight into the relationships between objects or events. Data science has many practical applications across business disciplines that can help people make more informed decisions. In this article I explain how a business user might think through a problem, through the eyes of a data scientist, to understand where data science can be of benefit and where to exercise caution in relying on it.
Framing the Right Question
Data scientists use algorithms to try to answer questions using data. Not all questions can be answered with data, so it is important to understand how to frame a question in a way that an algorithm can be used to support a decision.
As a simple example, a sales department could ask a broad, open-ended question, such as “how can we get more customers?” They could alternatively frame it as a pointed question such as “are we generating more prospects from advertisement A or advertisement B?”
Reframing the question allows the organization to use data to support their decision-making process.
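To make the pointed question concrete, here is a minimal sketch of how the advertisement comparison could be checked, assuming each advertisement has a count of people who saw it and a count who became prospects. The counts below are made up for illustration, and the two-proportion z-test is only one reasonable choice of method, not the only one.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both ads perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: prospects generated and people reached per advertisement.
z, p = two_proportion_z_test(conv_a=120, n_a=2000, conv_b=90, n_b=2000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value suggests the difference between the two advertisements is unlikely to be chance alone. It does not say why one advertisement performs better, which is a question the data may not be able to answer.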
Selecting the Right Algorithm
In order to frame a question correctly it is important to understand the different flavors of algorithms available to data scientists. In business use cases, we can think about algorithms as being grouped in one of these categories:
- Classification Algorithms – Should I choose one thing or another, or which option should I choose from several options?
- Anomaly Detection Algorithms – Does something not look correct in this data set? Is there an anomaly?
- Regression Algorithms – How much cash will I have in the bank next Friday or how many customers will come into the store tonight?
- Clustering Algorithms – How is this information best organized? Are there logical subgroupings that can help us gain better insight into the data set?
- Reinforcement Learning Algorithms – What should I do with this new information that has come to my attention?
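As one illustration of these categories, the anomaly detection question ("does something not look correct in this data set?") can be sketched in a few lines. The daily sales figures and the threshold of two standard deviations are assumptions chosen for the example. Real anomaly detection usually needs more data and a more robust method.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily sales counts; the last day looks unusual.
daily_sales = [102, 98, 101, 97, 103, 99, 100, 250]
anomalies = find_anomalies(daily_sales)
print("Values worth a second look:", anomalies)
```

The output is a prompt for investigation rather than a verdict: the unusual day could be a data entry error or a genuinely successful promotion.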
Preparing the Data
Once you have framed the question correctly and determined the algorithm that is going to help you answer that question, you must determine whether you have collected the data required to fuel the engine, so to speak. In many real-world scenarios this is where a cost-benefit analysis may end a prospective project. Often the cost of cleansing historic data exceeds the benefit of the insight garnered from it.
Although a basic calculus course explains the concepts of differential and integral analysis, in real-world applications the algorithms that answer questions are multivariate. The study of limits and continuity in multivariable calculus yields many counterintuitive results not demonstrated by single-variable functions. This is a complex way of saying that it is critical to capture all significant contributing data points. For instance, one cannot evaluate the growth of a bacterial population by monitoring humidity alone, because temperature also affects the growth pattern.
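The bacteria example can be sketched with synthetic numbers. The growth rule below (0.5 times humidity plus 2 times temperature) is purely hypothetical and exists only to show what happens when a contributing variable is left out. It is not a biological model.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys on a single variable xs.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical observations: every humidity level is paired with every temperature.
humidity, temperature, growth = [], [], []
for h in (40, 60, 80):
    for t in (20, 30, 40):
        humidity.append(h)
        temperature.append(t)
        growth.append(0.5 * h + 2.0 * t)  # assumed rule, for illustration only

_, _, r2_humidity = fit_line(humidity, growth)
_, _, r2_temperature = fit_line(temperature, growth)
print(f"Variation explained by humidity alone:    {r2_humidity:.2f}")
print(f"Variation explained by temperature alone: {r2_temperature:.2f}")
```

In this synthetic data, humidity alone accounts for only about 20% of the variation in growth, even though growth is fully determined once temperature is also known. An analyst who collected only humidity readings would see a weak, noisy relationship and might draw the wrong conclusion.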
Interpreting the Results
Real-world applications require multivariate analysis and an analyst who is conscious of missing data points or noise introduced by incomplete or inaccurate data sets. The analyst must evaluate the precision and accuracy of the results.
A data set and algorithm may produce an output that is highly focused and consistent, yet consistently wrong. This result is precise, but it is inaccurate. In other cases, a data set and algorithm may produce results that are accurate on average but widely scattered. In this case the answer is technically not ‘wrong’, but it is not precise enough for us to draw a conclusion.
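The distinction can be made measurable. In the sketch below, accuracy is summarized as bias (how far the average result sits from the true value) and precision as spread (the standard deviation of the results). The true value and both sets of estimates are invented for illustration.

```python
import statistics

def bias_and_spread(estimates, true_value):
    """Return (bias, spread): average error and standard deviation of the estimates."""
    bias = statistics.mean(estimates) - true_value
    spread = statistics.stdev(estimates)
    return bias, spread

TRUE_VALUE = 100  # hypothetical quantity being estimated

# Tightly grouped but consistently too high: precise, not accurate.
precise_but_wrong = [120, 121, 119, 120, 120]
# Centered on the true value but scattered: accurate, not precise.
accurate_but_scattered = [70, 130, 95, 110, 95]

bias_a, spread_a = bias_and_spread(precise_but_wrong, TRUE_VALUE)
bias_b, spread_b = bias_and_spread(accurate_but_scattered, TRUE_VALUE)
print(f"Precise but inaccurate: bias = {bias_a:.1f}, spread = {spread_a:.1f}")
print(f"Accurate but imprecise: bias = {bias_b:.1f}, spread = {spread_b:.1f}")
```

In practice the true value is usually unknown, so bias has to be estimated against a trusted benchmark or a held-back sample of known outcomes.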
Data scientists can apply different algorithms to the data set, but if the set is incomplete or inaccurate, the results may be misinterpreted or incomplete. As we have pointed out previously, without big data, there is no artificial intelligence. With the wrong data, or the wrong interpretation of the output, you may end up with an inaccurate conclusion to your question.