A Data Scientific Method

February 19, 2019

How to take a pragmatic and goal-driven approach to data science.

 

The main aim of data science is simple: it is to extract value from data. This value could be of different forms in different contexts — but, usually, it comes in the form of better decisions.

 

 

As we venture ever further into the 21st century, the role that data plays in the decisions we make is becoming ever larger. This is because of the sheer volumetric increase in available data, as well as the improvements in the tools we can use to store, process and analyse it.

 

However, in the grander scheme of things, the field of data science is still in its infancy. It has emerged at the intersection of several other disciplines — statistics, business intelligence, and computer science, to name a few; and as those fields continue to evolve rapidly, so too does data science.

 

Therefore, it is important to clearly formulate an approach to making better decisions from data, so that it can be applied to new problems methodically. Sure, the process may start out as ‘firing shots into the dark’ — but you at least want your shots to become more accurate over time; you want your process to improve with each project.

 

At Gaussian Engineering we have done a number of projects with the express aim of extracting value from data. This post attempts to document some of what we have learned and to formulate a process for doing data science; the inspiration for our approach is the tried-and-tested scientific method…

 

The Scientific Method

The scientific method is a procedure that has characterised natural science since the 17th century; it consists of a series of systematic steps which ultimately aim to either validate or reject a statement (a hypothesis).

 

 

The steps go something like this:

 

1. Observe → Make an observation

2. Question → Ask questions about the observation, gather information

3. Hypothesise → Form a hypothesis (a statement that attempts to explain the observation) and make some predictions based on it

4. Test → Test the hypothesis (and predictions) using a reproducible experiment

5. Conclude → Analyse the results and draw conclusions, thereby accepting or rejecting the hypothesis

6. Redo → Reproduce the experiment enough times to ensure there is no inconsistency between observations/results and theory.

 

As an example, imagine that you have just gotten home from school or work; you turn on your bedroom light, and nothing happens! How could we use the Scientific Method to determine the problem? You might observe that the light does not turn on, question whether the bulb, the switch or the power supply is at fault, hypothesise that the bulb has burnt out, test this by replacing the bulb, and conclude based on whether the new bulb works; if it does not, redo the process with the next most likely culprit.

 

 

 

Science is a methodology for increasing understanding, and the scientific method can be seen as an iterative way to standardise the process of conducting experiments, so that experiments produce more valuable, reliable results — and, therefore, better understanding.

 

In a similar manner, we would like a standardised methodology for data science; that is, a method which prioritises obtaining information that is relevant to the goal of the analysis.

 

 

"If it disagrees with experiment, it's wrong. In that simple statement is the key to science" - Richard P. Feynman

 

 

The Data Scientific Method

 

 

At our organisation, Gaussian Engineering, we have arrived at a method that we feel works well for our projects. Like the scientific method, it is made up of 6 stages:

 

1. Identify 

2. Understand

3. Process

4. Analyse

5. Conclude

6. Communicate

 

Each of these stages is explained in more detail below, along with some of the tools/methodologies we use during each one (our team programs in Python and uses various open-source tooling, so excuse my bias in this area).

 

Identify

The “identify” stage is concerned with the formulation of the goal of the data science project; it could also be called the “planning” stage.

 

We find it immeasurably helpful to get a very clear sense of what we are trying to achieve through analysing the dataset in question. To borrow a term from the PAS 55 Physical Asset Management Standard, we try to ensure that our team has a ‘clear line of sight’ to the overall objectives of the project.

 

During this stage, we ask questions like:

  • What decisions need to be made from this data?

  • What questions do we wish to answer?

  • Can we formulate hypotheses relating to these questions? What are they?

  • How much time do we have for the exploration?

  • What decisions would the stakeholder like to make from this data?

  • What would the ideal result look like?

  • How are we to export and present the final results?

Some useful tools/methodologies for the ‘identify’ stage:

  • Workshops/brainstorming sessions

  • Creation of a designated space to keep related documents and findings together (a SharePoint site, Dropbox folder, etc.)

 

Understand

The “understand” stage is all about getting a general feel for the data itself.

Before we start losing ourselves in the details (diving into the various data sources, filtering on various fields, and walking the fine line between ‘value-added work’ and ‘analysis paralysis’), it is useful to ensure that the team has a bigger-picture understanding of what is there.

 

During this stage, we ask questions like:

  • What is the size of the data?

  • How many files are there?

  • To what extent does the data originate from different sources?

  • Automated exports or manual spreadsheets?

  • Does the data have consistent formats (dates, locations etc.)? 

  • What is the overall data quality, in terms of the six dimensions of data quality (completeness, uniqueness, timeliness, validity, accuracy and consistency)?

  • What is the level of cleaning required?

  • What do the various fields mean?

  • Are there areas in which bias could be an issue?

 

 

 

Understanding aspects of your data, such as its overall size, can help you decide how to go about your analysis: for smaller data you may wish to do everything in memory, using tools like Python, Jupyter and Pandas, or R; for larger data you may be better off moving it into an indexed SQL database (for larger data still, Hadoop and/or Apache Spark become options).
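As a minimal sketch of this kind of first pass (assuming the data has been exported to a CSV file, given the hypothetical name assets.csv here), Pandas makes it easy to get a quick overview:

```python
import pandas as pd

# Load the export (assets.csv is a hypothetical file name)
df = pd.read_csv("assets.csv")

# Overall size and approximate memory footprint
print(df.shape)                          # (rows, columns)
print(df.memory_usage(deep=True).sum())  # bytes in memory

# Field names, inferred types and non-null counts
df.info()

# A first look at the values themselves
print(df.head())
print(df.describe(include="all"))

# Early indicators of data quality: missing values and duplicates
print(df.isnull().sum().sort_values(ascending=False))
print(f"Duplicate rows: {df.duplicated().sum()}")
```

If the numbers that come back are unwieldy, that is usually the cue to move the data into an indexed database rather than working purely in memory.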

 

What is also particularly fun about this stage is that, if you have a clear line of sight to your goal, then as you gain a better understanding of the data you can determine which aspects of it are most important for the analysis; these are the areas where most of your effort can be directed first. This is especially helpful in projects where there are strict time constraints.

 


 

Some useful tools/methodologies for the ‘understand’ stage:

  • Workshops/brainstorming sessions

  • Python

  • Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualisations)

  • Numpy and Pandas (Python libraries)

  • Matplotlib and Seaborn (Python visualisation libraries that can help with viewing missing data; see the sketch after this list)

  • R (a programming language geared towards statistics)
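As an example of the kind of quick check these tools enable (a rough sketch, reusing the hypothetical assets.csv from the earlier snippet), a Seaborn heatmap of the null mask makes gaps in the data immediately visible:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical export from the earlier sketch
df = pd.read_csv("assets.csv")

# Plot the boolean null mask; missing values show up as contrasting bands per field
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing values per field")
plt.tight_layout()
plt.show()
```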

 

Process

The ‘process’ stage is all about getting your data into a state that is ready for analysis.

The words ‘cleaning’, ‘wrangling’ and ‘munging’ come to mind. 

 

A useful principle to bear in mind here is the Pareto Principle — or ‘80/20 Rule’:

 

 “for many events, roughly 80% of the effects come from 20% of the causes” 

— a common statement of the principle named after Vilfredo Pareto

 

 

The ‘process’ stage often takes up the most time; in light of the Pareto Principle, it is important to prioritise which aspects of the data you devote that time to. Focus on what you think is most important first, and come back to secondary fields only if necessary and if there is time to do so.

 

During this stage, we may do any or all of the following:

  • Combine all data into a single, indexed database (we use PostgreSQL)

  • Identify and remove data that is of no relevance to the defined project goal 

  • Identify and remove duplicates

  • Ensure that important data is consistent in terms of format (dates, times, locations)

  • Drop data that is clearly not in line with reality (outliers that are unlikely to be genuine measurements)

  • Fix structural errors (typos, inconsistent capitalisation)

  • Handle missing data (NaNs and nulls — either by dropping or interpolation, depending on the scenario)

The purpose of this stage is really to make your life easier during the ‘analyse’ stage; processing data usually takes a long time and can be relatively tedious work, but the results are well worth the effort. A short sketch of a few of these steps in Pandas follows.
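This is only a rough sketch (the column names date, location and temperature are hypothetical, and the right choices always depend on the dataset at hand):

```python
import pandas as pd

# Hypothetical export from the 'understand' stage
df = pd.read_csv("assets.csv")

# Remove exact duplicates
df = df.drop_duplicates()

# Make important fields consistent in format (hypothetical column names)
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["location"] = df["location"].str.strip().str.title()

# Drop readings that are clearly not in line with reality (hypothetical bounds),
# while keeping rows where the reading is simply missing
in_range = df["temperature"].between(-50, 150) | df["temperature"].isna()
df = df[in_range].copy()

# Handle missing data (NaNs and nulls): drop or interpolate, depending on the scenario
df = df.dropna(subset=["date"])                      # drop rows missing a key field
df["temperature"] = df["temperature"].interpolate()  # fill gaps in a numeric series
```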

 

Some useful tools/methodologies for the ‘process’ stage:

  • MySQL, SQLite or PostgreSQL

  • Python

  • Numpy and Pandas (Python libraries)

  • Matplotlib and Seaborn (Python visualisation libraries that can help with the viewing of missing data)

  • NLTK (the Natural Language Toolkit, another Python library)

     

Analyse

This stage is concerned with the actual analysis of the data: the process of inspecting, exploring and modelling it to find patterns and relationships that were previously unknown.

 

In the data value chain, this stage (along with the previous stage) is where the most significant value is added to the data itself. It is the transformative stage that changes the data into (potentially) usable information.

In this stage you may want to visualise your data quickly, attempting to identify relationships between different fields; you may want to explore how certain fields vary by location, or over time.

Ideally, in the ‘identify’ stage, you would have come up with several questions about what you would like to get out of this data, and perhaps even stated several hypotheses; this is the stage where you implement models to confirm or reject those hypotheses.

 

During this stage, we may do any or all of the following:

  • If there is time-based data, explore whether there exist trends in certain fields over time — usually using a time-based visualisation software such as Superset or Grafana

  • If there is location-based data, explore the relationships of certain fields by area — usually using mapping software such as Leaflet JS, and spatial querying (we use PostgreSQL with PostGIS)

  • Explore correlations (r values) between different fields

  • Classify text using natural language processing methods, such as the bag-of-words model (see the sketch after this list)

  • Implement various machine learning techniques in order to identify trends between multiple variables/fields — regression analyses can be useful

  • If there are many variables/fields, dimensionality reduction techniques (like Principal Component Analysis) can be used to reduce these to a smaller subset of variables that retain most of the information

  • Deep learning and neural networks have much potential, especially for much larger, structured datasets (though we have not yet made substantial use of these)
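As an illustration of the bag-of-words idea (a hedged sketch with made-up example notes and labels, not our production pipeline), scikit-learn makes a simple text classifier very compact:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-made training data: short maintenance notes and their labels
notes = [
    "pump bearing running hot",
    "oil leak at gearbox seal",
    "replaced faulty pressure sensor",
    "vibration detected on motor drive end",
]
labels = ["mechanical", "mechanical", "instrumentation", "mechanical"]

# Bag-of-words features feeding a naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(notes, labels)

# Classify a new, unseen note
print(model.predict(["sensor reading drifting out of range"]))
```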

The ‘analyse’ stage is where the rubber really meets the road; it is also the stage that shows off the sexier side of data science.
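To make two of the techniques above concrete (exploring correlations and reducing dimensionality with PCA), here is a rough sketch using scikit-learn, again assuming the hypothetical assets.csv produced by the ‘process’ stage:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical cleaned dataset from the 'process' stage
df = pd.read_csv("assets.csv")
numeric = df.select_dtypes(include="number").dropna()

# Pairwise correlation (r values) between the numeric fields
print(numeric.corr())

# Standardise, then reduce the numeric fields to two principal components
scaled = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# How much of the original variance the two components retain
print(pca.explained_variance_ratio_)
```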

 

Some useful tools/methodologies for the ‘analyse’ stage:

 

  • MySQL, SQLite or PostgreSQL (for querying, including spatial querying — for SQLite, see SpatiaLite)

  • JetBrains DataGrip (a database IDE; its database tooling is also bundled into the PyCharm Professional IDE)

  • Datasette (a tool for exploring and publishing data)

  • Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualisations)

  • SciPy (Python library for advanced calculations)

  • NumPy & Pandas (Python data analyses/manipulation libraries)

  • Scikit-Learn (Python machine learning library)

  • TensorFlow (Python machine learning library generally used for deep learning and neural networks)

  • Keras (Python library for fast experimentation with neural networks)

     

     

Conclude

This stage is concerned with drawing solid, valuable conclusions from the results of the ‘analyse’ stage. This is the phase in which you can formulate clear answers to your questions, and in which you can confirm or reject your hypotheses. It is also the stage in which you can use your conclusions to generate actionable items that aid in the pursuit of the goal (if appropriate).

We usually aim to create a list of conclusions or ‘findings’ that have come out of the analysis, and a subsequent list of recommended actions based on those findings. The actions should be listed with your target audience in mind: they want to know, succinctly, what was found and what they can do with or about it.

 

In this phase we may do any or all of the following: