©2017 Gaussian Engineering

GET IN TOUCH

info@gauseng.com | +27 87 231 0367

5 Otto Cl, Westlake, Cape Town, 7945


A Data Scientific Method

February 19, 2019

How to take a pragmatic and goal-driven approach to data science.

 

The main aim of data science is simple: it is to extract value from data. This value could be of different forms in different contexts — but, usually, it comes in the form of better decisions.

 

 

As we venture ever further into the 21st century, the role that data plays in the decisions we make is becoming ever larger. This is because of the sheer increase in the volume of available data, as well as improvements in the tools we can use to store, process and analyse it.

 

However, in the grander scheme of things, the field of data science is still in its infancy. It is a field that has emerged at the intersection of several other disciplines (statistics, business intelligence, and computer science, to name a few), and as these fields evolve rapidly, data science evolves more rapidly still.

 

Therefore, it is important to formulate a clear approach to making better decisions from data, so that it may be applied methodically to new problems. Sure, the process may start out as ‘firing shots into the dark’, but you at least want your shots to become more accurate over time; you want your process to improve with each project.

 

At Gaussian Engineering we have done a number of projects with the express aim of extracting value from data. This post will attempt to document some of our learnings and formulate a process for doing data science. An inspiration for our approach is the time-tested scientific method…

 

The Scientific Method

The scientific method is a procedure that has characterised the field of natural science since the 17th century; it consists of a series of systematic steps, which ultimately aim to either validate or reject a statement (hypothesis).

 

 

The steps go something like this:

 

1. Observe → Make an observation

2. Question → Ask questions about the observation, gather information

3. Hypothesise → Form a hypothesis (a statement that attempts to explain the observation) and make some predictions based on it

4. Test → Test the hypothesis (and predictions) using a reproducible experiment

5. Conclude → Analyse the results and draw conclusions, thereby accepting or rejecting the hypothesis

6. Redo → The experiment should be reproduced enough times to ensure no inconsistency between observations/results and theory.

 

As an example, imagine that you have just gotten home from school or work; you turn on your bedroom light, and nothing happens! How could we use the Scientific Method to determine the problem?

 

 

 

Science is a methodology for increasing understanding; and the scientific method can be seen as an iterative method to standardise the process of conducting experiments, so that all experiments may produce more valuable, reliable results — and therefore, better understanding.

 

In a similar manner, we would like a standardised methodology for data science; that is, a method which prioritises the obtaining of information that is relevant to the goal of the analysis.

 

 

"If it disagrees with experiment, it's wrong. In that simple statement is the key to science" - Richard P. Feynman

 

 

The Data Scientific Method

 

 

At our organisation, Gaussian Engineering, we have arrived at a method which we feel works well for our projects. Like the scientific method, it is made up of six stages:

 

1. Identify 

2. Understand

3. Process

4. Analyse

5. Conclude

6. Communicate

 

These stages are explained in more detail below, along with some of the tools/methodologies we use during each stage (our team programs in Python and uses various open-source tools, so excuse my bias in this area).

 

Identify

The “identify” stage is concerned with the formulation of the goal of the data science project; it could also be called the “planning” stage.

 

We find it immeasurably helpful to get a very clear sense of what we are trying to achieve through analysing the dataset in question. To borrow a term from the PAS 55 Physical Asset Management Standard, we try to ensure that our team has a ‘clear line of sight’ to the overall objectives of the project.

 

During this stage, we ask questions like:

  • What decisions need to be made from this data?

  • What questions do we wish to answer?

  • Can we formulate hypotheses relating to these questions? What are they?

  • How much time do we have for the exploration?

  • What decisions would the stakeholder like to make from this data?

  • What would the ideal result look like?

  • How are we to export and present the final results?

Some useful tools/methodologies for the ‘identify’ stage:

  • Workshops/brainstorming sessions

  • Creation of a designated space to keep related documents and findings together (SharePoint site, Dropbox folder etc.)

 

Understand

This “understand” stage is all about getting a general feel for the data itself.

Before we start losing ourselves in the details (diving into the various data sources, filtering on various fields, and walking the fine line between ‘value-added work’ and ‘analysis paralysis’), it is useful to ensure that our team has a bigger-picture understanding of what is there.

 

During this stage, we ask questions like:

  • What is the size of the data?

  • How many files are there?

  • To what extent does the data originate from different sources?

  • Does it come from automated exports or manual spreadsheets?

  • Does the data have consistent formats (dates, locations etc.)? 

  • What is the overall data quality, in terms of the six dimensions of data quality (completeness, uniqueness, timeliness, validity, accuracy and consistency)?

  • What is the level of cleaning required?

  • What do the various fields mean?

  • Are there areas in which bias could be an issue?
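Several of the questions above can be answered with a quick profiling pass. The following is a minimal sketch, assuming a small CSV export; the sample data, the field names and the two date formats are hypothetical. It uses only the Python standard library so that it runs anywhere; in practice, Pandas methods such as `DataFrame.isna()` and `DataFrame.duplicated()` would give similar answers in fewer lines.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample export; in practice this would be a file on disk.
SAMPLE_CSV = """asset_id,installed,location,reading
A1,2017-03-01,Cape Town,12.5
A2,01/04/2017,cape town,
A3,2017-05-12,Durban,14.1
A3,2017-05-12,Durban,14.1
"""

# Date formats we expect to encounter (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def detect_date_format(value):
    """Return the first format in DATE_FORMATS that parses value, else None."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def profile(csv_text, date_field):
    """Summarise size, missing values, duplicate rows and date formats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys()) if rows else []
    # Count empty (or absent) values per field.
    missing = {f: sum(1 for r in rows if not (r[f] or "").strip()) for f in fields}
    # Rows that are exact copies of an earlier row.
    duplicates = len(rows) - len({tuple(r.values()) for r in rows})
    formats = Counter(detect_date_format(r[date_field]) for r in rows)
    return {
        "rows": len(rows),
        "fields": fields,
        "missing": missing,
        "duplicate_rows": duplicates,
        "date_formats": dict(formats),
    }


summary = profile(SAMPLE_CSV, "installed")
for key, value in summary.items():
    print(f"{key}: {value}")
```

A summary like this gives an early indication of how much cleaning the ‘process’ stage will need, for example when one date field turns out to use more than one format.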

 

 

 

Understanding the aspects of your data, such as its overall size, can aid you in deciding how to go about your analyses; for smaller data you may wish to do all of your analyses in memory — using tools like Python, Jupyter, and Pandas, or R; for larger data you may be better off moving it into an indexed, SQL database (for larger data still, Hadoop and/or Apache Spark become options). 
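To illustrate the ‘indexed SQL database’ option, here is a small sketch using SQLite through Python's built-in `sqlite3` module. The table, column and index names are hypothetical, and an in-memory database stands in for a real one; the same idea applies to PostgreSQL or MySQL.

```python
import sqlite3

# Hypothetical readings: (asset_id, ISO date, value).
READINGS = [
    ("A1", "2017-03-01", 12.5),
    ("A2", "2017-03-01", 9.8),
    ("A1", "2017-03-02", 13.1),
    ("A2", "2017-03-02", 10.4),
]

# ":memory:" keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (asset_id TEXT, read_on TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", READINGS)

# Index the column we expect to filter and group on most often.
conn.execute("CREATE INDEX idx_readings_asset ON readings (asset_id)")
conn.commit()

# Let the database do the aggregation instead of loading every row into Python.
averages = dict(
    conn.execute(
        "SELECT asset_id, AVG(value) FROM readings "
        "GROUP BY asset_id ORDER BY asset_id"
    )
)
print(averages)
```

The point of the design is that only the aggregated result crosses into Python, which matters once the table no longer fits comfortably in memory.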

 

What is also particularly fun about this stage is that — if you have a clear line of sight to your goal — then, as you gain a better understanding of the data you can determine which aspects of it are most important for the analyses; these are areas in which most of your effort can be directed first. This is especially helpful in projects where there are strict time constraints.

 


 

Some useful tools/methodologies for the ‘understand’ stage:

  • Workshops/brainstorming sessions

  • Python

  • Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualisations)

  • Numpy and Pandas (Python libraries)

  • Matplotlib and Seaborn (Python visualisation libraries that can help with the viewing of missing data)

  • R (a programming language geared towards statistics)

 

Process

This ‘process’ stage is all about getting your data into a state that is ready for analyses.

The words ‘cleaning’, ‘wrangling’ and ‘munging’ come to mind. 

 

A useful idea to keep in mind here is the Pareto Principle, or ‘80/20 Rule’:

 

 “for many events, roughly 80% of the effects come from 20% of the causes” 

- a common statement of the principle named after Vilfredo Pareto

 

 

The ‘process’ stage can often take up the most time. In light of the Pareto Principle, it is important to prioritise which aspects of the data you devote the most time to: focus first on what you think is most important, and come back to secondary fields only if necessary and if there is time to do so.

 

During this stage, we may do any or all of the following:

  • Combine all data into a single, indexed database (we use PostgreSQL)

  • Identify and remove data that is of no relevance to the defined project goal 

  • Identify and remove duplicates

  • Ensure that important data is consistent in terms of format (dates, times, locations)

  • Drop data that is clearly not in line with reality (outliers that are unlikely to be real data)

  • Fix structural errors (typos, inconsistent capitalisation)

  • Handle missing data (NaNs and nulls — either by dropping or interpolation, depending on the scenario)

The purpose of this stage is really to make your life easier during the ‘analyse’ stage; processing data usually takes a long time and can be relatively tedious work, but the results are well worth the effort.
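A few of the processing steps listed above (removing duplicates, making formats consistent, dropping implausible values, handling missing data) can be sketched in plain Python as follows. The records, field names, accepted date formats and plausible range are all hypothetical; what counts as an outlier, and whether missing values should be dropped or interpolated, depends on the dataset at hand. Here missing readings are simply kept as `None`.

```python
from datetime import datetime

# Hypothetical raw records, with the kinds of defects described above.
RAW = [
    {"asset_id": "A1", "installed": "2017-03-01", "location": "Cape Town", "reading": "12.5"},
    {"asset_id": "A1", "installed": "2017-03-01", "location": "Cape Town", "reading": "12.5"},
    {"asset_id": "A2", "installed": "01/04/2017", "location": "cape town ", "reading": ""},
    {"asset_id": "A3", "installed": "2017-05-12", "location": "DURBAN", "reading": "9999"},
    {"asset_id": "A4", "installed": "2017-06-02", "location": "Durban", "reading": "14.1"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats
PLAUSIBLE_RANGE = (0.0, 100.0)           # assumed physical limits for a reading


def parse_date(text):
    """Parse a date in any of the accepted formats; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean(records):
    """Normalise formats, drop exact duplicates and implausible readings."""
    seen, cleaned = set(), []
    low, high = PLAUSIBLE_RANGE
    for rec in records:
        row = {
            "asset_id": rec["asset_id"].strip(),
            "installed": parse_date(rec["installed"]),
            # Fix structural errors: stray whitespace, inconsistent capitalisation.
            "location": rec["location"].strip().title(),
            # Keep missing readings as None rather than guessing a value.
            "reading": float(rec["reading"]) if rec["reading"].strip() else None,
        }
        key = tuple(row.values())
        if key in seen:
            continue  # duplicate of a row we already kept
        seen.add(key)
        if row["reading"] is not None and not (low <= row["reading"] <= high):
            continue  # outside the plausible range, unlikely to be real data
        cleaned.append(row)
    return cleaned


cleaned = clean(RAW)
for row in cleaned:
    print(row)
```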

 

Some useful tools/methodologies for the ‘process’ stage:

  • MySQL, SQLite or PostgreSQL

  • Python

  • Numpy and Pandas (Python libraries)

  • Matplotlib and Seaborn (Python visualisation libraries that can help with the viewing of missing data)

  • NLTK (Natural Language Toolkit, another Python library)

     

Analyse

This stage is concerned with the actual analysis of the data; it is the process of inspecting, exploring and modelling data to find patterns and relationships that were previously unknown.

 

In the data value chain, this stage (along with the previous stage) is where the most significant value is added to the data itself. It is the transformative stage that changes the data into (potentially) usable information.

In this stage you may want to visualise your data quickly, attempting to identify specific relationships between different fields. You may want to explore how fields vary by location, or over time.

Ideally, in the identify stage, you would have come up with several questions relating to what you would like to get out of this data, and perhaps have even stated several hypotheses — this is then the stage where you implement models to confirm or reject these hypotheses.
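As one illustration of confirming or rejecting a hypothesis, the sketch below uses a permutation test, which is just one of many possible techniques and not necessarily the one used on any given project. The hypothesis, the data and the function name are hypothetical: we ask whether the mean of one group of measurements really differs from another, or whether a difference that large could plausibly arise by chance.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: if the groups do not differ, any split is as likely.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements before and after a change in maintenance policy.
before = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
after = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2]

p_value = permutation_test(before, after)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests that the observed difference is unlikely under the hypothesis of ‘no difference’; the threshold for rejecting that hypothesis is a choice to be made up front, ideally in the ‘identify’ stage.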

 

During this stage, we may do any or all of the following:

  • If there is time-based data, explore whether there exist trends in certain fields over time, usually using time-series visualisation software such as Superset or Grafana

  • If there is location-based data, explore the relationships of certain fields by area — usually using mapping software such as Leaflet JS, and spatial querying (we use PostgreSQL with PostGIS)

  • Explore correlations (r values) between different fields

  • Classify text using natural language processing methods (such as the bag of words model)

  • Implement various machine learning techniques in order to identify trends between multiple variables/fields — regression analyses can be useful

  • If there are many variables/fields, dimensionality reduction techniques (like Principal Component Analysis) can be used to reduce these to a smaller set of variables that retains most of the information

  • Deep learning and neural networks have much potential, especially for much larger, structured datasets (though we have not yet made substantial use of this)
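Two of the items above, correlation (r values) and regression, are simple enough to sketch without any libraries. This is an illustrative sketch with hypothetical data and function names; in practice NumPy, SciPy or Scikit-Learn would normally be used instead.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Hypothetical fields: asset age in years and failures per year.
age = [1, 2, 3, 4, 5, 6]
failures = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]

r = pearson_r(age, failures)
slope, intercept = fit_line(age, failures)
print(f"r = {r:.3f}, failures ≈ {slope:.2f} * age + {intercept:.2f}")
```

A strong correlation like the one in this made-up data says nothing about cause on its own; it is a prompt for a hypothesis, not a conclusion.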

The ‘analyse’ stage is really where the rubber meets the road; it also illustrates the more sexy side of data science.
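The bag-of-words model mentioned in the list above treats a piece of text as an unordered count of its words. The sketch below shows the idea with a deliberately crude classifier that scores each label by word overlap; the maintenance notes, labels and function names are hypothetical. A real project would more likely pair a vectoriser such as Scikit-Learn's `CountVectorizer` with a proper classifier.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count each word, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical labelled maintenance notes.
TRAINING = [
    ("pump bearing noisy replaced bearing", "mechanical"),
    ("motor shaft misaligned vibration", "mechanical"),
    ("breaker tripped cable fault", "electrical"),
    ("cable insulation damaged breaker replaced", "electrical"),
]


def train(examples):
    """Sum the word counts of every example, per label."""
    totals = {}
    for text, label in examples:
        totals.setdefault(label, Counter()).update(bag_of_words(text))
    return totals


def classify(text, totals):
    """Pick the label whose accumulated vocabulary overlaps most with the text."""
    words = bag_of_words(text)
    scores = {
        label: sum(count * bag[word] for word, count in words.items())
        for label, bag in totals.items()
    }
    return max(scores, key=scores.get)


totals = train(TRAINING)
print(classify("bearing vibration on pump", totals))
print(classify("cable fault near breaker", totals))
```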

 

Some useful tools/methodologies for the ‘analyse’ stage:

 

  • MySQL, SQLite or PostgreSQL (for querying, including spatial querying; for SQLite, see SpatiaLite)

  • JetBrains DataGrip (a database IDE; its database tools are also included in PyCharm Professional)

  • Datasette (a tool for exploring and publishing data)

  • Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualisations)

  • SciPy (Python library for advanced calculations)

  • NumPy & Pandas (Python data analyses/manipulation libraries)

  • Scikit-Learn (Python machine learning library)

  • TensorFlow (Python machine learning library generally used for deep learning and neural networks)

  • Keras (Python library for fast experimentation with neural networks)

     

     

Conclude

This stage is concerned with drawing solid, valuable conclusions from the results of the ‘analyse’ stage. This is the phase in which you can formulate clear answers to your questions and either accept or reject your hypotheses. It is also the stage in which you can use your conclusions to generate actionable items to aid in the pursuit of the goal (if appropriate).

We usually aim to create a list of conclusions or ‘findings’ that have come out of the analyses and a subsequent list of recommended actions based on these findings. The actions should be listed with your target audience in mind: they want to know succinctly what was found and what they can do with/about it.

 

In this phase we may do any or all of the following:

  • Cross check findings with original questions (‘identify’ phase) and determine what we have answered

  • Reject or accept the various hypotheses from the ‘identify’ phase

  • Prioritise conclusions/findings: which are most significant, and which are most important to communicate to stakeholders?

  • Attempt to weave conclusions together into some form of story

  • Identify follow up questions

  • Identify high-priority areas in which action will yield the most valuable results (Pareto Principle)

  • Develop recommendations and/or actions based on conclusions (especially in high-priority areas)
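The Pareto-style prioritisation mentioned in the list above can be sketched as a simple ranking: sort the causes by impact and keep adding them until they account for a chosen share of the total. The causes, figures, function name and the 80% threshold are hypothetical.

```python
def vital_few(impact_by_cause, threshold=0.8):
    """Return the top causes that together reach `threshold` of total impact."""
    total = sum(impact_by_cause.values())
    ranked = sorted(impact_by_cause.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for cause, impact in ranked:
        if running >= threshold * total:
            break  # the causes selected so far already cover the threshold
        selected.append(cause)
        running += impact
    return selected


# Hypothetical downtime hours attributed to each cause.
DOWNTIME_HOURS = {
    "bearing failure": 450,
    "cable fault": 330,
    "operator error": 70,
    "sensor drift": 50,
    "software bug": 30,
    "other": 20,
}

priorities = vital_few(DOWNTIME_HOURS)
print(priorities)
```

In this made-up example two of the six causes account for more than 80% of the downtime, so recommendations would focus there first.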

Some useful tools/methodologies for the ‘conclude’ stage:

  • Workshops/brainstorming sessions

  • Microsoft Office (Excel, PowerPoint, Word)

 

Communicate

Arguably the most important step in the Data Scientific Method is the ‘communicate’ phase; this is the phase in which you ensure that your client/audience/stakeholders understand the conclusions that you have drawn from their data.

 

They should also be presented with these in such a way that they can act on them; if you do not recommend actions, the conclusions should be presented so as to stimulate ideas for action.

 

This is the phase in which you package your findings and conclusions in beautiful, easy-to-understand visualisations, presentations, reports and/or applications.

 

In this phase we may do any or all of the following:

  • If there is time-based data, create sexy time-series visualisations using packages like Grafana or Superset

  • If there is spatial data, create sexy map visualisations using packages like Leaflet JS, Plotly or Superset

  • Create statistical plots using D3.js, Matplotlib or Seaborn

  • Embed various visualisations into dashboards, and ensure these are shareable/portable (whether hosted or built as an application) — Superset is a great way to do this within an organisation 

  • Develop interactive visualisations using D3.js or Plotly

  • Develop interactive applications or SPAs (Single Page Applications) using web technologies such as Angular, Vue.js or React (or just vanilla JavaScript!), and link these up to the data through a Python back end using libraries such as Psycopg2 for PostgreSQL 
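Packaging findings for an audience can be as simple as rendering them into a small HTML report. The sketch below uses `string.Template` from the Python standard library as a stand-in for a fuller templating library such as Jinja2; the template, function name and findings are hypothetical.

```python
from html import escape
from string import Template

# Minimal page layout; $title and $items are filled in by render_report.
PAGE = Template("""<html>
  <body>
    <h1>$title</h1>
    <ol>
$items
    </ol>
  </body>
</html>""")


def render_report(title, findings):
    """Render (finding, recommended action) pairs as an ordered HTML list."""
    items = "\n".join(
        f"      <li><strong>{escape(finding)}</strong>: {escape(action)}</li>"
        for finding, action in findings
    )
    return PAGE.substitute(title=escape(title), items=items)


# Hypothetical findings, each paired with a recommended action.
FINDINGS = [
    ("Bearing failures cause most downtime", "Review the lubrication schedule"),
    ("Readings are missing for site A2", "Check the sensor & data export"),
]

report_html = render_report("Findings and recommended actions", FINDINGS)
print(report_html)
```

Listing each finding next to its recommended action keeps the report focused on what the audience can do with the results.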

Some useful tools/methodologies for the ‘communicate’ stage:

  • Grafana (for time-series) 

  • Apache Superset (exploration and visualisation platform; allows for the creation of shareable dashboards; great for a variety of data sources, including SQL databases)

  • Matplotlib, Seaborn, and Bokeh (Python visualisation libraries — Seaborn is more for statistical visualisation, and is built on top of Matplotlib)

  • D3.js (A JavaScript library that directly links HTML to data, allowing for beautiful, interactive, and highly customisable in-browser visualisations)

  • Leaflet.js (A JavaScript library for creating interactive maps)

  • Plotly, Altair, and Pygal (Python libraries for interactive visualisations)

  • Jinja2 (Python templating library for generating HTML, similar to Django templates)

  • Psycopg2 (PostgreSQL driver to facilitate database connections through Python)

  • Angular, Vue.js and React (SPA libraries/JavaScript frameworks)

  • Microsoft Office (Excel, Word, and PowerPoint) — for reporting

Information Should Result In Action 

Now it is all very well to go through the process as stated thus far; after all, it should result in some sound information. Really though, to realise any benefits of this data, something should be done with the information obtained from it!

 

Like the Scientific Method, ours is an iterative process, which should incorporate action…

 

So, to modify our diagram slightly:

 

 

Oftentimes we may also go through the six stages, and there won’t be time for action before we iterate once more. We may communicate findings that immediately incite further questions — and we may then dive right into another cycle. Over the long term, though, action is essential for making the entire exercise a valuable one. 

 

In our organisation, each new data science project consists of several of these cycles. Communication of results often sparks new discussions and opens up new questions and avenues for exploration. If our conclusions result in actions which yield favourable results, we know we are doing something right.

 

 

"Without data, you’re just another person with an opinion."

- W. Edwards Deming

 

This is the original version of this article; a version is also published on Medium. Thanks to Jaco Du Plessis for putting together the original steps for the Data Scientific Method (available on his GitHub).