# Thought Leader Thursday: What’s in my Designs’ Future? Seven Ways to Read your Data to Learn It

Recently, when a customer asked me whether I could review their design exploration data to see how robust their design would be, I felt as if I had been asked to be a tassologist for data. In Turkey, after drinking coffee from small ornamental cups, people turn the cups upside down and wait a few minutes for the coffee grounds to form patterns inside. Then they pass the cup to a friend-turned-tassologist, who reads their future from these patterns. Unlike tassology, understanding patterns in design exploration data is a mathematical process, but conveying those patterns to an audience requires storytelling skills, just as reading patterns in coffee cups does.

The data in question consists of 12,971 runs with 6 input variables and 18 output variables (referred to as Key Performance Indices, or KPIs, in the rest of this article). Below are seven things we learned while reading the patterns in the data.

1. Review the Health and Quality of Data before Making Conclusions from it

Once I have the data, I too am eager to draw conclusions from it and suggest design improvements. However, we have to resist that urge and first review the health and quality of the data. Many data issues raise a red flag, but some may not.

At first glance, nothing is wrong with this data’s health, as the “No Values” and “Bad Values” columns are all zero. Those are the columns that would point to an issue with values in the data. However, a careful look shows that the ranges for KPIs 10, 11, 12, and 13 are on the order of 10⁻⁶. Such small numbers raise a question about the validity of this data. Could these KPIs be measured to that accuracy, or are these values just noise, or is there a mistake in data entry? If the data are valid, should they go through a treatment such as a logarithmic transformation before analysis? Any of these is possible, depending on the quantity measured and how it is measured.
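The health checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for the article: the function names, the sample values, and the `1e-5` range threshold are all assumptions made for the example.

```python
import math

def health_summary(column):
    """Summarize one column: missing count, bad (NaN/inf) count, and value range."""
    missing = sum(1 for v in column if v is None)
    present = [v for v in column if v is not None]
    bad = sum(1 for v in present if math.isnan(v) or math.isinf(v))
    good = [v for v in present if math.isfinite(v)]
    value_range = max(good) - min(good) if good else float("nan")
    return {"no_values": missing, "bad_values": bad, "range": value_range}

def needs_scale_review(summary, threshold=1e-5):
    """Flag a column whose range is suspiciously small.

    The threshold is an illustrative assumption; pick one that matches
    the accuracy of your own measurements.
    """
    return summary["range"] < threshold

# Hypothetical sample values, not taken from the customer data set.
kpi_10 = [1.2e-6, 3.4e-6, 2.2e-6, 4.0e-6]

summary = health_summary(kpi_10)
flagged = needs_scale_review(summary)

# One possible treatment if the values turn out to be valid: a log10 transform.
log_values = [math.log10(v) for v in kpi_10 if v > 0]
```

A column can pass the missing-value and bad-value checks and still be flagged here, which is exactly the situation described for KPIs 10 to 13.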

In the quality table, we see that there are up to 511 outliers. This corresponds to about 4% of the designs, which is not a large percentage, but the outliers should be reviewed next to see why and when they occur.

2. Review the Box Plots for Outliers

From here on, we will focus on KPIs 3, 4, 14 and 17.

In the image below, we can see the outliers in two of the KPIs, 4 and 14, indicated with red dots. We can find the outlier designs and investigate them. Outliers have values that differ significantly from the rest of the population. In this case, all outliers have much larger values than the rest, which is undesirable because lower values are preferred for these KPIs. This takes us to the histograms.
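To make the idea of an outlier concrete, here is a small sketch that flags points outside the box plot whiskers. It assumes the common 1.5 × IQR whisker rule, which may differ from the rule used by the tool that produced the plots. The sample values and function names are hypothetical.

```python
def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between sorted data points."""
    data = sorted(values)

    def percentile(p):
        pos = p * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    return percentile(0.25), percentile(0.75)

def iqr_outliers(values, whisker=1.5):
    """Return (index, value) pairs lying beyond the box plot whiskers."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical KPI values for eight designs; the last one is far above the rest.
kpi_4 = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 9.5]

outliers = iqr_outliers(kpi_4)   # [(7, 9.5)]
outlier_share = len(outliers) / len(kpi_4)

# The share quoted in the article: 511 outliers out of 12,971 runs.
article_share = 511 / 12971      # roughly 0.039, i.e. about 4%
```

Keeping the design index next to each flagged value makes it easy to go back to the corresponding run and investigate its inputs.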

3. Review the Histograms for Distributions

Using histograms, we can see whether the data distribution points to any potential problems in fulfilling design requirements. For example, are lower or higher KPI values preferred? Should the distribution be flat for reliability? Is it bimodal, which may point to different failure mechanisms?

In this case, for all KPIs, the lower the value, the better the performance. The histograms below are promising, as most design values are in the lower ranges. Next, we need to study the relations between the KPIs to make sure the lower values occur for the same designs.
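The statement that most design values are in the lower ranges can be checked numerically by binning the data. The sketch below uses equal-width bins, and the sample values, the bin count, and the function name are illustrative assumptions.

```python
def histogram(values, bins=5):
    """Count values in equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)   # the maximum goes in the last bin
        counts[idx] += 1
    return counts

# Hypothetical KPI values for ten designs (lower is better).
kpi = [0.0, 0.5, 1.0, 1.5, 1.8, 2.5, 3.0, 4.5, 7.0, 10.0]

counts = histogram(kpi, bins=5)                 # [5, 2, 1, 1, 1]
lower_share = sum(counts[:2]) / len(kpi)        # share of designs in the two lowest bins

# A quick text rendering of the histogram.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

A single number such as `lower_share` does not replace the plot, since it would hide a bimodal shape, but it gives a quick way to compare KPIs.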

4. Parallel Coordinate Plots

Parallel coordinate plots are good at identifying patterns in design data. When all inputs and outputs are plotted, they may be overwhelming, but you can isolate variables and responses to get a clear picture of the patterns between them.

In this case, we are looking at high values for KPIs 14 and 17. It seems that when KPI 14 values are high, KPI 4 values are low. Similarly, when KPI 17 values are high, KPI 3 values are low. This is not desirable, as lower is better for all KPIs. Next, we will look into correlations to get a detailed view of the relations between these output responses and to see whether there are other similar relations.

5. Correlations

Using the parallel coordinate plots, we identified a high inverse correlation between KPIs 4 and 14. However, the correlation coefficient for these two output responses is only -0.28. A high inverse correlation coefficient should be closer to -1. So we may wonder why the correlation coefficient is low despite our previous observation.

In the scatter plot below, we can observe that the correlation between KPIs 4 and 14 seen in the parallel coordinate plots exists only for very high values of KPI 14. That is, for all very high values of KPI 14, KPI 4 has low values, hence there are no data points in the upper right corner of the plot. Because the strong relationship is local, the overall correlation coefficient for these two output responses is low.
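A small synthetic example shows how a relationship that holds only in one region of the data produces a modest overall coefficient. The points below are invented for illustration and are not the customer data: a cloud with no trend at low KPI 14 values, plus a few designs with very high KPI 14 and low KPI 4. The Pearson coefficient is computed from its textbook definition.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (KPI 14, KPI 4) pairs.
cloud = [(x, y) for x in (0, 1) for y in (0, 2, 4, 6, 8)]   # no trend at low KPI 14
tail = [(8, 0), (9, 1), (10, 0)]                            # very high KPI 14, low KPI 4
points = cloud + tail

kpi_14 = [p[0] for p in points]
kpi_4 = [p[1] for p in points]

r_all = pearson(kpi_14, kpi_4)                                   # about -0.52
r_cloud = pearson([p[0] for p in cloud], [p[1] for p in cloud])  # 0 within the cloud

# "Brushing" the high KPI 14 designs, as one would on a parallel coordinate plot.
max_kpi_4_in_tail = max(y for x, y in points if x >= 8)
```

The brushed subset shows the pattern clearly (every high KPI 14 design has a low KPI 4), yet the coefficient over all points stays far from -1 because most designs sit in the trendless cloud.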

The correlation coefficient between KPIs 2 and 3 is 0.99, which is close to a perfect 1. This means that they have a high positive correlation. We can also observe this in the shape of the scatter plot for these two output responses.

KPI 17 is negatively correlated with many responses, including KPI 3.

6. Pareto Plots

To resolve possible issues with these high inverse correlations, we next need to learn how the responses depend on the design variables. We can use Pareto plots for that purpose. In these plots, we see that more than 80% of the variation in KPIs 3, 4, and 17 is due to design variables 4 and 1. KPI 14 is significantly affected by design variables 2 and 5. This is good news, as we can fine-tune KPI 14 without significantly changing KPI 4.
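The reading of a Pareto plot can be sketched as ranking the design variables by contribution and accumulating their shares until a cutoff is reached. The contribution numbers below are hypothetical, chosen only to mirror the pattern described above. How the contributions themselves are estimated (for example, from a regression or a variance decomposition) is outside this sketch.

```python
def pareto(effects, cutoff=0.80):
    """Rank variables by absolute effect and return (rows, dominant variables).

    rows holds (name, share, cumulative share) in descending order of share;
    dominant lists the top variables needed to reach the cutoff.
    """
    total = sum(abs(e) for e in effects.values())
    ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
    rows, cumulative = [], 0.0
    for name, effect in ranked:
        share = abs(effect) / total
        cumulative += share
        rows.append((name, share, cumulative))
    dominant = []
    for name, _, cum in rows:
        dominant.append(name)
        if cum >= cutoff:
            break
    return rows, dominant

# Hypothetical contributions of the six design variables to two KPIs.
kpi_4_effects = {"var1": 30, "var2": 5, "var3": 4, "var4": 55, "var5": 3, "var6": 3}
kpi_14_effects = {"var1": 4, "var2": 50, "var3": 3, "var4": 5, "var5": 35, "var6": 3}

_, kpi_4_drivers = pareto(kpi_4_effects)     # ['var4', 'var1']
_, kpi_14_drivers = pareto(kpi_14_effects)   # ['var2', 'var5']

# No shared dominant variables suggests the two KPIs can be tuned separately.
shared_drivers = set(kpi_4_drivers) & set(kpi_14_drivers)
```

An empty `shared_drivers` set is the numerical counterpart of the good news above: the variables that dominate KPI 14 are not the ones that dominate KPI 4.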