# Predictive Modeling in Tableau

Predictive modeling functions use linear regression for building predictive models and generate predictions about your data. Two table calculation functions, MODEL_PERCENTILE and MODEL_QUANTILE, can develop predictions and surface relationships with your data. These can also be used to identify outliers, estimate values for sparse or missing data, and predict future periods.

With the help of predictive modeling functions, you can select targets and predictors by updating the variables and visualizing multiple models with different predictors. The data thus can be filtered, aggregated, and transformed at any level of detail, and thus the prediction will automatically recalculate to match your data.

**What is MODEL_QUANTILE?**

It calculates the posterior predictive quantile or the expected value at a specified quantile.

- Quantile: This is the first argument, and it contains a number between 0 and 1, indicating what quantile should be predicted. For instance, 0.5 specifies that the median will be predicted.
- Target expression: This is the second argument, and it is the measure to predict or target.
- Predictor expression(s): This is the third argument, and it is the predictor used to make the prediction. Predictors can be dimensions, measures, or both.

The result is always a number within the probable range.

You can also use MODEL_QUANTILE to generate a confidence interval, missing values like future dates, or to generate categories that do not exist in your underlying data set.

**What is MODEL_PERCENTILE?**

It calculates the posterior predictive distribution function, also known as the Cumulative Distribution Function (CDF). This is used to calculate the quantile of a particular value between 0 and 1, the inverse of MODEL_QUANTILE.

- Target expression: This is the first argument, and it is the measure to target, identifying which values to assess.
- Predictor(s): This is the second argument, and it is the predictor used to make the prediction.

Additional arguments are optional and are only included to control the prediction.

Notice that the calculation syntax is similar to the MODEL_QUANTILE having the extra argument of a defined quantile.

The result is always the probability of the expected value being less than or equal to the observed value expressed in the mark.

You can also use MODEL_PERCENTILE to surface correlations and relationships within your database. If MODEL_PERCENTILE returns a value that is close to 0.5, the observed mark is near to the median of the range of predicted values. If MODEL_PERCENTILE returns a value that is close to 0 or to 1, the observed mark is near to the lower or upper range of what the model expects.

**How to build your predictive modeling function?**

- Open the Analysis menu, and then select Create Calculated Field.
- In the Calculation Editor, name the calculation and perform the following:

- Name the calculation as:

Quantile_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region

- Enter the below formula:

MODEL_QUANTILE(0.5,MEDIAN([Life Expectancy Female]),

LOG(MEDIAN([Health Exp/Capita])),

MEDIAN([Birth Rate]),

MEDIAN([Infant Mortality Rate]),

ATTR([Region]))

The above calculation returns the median value (0.5) of the range of modeled median female life expectancies, based on the predictors selected: Health Expenditure, Birth Rate, Infant Mortality, and Region.

Next, let us build a scatter plot that shows both the actual median female life expectancy and the predicted median female life expectancy:

Let us do another pass to figure out where the predictions are furthest off the mark. Create another calculation named Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region, as follows:

MEDIAN([Life Expectancy Female]) – [Quantile_LifeExpFemale_HealthExpend,BirthRate,Mortality,Region]

The above residual calculation returns the difference between the predicted median and the actual median, helping us see the countries with the biggest discrepancy between the actual and predicted median female life expectancy.

Next, let us apply this residual calculation to Color:

In the above viz, most countries, across most regions, have low discrepancies between the predictions and actual values. Africa is the region having the largest number of countries with significant discrepancies, but let us do one more round to observe what kind of difference we are looking for.

You can see that the difference range between -17 and +9, so let us divide the viz into groups where there is less than ±3 years difference, less than ±5 years difference, less than ±10 years difference, and more than ±10 years difference.

Create one more calculation field named Grouped_Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region, as follows:

IF [Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region] <= 3

AND [Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region] >= -3

THEN “±3”

ELSEIF [Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region] <= 5

AND [Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region] >= -5

THEN “±5”

ELSEIF [Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region] <= 10

AND [Residual_LifeExpFemale_HealthExpend, BirthRate, Mortality, Region] >= -10

THEN “±10”

ELSE

“> ±10”

END

Again, let us add the calculation to Color:

You will notice that the vast majority of predictions are incorrect by less than three years and that only a small are off by more than ten years.

**Predictive Modeling Functions in Time Series Visualizations**

It describes the steps required to extend a time axis into the future so that you can make predictions for future dates using predictive modeling functions.

Step 1: Create a visualization

- In Tableau Desktop, connect to the Sample – Superstore saved data source, which comes with Tableau.
- Navigate to Sheet 1 worksheet.
- From the Data pane, drag the Order Date dimension to the Columns shelf.
- Open the measure’s context menu to change its list level to Month and Year:

- Drag Profit to the Rows shelf.

Step 2: Create a prediction calculation

- Click to open the Analysis menu at the top, and then select Create Calculated Field.
- In the Calculation Editor, do the following:

- Name the calculation:

Quantile Predict Median Profit.

- Enter the following formula:

MODEL_QUANTILE(0.5, SUM([Profit]),ATTR(DATETRUNC(‘month’, [Order Date])))

The MODEL_QUANTILE function takes a given quantile and predicts values based on the predictors you input. In this case, the quantile = 0.5, which predicts the median. We want to predict profit, so the target expression is SUM([Profit]). And since we want to base the prediction on past performance, we need to include the date as a predictor.

- When finished, click on OK.

The above prediction calculation is now added as a calculated field in the Data pane.

Step 3: Add the prediction calculation to the view

- Drag the prediction calculation to the Rows shelf, to the right of SUM(Profit).
- Right-click the measure and select Dual Axis.
- For aligning the two axes in a dual axes chart to use the same scale, right-click the secondary axis, in this case, Quantile Predict Median Profit, and select Synchronize Axis. This aligns with the scale of the two axes.

Step 4: Extend the date axis with a calculated field

The next step now is the creation of a calculation that moves the last date point into the future.

- Click to open the Analysis menu at the top, and then select Create Calculated Field.
- In the Calculation Editor, do the following:

- Name the calculation: Future Date.
- Enter the following LOD expression:

IF [Order Date] = {FIXED : MAX([Order Date])} THEN DATEADD(‘month’, 12, [Order Date]) ELSE [Order Date] END

Here’s the expression is broken down as:

The DATE wrapper ensures that we end up with a date field.

The [Order Date] = {FIXED: MAX([Order Date])} clause tests each value of [Order Date] to see if it’s the same as the maximum value—that is, the last point. FIXED is a level of detailed expression that allows us to compare between the unaggregated [Order Date] on the left and the aggregated MAX([Order Date]) on the right. On a different viz, you might want to put dimensions after the FIXED and before the colon to fix certain dimensions without referencing the detail level. The unit of time and the number of those units can also be parameterized to be changed flexibly.

Now, replace the [Order Date] measure on the Columns shelf with the new calculated field we just created. Be sure to change its list level to Month and Year, as before.

Step 5: Exclude the last data point

Since we don’t know the value of SUM([Profit]) in the future, let’s create another calculated field using the same logic as above, but with the last point replaced with NULL. This will allow the prediction calculation to compute the last point instead of merely carrying the existing data forward.

- Click to open the Analysis menu at the top, and then select Create Calculated Field.
- In the Calculation Editor, do the following:

- Name the calculation: Predict Future Profit.
- Enter the following LOD expression:

IF [Order Date] = {FIXED : MAX([Order Date])} THEN NULL ELSE [Profit] END

- Now, replace the SUM(Profit) measure on the Columns shelf with the new calculated field, Predict Future Profit, we just created.
- As before, right-click the measure and select Dual Axis.
- And again, right-click the secondary axis and select Synchronize Axis to align the two axes’ scale.

Step 6: Update the prediction calculation

Next, let’s update the prediction calculation with these changes to compute the predicted profit on future dates.

- Open the Quantile Predict Median Profit calculation we created earlier. In the Data pane, right-click the calculated field and select Edit.
- Update the target expression and predictor expression in the formula as follows:

MODEL_QUANTILE(0.5, SUM([Profit for Prediction]),ATTR(DATETRUNC(‘month’, [Future Date])))

Step 7: Densify the data

In this example, we still need to tell Tableau to add predictions for future dates. This is known as data densification, which occurs when Tableau creates additional marks in the view to compensate for missing values. Data is not added to the underlying data source; marks are generated and added to the view.

- Right-click MONTH(Future Date) in the Columns shelf and select Show Missing Values.
- Open the Analysis menu and select Infer Properties from Missing Values. This ensures that our prediction calculations will work on the missing values.

Now, the visualization includes predictions for future dates.