How Data Science Can Benefit Nonprofits

Image Source: https://pixabay.com/vectors/pixel-cells-pixel-creative-commons-3704068/

Data science is the poster child of the 21st century and for good reason. Data-based decisions have streamlined, automated, and made businesses more efficient than ever before, and there are practically no industries that haven’t recognized its immense potential. But when you think of data science application, sectors like marketing, finance, technology, SMEs, and even education are the first that come to mind. There’s one more sector that’s proving to be an untapped market for data—the social sector. At first, one might question why non-profit organizations even need complex data applications, but that’s just it—they don’t. What they really need is data tools that are simple and reliable, because if anything, accountability is the most important component of the way non-profits run.

Challenges for Non-profits and Data Science

If you’re wondering why many non-profits haven’t already hopped onto the data bandwagon, its because in most cases they lack one big thing—quality data.

One reason is that effective data application requires clean data, and heaps of it, something non-profits struggle with. Most don’t sell products or services, and their success is reliant on broad, long-term (sometimes decades) results and changes, which means their outcomes are highly unmeasurable. Metrics and data seem out of place when appealing to donors, who are persuaded more by emotional campaigns. Data collection is also rare, perhaps only being recorded when someone signs up to the program or leaves, and hardly any tracking in between. The result is data that’s too little and unreliable to make effective change.

Perhaps the most important phase, data collection relies heavily on accurate and organized processes. For non-profits that don’t have the resources for accurate and manual record-keeping, clean, and quality data collection is a huge pain point. However, that is an issue now easily avoidable. For instance, avoiding duplicate files, adopting record-keeping methods like off-site and cloud storage, digital retention, and of course back-up plans—are all processes that could save non-profits time, effort, and risk. On the other hand, poor record management has its consequences, namely on things like fund allocation, payroll, budgeting, and taxes. It could lead to financial risk, legal trouble, and data loss — all added worries for already under-resourced non-profit organizations.

But now, as non-governmental organizations (NGOs) and non-profits catch up and invest more in data collection processes, there’s room for data science to make its impact. A growing global movement, ‘Data For Good’ represents individuals, companies, and organizations volunteering to create or use data to help further social causes ad support non-profit organizations. This ‘Data For Good’ movement includes tools for data work that are donated or subsidized, as well as educational programs that serve marginalized communities. As the movement gains momentum, non-profits are seeing data seep into their structures and turn processes around.

How Can Data Do Social Good?

With data science set to take the non-profit sector by storm, let’s look at some of the ways data can do social good:

  1. Improving communication with donors: Knowing when to reach out to your donors is key. In between a meeting? You’re unlikely to see much enthusiasm. Once they’re at home with their families? You may see wonderful results, as pointed out in this Forbes article. The article opines that data can help non-profits understand and communicate with their donors better.
  2. Donor targetting: Cold calls are a hit and miss, and with data on their side, non-profits can discover and define their ideal donor and adapt their messaging to reach out to them for better results.
  3. Improving cost efficiency: Costs are a major priority for non-profits and every penny counts. Data can help decrease costs and streamline financial planning
  4. Increasing new member sign-ups and renewals: Through data, non-profits can reach out to the right people they want on-board, strengthen recruitment processes and keep track of volunteers reaching out to them for future events or recruitment drives.
  5. Modeling and forecasting performance: With predictive modeling tools, non-profits can make data-based decisions on where they should allocate time and money for the future, rather than go on gut instinct.
  6. Measuring return on investment: For a long time, the outcomes of social campaigns have been perceived as intangible and immeasurable—it’s hard to measure empowerment or change. With data, non-profits can measure everything from the amount a fundraiser raised against a goal, the cost of every lead in a lead generation campaign, etc
  7. Streamlining operations: Finally, non-profits can use data tools to streamline their business processes internally and invest their efforts into resources that need it.

It’s true, measuring good and having social change down to a science is a long way off — but data application is a leap forward into a more efficient future for the social sector. With mission-aligned processes, data-driven non-profits can realize their potential, redirect their focus from trivial tasks, and onto the bigger picture to drive true change.

Introduction to Recommendation Engines

This is the second article of article series Getting started with the top eCommerce use cases. If you are interested in reading the first article you can find it here.

What are Recommendation Engines?

Recommendation engines are the automated systems which helps select out similar things whenever a user selects something online. Be it Netflix, Amazon, Spotify, Facebook or YouTube etc. All of these companies are now using some sort of recommendation engine to improve their user experience. A recommendation engine not only helps to predict if a user prefers an item or not but also helps to increase sales, ,helps to understand customer behavior, increase number of registered users and helps a user to do better time management. For instance Netflix will suggest what movie you would want to watch or Amazon will suggest what kind of other products you might want to buy. All the mentioned platforms operates using the same basic algorithm in the background and in this article we are going to discuss the idea behind it.

What are the techniques?

There are two fundamental algorithms that comes into play when there’s a need to generate recommendations. In next section these techniques are discussed in detail.

Content-Based Filtering

The idea behind content based filtering is to analyse a set of features which will provide a similarity between items themselves i.e. between two movies, two products or two songs etc. These set of features once compared gives a similarity score at the end which can be used as a reference for the recommendations.

There are several steps involved to get to this similarity score and the first step is to construct a profile for each item by representing some of the important features of that item. In other terms, this steps requires to define a set of characteristics that are discovered easily. For instance, consider that there’s an article which a user has already read and once you know that this user likes this article you may want to show him recommendations of similar articles. Now, using content based filtering technique you could find the similar articles. The easiest way to do that is to set some features for this article like publisher, genre, author etc. Based on these features similar articles can be recommended to the user (as illustrated in Figure 1). There are three main similarity measures one could use to find the similar articles mentioned below.

 

Figure 1: Content-Based Filtering

 

 

Minkowski distance

Minkowski distance between two variables can be calculated as:

(x,y)= (\sum_{i=1}^{n}{|X_{i} - Y_{i}|^{p}})^{1/p}

 

Cosine Similarity

Cosine similarity between two variables can be calculated as :

  \mbox{Cosine Similarity} = \frac{\sum_{i=1}^{n}{x_{i} y_{i}}} {\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}} \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}} \

 

Jaccard Similarity

 

  J(X,Y) = |X ∩ Y| / |X ∪ Y|

 

These measures can be used to create a matrix which will give you the similarity between each movie and then a function can be defined to return the top 10 similar articles.

 

Collaborative filtering

This filtering method focuses on finding how similar two users or two products are by analyzing user behavior or preferences rather than focusing on the content of the items. For instance consider that there are three users A,B and C.  We want to recommend some movies to user A, our first approach would be to find similar users and compare which movies user A has not yet watched and recommend those movies to user A.  This approach where we try to find similar users is called as User-User Collaborative Filtering.  

The other approach that could be used here is when you try to find similar movies based on the ratings given by others, this type is called as Item-Item Collaborative Filtering. The research shows that item-item collaborative filtering works better than user-user collaborative filtering as user behavior is really dynamic and changes over time. Also, there are a lot more users and increasing everyday but on the other side item characteristics remains the same. To calculate the similarities we can use Cosine distance.

 

Figure 2: Collaborative Filtering

 

Recently some companies have started to take advantage of both content based and collaborative filtering techniques to make a hybrid recommendation engine. The results from both models are combined into one hybrid model which provides more accurate recommendations. Five steps are involved to make a recommendation engine work which are collection of data, storing of data, analyzing the data, filtering the data and providing recommendations. There are a lot of attributes that are involved in order to collect user data including browsing history, page views, search logs, order history, marketing channel touch points etc. which requires a strong data architecture.  The collection of data is pretty straightforward but it can be overwhelming to analyze this amount of data. Storing this data could get tricky on the other hand as you need a scalable database for this kind of data. With the rise of graph databases this area is also improving for many use cases including recommendation engines. Graph databases like Neo4j can also help to analyze and find similar users and relationship among them. Analyzing the data can be carried in different ways, depending on how strong and scalable your architecture you can run real time, batch or near real time analysis. The fourth step involves the filtering of the data and here you can use any of the above mentioned approach to find similarities to finally provide the recommendations.

Having a good recommendation engine can be time consuming initially but it is definitely beneficial in the longer run. It not only helps to generate revenue but also helps to to improve your product catalog and customer service.

Multi-touch attribution: A data-driven approach

Customers shopping behavior has changed drastically when it comes to online shopping, as nowadays, customer likes to do a thorough market research about a product before making a purchase.

What is Multi-touch attribution?

This makes it really hard for marketers to correctly determine the contribution for each marketing channel to which a customer was exposed to. The path a customer takes from his first search to the purchase is known as a Customer Journey and this path consists of multiple marketing channels or touchpoints. Therefore, it is highly important to distribute the budget between these channels to maximize return. This problem is known as multi-touch attribution problem and the right attribution model helps to steer the marketing budget efficiently. Multi-touch attribution problem is well known among marketers. You might be thinking that if this is a well known problem then there must be an algorithm out there to deal with this. Well, there are some traditional models  but every model has its own limitation which will be discussed in the next section.

Types of attribution models

Most of the eCommerce companies have a performance marketing department to make sure that the marketing budget is spent in an agile way. There are multiple heuristics attribution models pre-existing in google analytics however there are several issues with each one of them. These models are:

Traditional attribution models

First touch attribution model

100% credit is given to the first channel as it is considered that the first marketing channel was responsible for the purchase.

Figure 1: First touch attribution model

Last touch attribution model

100% credit is given to the last channel as it is considered that the first marketing channel was responsible for the purchase.

Figure 2: Last touch attribution model

Linear-touch attribution model

In this attribution model, equal credit is given to all the marketing channels present in customer journey as it is considered that each channel is equally responsible for the purchase.

Figure 3: Linear attribution model

U-shaped or Bath tub attribution model

This is most common in eCommerce companies, this model assigns 40% to first and last touch and 20% is equally distributed among the rest.

Figure 4: Bathtub or U-shape attribution model

Data driven attribution models

Traditional attribution models follows somewhat a naive approach to assign credit to one or all the marketing channels involved. As it is not so easy for all the companies to take one of these models and implement it. There are a lot of challenges that comes with multi-touch attribution problem like customer journey duration, overestimation of branded channels, vouchers and cross-platform issue, etc.

Switching from traditional models to data-driven models gives us more flexibility and more insights as the major part here is defining some rules to prepare the data that fits your business. These rules can be defined by performing an ad hoc analysis of customer journeys. In the next section, I will discuss about Markov chain concept as an attribution model.

Markov chains

Markov chains concepts revolves around probability. For attribution problem, every customer journey can be seen as a chain(set of marketing channels) which will compute a markov graph as illustrated in figure 5. Every channel here is represented as a vertex and the edges represent the probability of hopping from one channel to another. There will be an another detailed article, explaining the concept behind different data-driven attribution models and how to apply them.

Figure 5: Markov chain example

Challenges during the Implementation

Transitioning from a traditional attribution models to a data-driven one, may sound exciting but the implementation is rather challenging as there are several issues which can not be resolved just by changing the type of model. Before its implementation, the marketers should perform a customer journey analysis to gain some insights about their customers and try to find out/perform:

  1. Length of customer journey.
  2. On an average how many branded and non branded channels (distinct and non-distinct) in a typical customer journey?
  3. Identify most upper funnel and lower funnel channels.
  4. Voucher analysis: within branded and non-branded channels.

When you are done with the analysis and able to answer all of the above questions, the next step would be to define some rules in order to handle the user data according to your business needs. Some of the issues during the implementation are discussed below along with their solution.

Customer journey duration

Assuming that you are a retailer, let’s try to understand this issue with an example. In May 2016, your company started a Fb advertising campaign for a particular product category which “attracted” a lot of customers including Chris. He saw your Fb ad while working in the office and clicked on it, which took him to your website. As soon as he registered on your website, his boss called him (probably because he was on Fb while working), he closed everything and went for the meeting. After coming back, he started working and completely forgot about your ad or products. After a few days, he received an email with some offers of your products which also he ignored until he saw an ad again on TV in Jan 2019 (after 3 years). At this moment, he started doing his research about your products and finally bought one of your products from some Instagram campaign. It took Chris almost 3 years to make his first purchase.

Figure 6: Chris journey

Now, take a minute and think, if you analyse the entire journey of customers like Chris, you would realize that you are still assigning some of the credit to the touchpoints that happened 3 years ago. This can be solved by using an attribution window. Figure 6 illustrates that 83% of the customers are making a purchase within 30 days which means the attribution window here could be 30 days. In simple words, it is safe to remove the touchpoints that happens after 30 days of purchase. This parameter can also be changed to 45 days or 60 days, depending on the use case.

Figure 7: Length of customer journey

Removal of direct marketing channel

A well known issue that every marketing analyst is aware of is, customers who are already aware of the brand usually comes to the website directly. This leads to overestimation of direct channel and branded channels start getting more credit. In this case, you can set a threshold (say 7 days) and remove these branded channels from customer journey.

Figure 8: Removal of branded channels

Cross platform problem

If some of your customers are using different devices to explore your products and you are not able to track them then it will make retargeting really difficult. In a perfect world these customers belong to same journey and if these can’t be combined then, except one, other paths would be considered as “non-converting path”. For attribution problem device could be thought of as a touchpoint to include in the path but to be able to track these customers across all devices would still be challenging. A brief introduction to deterministic and probabilistic ways of cross device tracking can be found here.

Figure 9: Cross platform clash

How to account for Vouchers?

To better account for vouchers, it can be added as a ‘dummy’ touchpoint of the type of voucher (CRM,Social media, Affiliate or Pricing etc.) used. In our case, we tried to add these vouchers as first touchpoint and also as a last touchpoint but no significant difference was found. Also, if the marketing channel of which the voucher was used was already in the path, the dummy touchpoint was not added.

Figure 10: Addition of Voucher as a touchpoint

Attribution Models in Marketing

Attribution Models

A Business and Statistical Case

INTRODUCTION

A desire to understand the causal effect of campaigns on KPIs

Advertising and marketing costs represent a huge and ever more growing part of the budget of companies. Studies have found out this share is as high as 10% and increases with the size of companies (CMO study by American Marketing Association and Duke University, 2017). Measuring precisely the impact of a specific marketing campaign on the sales of a company is a critical step towards an efficient allocation of this budget. Would the return be higher for an euro spent on a Facebook ad, or should we better spend it on a TV spot? How much should I spend on Twitter ads given the volume of sales this channel is responsible for?

Attribution Models have lately received great attention in Marketing departments to answer these issues. The transition from offline to online marketing methods has indeed permitted the collection of multiple individual data throughout the whole customer journey, and  allowed for the development of user-centric attribution models. In short, Attribution Models use the information provided by Tracking technologies such as Google Analytics or Webtrekk to understand customer journeys from the first click on a Facebook ad to the final purchase and adequately ponderate the different marketing campaigns encountered depending on their responsibility in the final conversion.

Issues on Causal Effects

A key question then becomes: how to declare a channel is responsible for a purchase? In other words, how can we isolate the causal effect or incremental value of a campaign ?

          1. A/B-Tests

One method to estimate the pure impact of a campaign is the design of randomized experiments, wherein a control and treated groups are compared.  A/B tests belong to this broad category of randomized methods. Provided the groups are a priori similar in every aspect except for the treatment received, all subsequent differences may be attributed solely to the treatment. This method is typically used in medical studies to assess the effect of a drug to cure a disease.

Main practical issues regarding Randomized Methods are:

  • Assuring that control and treated groups are really similar before treatment. Uually a random assignment (i.e assuring that on a relevant set of observable variables groups are similar) is realized;
  • Potential spillover-effects, i.e the possibility that the treatment has an impact on the non-treated group as well (Stable unit treatment Value Assumption, or SUTVA in Rubin’s framework);
  • The costs of conducting such an experiment, and especially the costs linked to the deliberate assignment of individuals to a group with potentially lower results;
  • The number of such experiments to design if multiple treatments have to be measured;
  • Difficulties taking into account the interaction effects between campaigns or the effect of spending levels. Indeed, usually A/B tests are led by cutting off temporarily one campaign entirely and measuring the subsequent impact on KPI’s compared to the situation where this campaign is maintained;
  • The dynamical reproduction of experiments if we assume that treatment effects may change over time.

In the marketing context, multiple campaigns must be tested in a dynamical way, and treatment effect is likely to be heterogeneous among customers, leading to practical issues in the lauching of A/B tests to approximate the incremental value of all campaigns. However, sites with a lot of traffic and conversions can highly benefit from A/B testing as it provides a scientific and straightforward way to approximate a causal impact. Leading companies such as Uber, Netflix or Airbnb rely on internal tools for A/B testing automation, which allow them to basically test any decision they are about to make.

References:

Books:

Experiment!: Website conversion rate optimization with A/B and multivariate testing, Colin McFarland, ©2013 | New Riders  

A/B testing: the most powerful way to turn clicks into customers. Dan Siroker, Pete Koomen; Wiley, 2013.

Blogs:

https://eng.uber.com/xp

https://medium.com/airbnb-engineering/growing-our-host-community-with-online-marketing-9b2302299324

Study:

https://cmosurvey.org/wp-content/uploads/sites/15/2018/08/The_CMO_Survey-Results_by_Firm_and_Industry_Characteristics-Aug-2018.pdf

        2. Attribution models

Attribution Models do not demand to create an experimental setting. They take into account existing data and derive insights from the variability of customer journeys. One key difficulty is then to differentiate correlation and causality in the links observed between the exposition to campaigns and purchases. Indeed, selection effects may bias results as exposure to campaigns is usually dependant on user-characteristics and thus may not be necessarily independant from the customer’s baseline conversion probabilities. For example, customers purchasing from a discount price comparison website may be intrinsically different from customers buying from FB ad and this a priori difference may alone explain post-exposure differences in purchasing bahaviours. This intrinsic weakness must be remembered when interpreting Attribution Models results.

                          2.1 General Issues

The main issues regarding the implementation of Attribution Models are linked to

  • Causality and fallacious reasonning, as most models do not take into account the aforementionned selection biases.
  • Their difficult evaluation. Indeed, in almost all attribution models (except for those based on classification, where the accuracy of the model can be computed), the additionnal value brought by the use of a given attribution models cannot be evaluated using existing historical data. This additionnal value can only be approximated by analysing how the implementation of the conclusions of the attribution model have impacted a given KPI.
  • Tracking issues, leading to an uncorrect reconstruction of customer journeys
    • Cross-device journeys: cross-device issue arises from the use of different devices throughout the customer journeys, making it difficult to link datapoints. For example, if a customer searches for a product on his computer but later orders it on his mobile, the AM would then mistakenly consider it an order without prior campaign exposure. Though difficult to measure perfectly, the proportion of cross-device orders can approximate 20-30%.
    • Cookies destruction makes it difficult to track the customer his the whole journey. Both regulations and consumers’ rising concerns about data privacy issues mitigate the reliability and use of cookies.1 – From 2002 on, the EU has enacted directives concerning privacy regulation and the extended use of cookies for commercial targeting purposes, which have highly impacted marketing strategies, such as the ‘Privacy and Electronic Communications Directive’ (2002/58/EC). A research was conducted and found out that the adoption of this ‘Privacy Directive’ had led to 64% decrease in advertising methods compared to the rest of the world (Goldfarb et Tucker (2011)). The effect was stronger for generalized sites (Yahoo) than for specialized sites.2 – Users have grown more and more conscious of data privacy issues and have adopted protective measures concerning data privacy, such as automatic destruction of cookies after a session is ended, or simply giving away less personnal information (Goldfarb et Tucker (2012) ) .Valuable user information may be lost, though tracking technologies evolution have permitted to maintain tracking by other means. This issue may be particularly important in countries highly concerned with data privacy issues such as Germany.
    • Offline/Online bridge: an Attribution Model should take into account all campaigns to draw valuable insights. However, the exposure to offline campaigns (TV, newspapers) are difficult to track at the user level. One idea to tackle this issue would be to estimate the proportion of conversions led by offline campaigns through AB testing and deduce this proportion from the credit assigned to the online campaigns accounted for in the Attribution Model.
    • Touch point information available: clicks are easy to follow but irrelevant to take into account the influence of purely visual campaigns such as display ads or video.

                          2.2 Today’s main practices

Two main families of Attribution Models exist:

  • Rule-Based Attribution Models, which have been used for in the last decade but from which companies are gradualy switching.

Attribution depends on the individual journeys that have led to a purchase and is solely based on the rank of the campaign in the journey. Some models focus on a single touch points (First Click, Last Click) while others account for multi-touch journeys (Bathtube, Linear). It can be calculated at the customer level and thus doesn’t require large amounts of data points. We can distinguish two sub-groups of rule-based Attribution Models:

  • One Touch Attribution Models attribute all credit to a single touch point. The First-Click model attributes all credit for a converion to the first touch point of the customer journey; last touch attributes all credit to the last campaign.
  • Multi-touch Rule-Based Attribution Models incorporate information on the whole customer journey are thus an improvement compared to one touch models. To this family belong Linear model where credit is split equally between all channels, Bathtube model where 40% of credit is given to first and last clicks and the remaining 20% is distributed equally between the middle channels, or time-decay models where credit assigned to a click diminishes as the time between the click and the order increases..

The main advantages of rule-based models is their simplicity and cost effectiveness. The main problems are:

– They are a priori known and can thus lead to optimization strategies from competitors
– They do not take into account aggregate intelligence on customer journeys and actual incremental values.
– They tend to bias (depending on the model chosen) channels that are over-represented at the beggining or end of the funnel, according to theoretical assumptions that have no observationnal back-ups.

  • Data-Driven Attribution Models

These models take into account the weaknesses of rule-based models and make a relevant use of available data. Being data-driven, following attribution models cannot be computed using single user level data. On the contrary values are calculated through data aggregation and thus require a certain volume of customer journey information.

References:

https://dspace.mit.edu/handle/1721.1/64920

 

        3. Data-Driven Attribution Models in practice

                          3.1 Issues

Several issues arise in the computation of campaigns individual impact on a given KPI within a data-driven model.

  • Selection biases: Exposure to certain types of advertisement is usually highly correlated to non-observable variables which are in turn correlated to consumption practices. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in conversion probabilities between groups whether than by the campaign effect.
  • Complementarity: it may be that campaigns A and B only have an effect when combined, so that measuring their individual impact would lead to misleading conclusions. The model could then try to assess the effect of combinations of campaigns on top of the effect of individual campaigns. As the number of possible non-ordered combinations of k campaigns is 2k, it becomes clear that inclusing all possible combinations would however be time-consuming.
  • Order-sensitivity: The effect of a campaign A may depend on the place where it appears in the customer journey, meaning the rank of a campaign and not merely its presence could be accounted for in the model.
  • Relative Order-sensitivity: it may be that campaigns A and B only have an effect when one is exposed to campaign A before campaign B. If so, it could be useful to assess the effect of given combinations of campaigns as well. And this for all campaigns, leading to tremendous numbers of possible combinations.
  • All previous phenomenon may be present, increasing even more the potential complexity of a comprehensive Attribution Model. The number of all possible ordered combination of k campaigns is indeed :

 

                          3.2 Main models

                                  A) Logistic Regression and Classification models

If non converting journeys are available, Attribition Model can be shaped as a simple classification issue. Campaign types or campaigns combination and volume of campaign types can be included in the model along with customer or time variables. As we are interested in inference (on campaigns effect) whether than prediction, a parametric model should be used, such as Logistic Regression. Non paramatric models such as Random Forests or Neural Networks can also be used though the interpretation of campaigns value would be more difficult to derive from the model results.

A common pitfall is the usual issue of spurious correlations on one hand and the correct interpretation of coefficients in business terms.

An advantage if the possibility to evaluate the relevance of the model using common model validation methods to evaluate its predictive power (validation set \ AUC \pseudo R squared).

                                  B) Shapley Value

Theory

The Shapley Value is based on a Game Theory framework and is named after its creator, the Nobel Price Laureate Lloyd Shapley. Initially meant to calculate the marginal contribution of players in cooperative games, the model has received much attention in research and industry and has lately been applied to marketing issues. This model is typically used by Google Adords and other ad bidding vendors. Campaigns or marketing channels are in this model seen as compementary players looking forward to increasing a given KPI.
Contrarily to Logistic Regressions, it is a non-parametric model. Contrarily to Markov Chains, all results are built using existing journeys, and not simulated ones.

Channels are considered to enter the game sequentially under a certain joining order. Shapley value try to The Shapley value of channel i is the weighted sum of the marginal values that channel i adds to all possible coalitions that don’t contain channel i.
In other words, the main logic is to analyse the difference of gains when a channel i is added after a coalition Ck of k channels, k<=n. We then sum all the marginal contributions over all possible ordered combination Ck of all campaigns excluding i, with k<=n-1.

Subsets framework

A first an most usual way to compute the Shapley Vaue is to consider that when a channel enters coalition, its additionnal value is the same irrelevant of the order in which previous channels have appeared. In other words, journeys (A>B>C) and (B>A>C) trigger the same gains.
Shapley value is computed as the gains associated to adding a channel i to a subset of channels, weighted by the number of (ordered) sequences that the (unordered) subset represents, summed up on all possible subsets of the total set of campaigns where the channel i is not present.
The Shapley value of the channel ???????? is then:

where |S| is the number of campaigns of a coalition S and the sum extends over all subsets S that do not not contain channel j. ????(????)  is the value of the coalition S and ????(???? ∪ {????????})  the value of the coalition formed by adding ???????? to coalition S. ????(???? ∪ {????????}) − ????(????) is thus the marginal contribution of channel ???????? to the coalition S.

The formula can be rewritten and understood as:

This method is convenient when data on the gains of on all possible permutations of all unordered k subsets of the n campaigns are available. It is also more convenient if the order of campaigns prior to the introduction of a campaign is thought to have no impact.

Ordered sequences

Let us define ????((A>B)) as the value of the sequence A then B. What is we let ????((A>B)) be different from ????((B>A)) ?
This time we would need to sum over all possible permutation of the S campaigns present before  ???????? and the N-(S+1) campaigns after ????????. Doing so we will sum over all possible orderings (i.e all permutations of the n campaigns of the grand coalition containing all campaigns) and we can remove the permutation coefficient s!(p-s+1)!.

This method is convenient when the order of channels prior to and after the introduction of another channel is assumed to have an impact. It is also necessary to possess data for all possible permutations of all k subsets of the n campaigns, and not only on all (unordered) k-subsets of the n campaigns, k<=n. In other words, one must know the gains of A, B, C, A>B, B>A, etc. to compute the Shapley Value.

Differences between the two approaches

We simulate an ordered case where the value for each ordered sequence k for k<=3 is known. We compare it to the usual Shapley value calculated based on known gains of unordered subsets of campaigns. So as to compare relevant values, we have built the gains matrix so that the gains of a subset A, B i.e  ????({B,A}) is the average of the gains of ordered sequences made up with A and B (assuming the number of journeys where A>B equals the number of journeys where B>A, we have ????({B,A})=0.5( ????((A>B)) + ????((B>A)) ). We let the value of the grand coalition be different depending on the order of campaigns-keeping the constraints that it averages to the value used for the unordered case.

Note: mvA refers to the marginal value of A in a given sequence.
With traditionnal unordered coalitions:

With ordered sequences used to compute the marginal values:

 

We can see that the two approaches yield very different results. In the unordered case, the Shapley Value campaign C is the highest, culminating at 20, while A and B have the same Shapley Value mvA=mvB=15. In the ordered case, campaign A has the highest Shapley Value and all campaigns have different Shapley Values.

This example illustrates the inherent differences between the set and sequences approach to Shapley values. Real life data is more likely to resemble the ordered case as conversion probabilities may for any given set of campaigns be influenced by the order through which the campaigns appear.

Advantages

Shapley value has become popular in allocation problems in cooperative games because it is the unique allocation which satisfies different axioms:

  • Efficiency: Shaple Values of all channels add up to the total gains (here, orders) observed.
  • Symmetry: if channels A and B bring the same contribution to any coalition of campaigns, then their Shapley Value i sthe same
  • Null player: if a channel brings no additionnal gains to all coalitions, then its Shapley Value is zero
  • Strong monotony: the Shapley Value of a player increases weakly if all its marginal contributions increase weakly

These properties make the Shapley Value close to what we intuitively define as a fair attribution.

Issues

  • The Shapley Value is based on combinatory mathematics, and the number of possible coalitions and ordered sequences becomes huge when the number of campaigns increases.
  • If unordered, the Shapley Value assumes the contribution of campaign A is the same if followed by campaign B or by C.
  • If ordered, the number of combinations for which data must be available and sufficient is huge.
  • Channels rarely present or present in long journeys will be played down.
  • Generally, gains are supposed to grow with the number of players in the game. However, it is plausible that in the marketing context a journey with a high number of channels will not necessarily bring more orders than a journey with less channels involved.

References:

R package: GameTheoryAllocation

Article:
Zhao & al, 2018 “Shapley Value Methods for Attribution Modeling in Online Advertising “
https://link.springer.com/content/pdf/10.1007/s13278-017-0480-z.pdf
Courses: https://www.lamsade.dauphine.fr/~airiau/Teaching/CoopGames/2011/coopgames-7%5b8up%5d.pdf
Blogs: https://towardsdatascience.com/one-feature-attribution-method-to-supposedly-rule-them-all-shapley-values-f3e04534983d

                                  B) Markov Chains

Markov Chains are used to model random processes, i.e events that occur in a sequential manner and in such a way that the probability to move to a certain state only depends on the past steps. The number of previous steps that are taken into account to model the transition probability is called the memory parameter of the sequence, and for the model to have a solution must be comprised between 0 and 4. A Markov Chain process is thus defined entirely by its Transition Matrix and its initial vector (i.e the starting point of the process).

Markov Chains are applied in many scientific fields. Typically, they are used in weather forecasting, with the sequence of Sunny and Rainy days following a Markov Process of memory parameter 0, so that for each given day the probability that the next day will be rainy or sunny only depends on the weather of the current day. Other applications can be found in sociology to understand the dynamics of social classes intergenerational reproduction. To get more both mathematical and applied illustration, I recommend the reading of this course.

In the marketing context, Markov Chains are an interesting way to model the conversion funnel. To go from the from the Markov Model to the Attribution logic, we calculate the Removal Effect of each channel, i.e the difference in conversions that happen if the channel is removed. Please read below for an introduction to the methodology.

The first step in a Markov Chains Attribution Model is to build the transition matrix that captures the transition probabilities between the campaigns accross existing customer journeys. This Matrix is to be read as a “From state A to state B” table, from the left to the right. A first difficulty is finding the right memory parameter to use. A large memory parameter would allow to take more into account interraction effects within the conversion funnel but would lead to increased computationnal time, a non-readable transition matrix, and be more sensitive to noisy data. Please note that this transition matrix provides useful information on the conversion funnel and on the relationships between campaigns and can be used as such as an analytical tool. I suggest the clear and easily R code which can be found here or here.

Here is an illustration of a Markov Chain with memory Parameter of 0: the probability to go to a certain campaign B in the next step only depend on the campaign we are currently at:

The associated Transition Matrix is then (with null probabilities left as Blank):

The second step is  to compute the actual responsibility of a channel in total conversions. As mentionned above, the main philosophy to do so is to calculate the Removal Effect of each channel, i.e the changes in the number of conversions when a channel is entirely removed. All customer journeys which went through this channel are settled out to be unsuccessful. This calculation is done by applying the transition matrix with and without the removed channels to an initial vector that contains the number of desired simulations.

Building on our current example, we can then settle an initial vector with the desired number of simulations, e.g 10 000:

 

It is possible at this stage to add a constraint on the maximum number of times the matrix is applied to the data, i.e on the maximal number of campaigns a simulated journey is allowed to have.

Advantages

  • The dynamic journey is taken into account, as well as the transition between two states. The funnel is not assumed to be linear.
  • It is possile to build a conversion graph that maps the customer journey provides valuable insights.
  • It is possible to evaluate partly the accuracy of the Attribution Model based on Markov Chains. It is for example possible to see how well the transition matrix help predict the future by analysing the number of correct predictions at any given step over all sequences.

Disadvantages

  • It can be somewhat difficult to set the memory parameter. Complementarity effects between channels are not well taken into account if the memory is low, but a parameter too high will lead to over-sensitivity to noise in the data and be difficult to implement if customer journeys tend to have a number of campaigns below this memory parameter.
  • Long journeys with different channels involved will be overweighted, as they will count many times in the Removal Effect.  For example, if there are n-1 channels in the customer journey, this journey will be considered as failure for the n-1 channel-RE. If the volume effects (i.e the impact of the overall number of channels in a journey, irrelevant from their type° are important then results may be biased.

References:

R package: ChannelAttribution

Git:

https://github.com/MatCyt/Markov-Chain/blob/master/README.md

Course:

https://www.ssc.wisc.edu/~jmontgom/markovchains.pdf

Article:

“Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling”; Anderl, Eva and Becker, Ingo and Wangenheim, Florian V. and Schumann, Jan Hendrik, 2014. Available at SSRN: https://ssrn.com/abstract=2343077 or http://dx.doi.org/10.2139/ssrn.2343077

“Media Exposure through the Funnel: A Model of Multi-Stage Attribution”, Abhishek & al, 2012

“Multichannel Marketing Attribution Using Markov Chains”, Kakalejčík, L., Bucko, J., Resende, P.A.A. and Ferencova, M. Journal of Applied Management and Investments, Vol. 7 No. 1, pp. 49-60.  2018

Blogs:

https://analyzecore.com/2016/08/03/attribution-model-r-part-1

https://analyzecore.com/2016/08/03/attribution-model-r-part-2

                          3.3 To go further: Tackling selection biases with Quasi-Experiments

Exposure to certain types of advertisement is usually highly correlated to non-observable variables. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in converison probabilities between groups whether than by the campaign effect. These potential selection effects may bias the results obtained using historical data.

Quasi-Experiments can help correct this selection effect while still using available observationnal data.  These methods recreate the settings on a randomized setting. The goal is to come as close as possible to the ideal of comparing two populations that are identical in all respects except for the advertising exposure. However, populations might still differ with respect to some unobserved characteristics.

Common quasi-experimental methods used for instance in Public Policy Evaluation are:

  • Discontinuity Regressions
  • Matching Methods, such as Exact Matching,  Propensity-score matching or k-nearest neighbourghs.

References:

Article:

“Towards a digital Attribution Model: Measuring the impact of display advertising on online consumer behaviour”, Anindya Ghose & al, MIS Quarterly Vol. 40 No. 4, pp. 1-XX, 2016

https://pdfs.semanticscholar.org/4fa6/1c53f281fa63a9f0617fbd794d54911a2f84.pdf

        4. First Steps towards a Practical Implementation

Identify key points of interests

  • Identify the nature of touchpoints available: is the data based on clicks? If so, is there a way to complement the data with A/B tests to measure the influence of ads without clicks (display, video) ? For example, what happens to sales when display campaign is removed? Analysing this multiplier effect would give the overall responsibility of display on sales, to be deduced from current attribution values given to click-based channels. More interestingly, what is the impact of the removal of display campaign on the occurences of click-based campaigns ? This would give us an idea of the impact of display ads on the exposure to each other campaigns, which would help correct the attribution values more precisely at the campaign level.
  • Define the KPI to track. From a pure Marketing perspective, looking at purchases may be sufficient, but from a financial perspective looking at profits, though a bit more difficult to compute, may drive more interesting results.
  • Define a customer journey. It may seem obvious, but the notion needs to be clarified at first. Would it be defined by a time limit? If so, which one? Does it end when a conversion is observed? For example, if a customer makes 2 purchases, would the campaigns he’s been exposed to before the first order still be accounted for in the second order? If so, with a time decay?
  • Define the research framework: are we interested only in customer journeys which have led to conversions or in all journeys? Keep in mind that successful customer journeys are a non-representative sample of customer journeys. Models built on the analysis of biased samples may be conservative. Take an extreme example: 80% of customers who see campaign A buy the product, VS 1% for campaign B. However, campaign B exposure is great and 100 Million people see it VS only 1M for campaign A. An Attribution Model based on successful journeys will give higher credit to campaign B which is an auguable conclusion. Taking into account costs per campaign (in the case where costs are calculated by clicks) may of course tackle this issue partly, as campaign A could then exhibit higher returns, but a serious fallacious reasonning is at stake here.

Analyse the typical customer journey    

  • Performing a duration analysis on the data may help you improve the definition of the customer journey to be used by your organization. After which days are converison probabilities null? Should we consider the effect of campaigns disappears after x days without orders? For example, if 99% of orders are placed in the 30 days following a first click, it might be interesting to define the customer journey as a 30 days time frame following the first oder.
  • Look at the distribution of the number of campaigns in a typical journey. If you choose to calculate the effect of campaigns interraction in your Attribution Model, it may indeed help you determine the maximum number of campaigns to be included in a combination. Indeed, you may not need to assess the impact of channel combinations with above than 4 different channels if 95% of orders are placed after less then 4 campaigns.
  • Transition matrixes: what if a campaign A systematically leads to a campaign B? What happens if we remove A or B? These insights would give clues to ask precise questions for a latter AB test, for example to find out if there is complementarity between channels A and B – (implying none should be removed) or mere substitution (implying one can be given up).
  • If conversion rates are available: it can be interesting to perform a survival analysis i.e to analyse the likelihood of conversion based on duration since first click. This could help us excluse potential outliers or individuals who have very low conversion probabilities.

Summary

Attribution is a complex topic which will probably never be definitively solved. Indeed, a main issue is the difficulty, or even impossibility, to evaluate precisely the accuracy of the attribution model that we’ve built. Attribution Models should be seen as a good yet always improvable approximation of the incremental values of campaigns, and be presented with their intrinsinc limits and biases.

Predictive maintenance in Semiconductor Industry: Part 1

The process in the semiconductor industry is highly complicated and is normally under consistent observation via the monitoring of the signals coming from several sensors. Thus, it is important for the organization to detect the fault in the sensor as quickly as possible. There are existing traditional statistical based techniques however modern semiconductor industries have the ability to produce more data which is beyond the capability of the traditional process.

For this article, we will be using SECOM dataset which is available here.  A lot of work has already done on this dataset by different authors and there are also some articles available online. In this article, we will focus on problem definition, data understanding, and data cleaning.

This article is only the first of three parts, in this article we will discuss the business problem in hand and clean the dataset. In second part we will do feature engineering and in the last article we will build some models and evaluate them.

Problem definition

This data which is collected by these sensors not only contains relevant information but also a lot of noise. The dataset contains readings from 590. Among the 1567 examples, there are only 104 fail cases which means that out target variable is imbalanced. We will look at the distribution of the dataset when we look at the python code.

NOTE: For a detailed description regarding this cases study I highly recommend to read the following research papers:

  •  Kerdprasop, K., & Kerdprasop, N. A Data Mining Approach to Automate Fault Detection Model Development in the Semiconductor Manufacturing Process.
  • Munirathinam, S., & Ramadoss, B. Predictive Models for Equipment Fault Detection in the Semiconductor Manufacturing Process.

Data Understanding and Preparation

Let’s start exploring the dataset now. The first step as always is to import the required libraries.

import pandas as pd
import numpy as np

There are several ways to import the dataset, you can always download and then import from your working directory. However, I will directly import using the link. There are two datasets: one contains the readings from the sensors and the other one contains our target variable and a timestamp.

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
names = ["feature" + str(x) for x in range(1, 591)]
secom_var = pd.read_csv(url, sep=" ", names=names, na_values = "NaN") 


url_l = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
secom_labels = pd.read_csv(url_l,sep=" ",names = ["classification","date"],parse_dates = ["date"],na_values = "NaN")

The first step before doing the analysis would be to merge the dataset and we will us pandas library to merge the datasets in just one line of code.

#Data cleaning
#1. Combined the two datasets
secom_merged = pd.merge(secom_var, secom_labels,left_index=True,right_index=True)

Now let’s check out the distribution of the target variable

secom_merged.target.value_counts().plot(kind = 'bar')

Figure 1: Distribution of Target Variable

From Figure 1 it can be observed that the target variable is imbalanced and it is highly recommended to deal with this problem before the model building phase to avoid bias model. Xgboost is one of the models which can deal with imbalance classes but one needs to spend a lot of time to tune the hyper-parameters to achieve the best from the model.

The dataset in hand contains a lot of null values and the next step would be to analyse these null values and remove the columns having null values more than a certain percentage. This percentage is calculated based on 95th quantile of null values.

#2. Analyzing nulls
secom_rmNa.isnull().sum().sum()
secom_nulls = secom_rmNa.isnull().sum()/len(secom_rmNa)
secom_nulls.describe()
secom_nulls.hist()

Figure 2: Missing percentge in each column

Now we calculate the 95th percentile of the null values.

x = secom_nulls.quantile(0.95)
secom_rmNa = secom_merged[secom_merged.columns[secom_nulls < x]]

Figure 3: Missing percentage after removing columns with more then 45% Na

From figure 3 its visible that there are still missing values in the dataset and can be dealt by using many imputation methods. The most common method is to impute these values by mean, median or mode. There also exist few sophisticated techniques like K-nearest neighbour and interpolation.  We will be applying interpolation technique to our dataset. 

secom_complete = secom_rmNa.interpolate()

To prepare our dataset for analysis we should remove some more unwanted columns like columns with near zero variance. For this we can calulate number of unique values in each column and if there is only one unique value we can delete the column as it holds no information.

df = secom_complete.loc[:,secom_complete.apply(pd.Series.nunique) != 1]

## Let's check the shape of the df
df.shape
(1567, 444)

We have applied few data cleaning techniques and reduced the features from 590 to 444. However, In the next article we will apply some feature engineering techniques and adress problems like the curse of dimensionality and will also try to balance the target variable.

Bleiben Sie dran!!

Sentiment Analysis using Python

One of the applications of text mining is sentiment analysis. Most of the data is getting generated in textual format and in the past few years, people are talking more about NLP. Improvement is a continuous process many product based companies leverage these text mining techniques to examine the sentiments of the customers to find about what they can improve in the product. This information also helps them to understand the trend and demand of the end user which results in Customer satisfaction.

As text mining is a vast concept, the article is divided into two subchapters. The main focus of this article will be calculating two scores: sentiment polarity and subjectivity using python. The range of polarity is from -1 to 1(negative to positive) and will tell us if the text contains positive or negative feedback. Most companies prefer to stop their analysis here but in our second article, we will try to extend our analysis by creating some labels out of these scores. Finally, a multi-label multi-class classifier can be trained to predict future reviews.

Without any delay let’s deep dive into the code and mine some knowledge from textual data.

There are a few NLP libraries existing in Python such as Spacy, NLTK, gensim, TextBlob, etc. For this particular article, we will be using NLTK for pre-processing and TextBlob to calculate sentiment polarity and subjectivity.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

The dataset is available here for download and we will be using pandas read_csv function to import the dataset. I would like to share an additional information here which I came to know about recently. Those who have already used python and pandas before they probably know that read_csv is by far one of the most used function. However, it can take a while to upload a big file. Some folks from  RISELab at UC Berkeley created Modin or Pandas on Ray which is a library that speeds up this process by changing a single line of code.

amz_reviews = pd.read_csv("1429_1.csv")

After importing the dataset it is recommended to understand it first and study the structure of the dataset. At this point we are interested to know how many columns are there and what are these columns so I am going to check the shape of the data frame and go through each column name to see if we need them or not.

amz_reviews.shape
(34660, 21)

amz_reviews.columns
Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='object')

 

There are so many columns which are not useful for our sentiment analysis and it’s better to remove these columns. There are many ways to do that: either just select the columns which you want to keep or select the columns you want to remove and then use the drop function to remove it from the data frame. I prefer the second option as it allows me to look at each column one more time so I don’t miss any important variable for the analysis.

columns = ['id','name','keys','manufacturer','reviews.dateAdded', 'reviews.date','reviews.didPurchase',
          'reviews.userCity', 'reviews.userProvince', 'reviews.dateSeen', 'reviews.doRecommend','asins',
          'reviews.id', 'reviews.numHelpful', 'reviews.sourceURLs', 'reviews.title']

df = pd.DataFrame(amz_reviews.drop(columns,axis=1,inplace=False))

Now let’s dive deep into the data and try to mine some knowledge from the remaining columns. The first step we would want to follow here is just to look at the distribution of the variables and try to make some notes. First, let’s look at the distribution of the ratings.

df['reviews.rating'].value_counts().plot(kind='bar')

Graphs are powerful and at this point, just by looking at the above bar graph we can conclude that most people are somehow satisfied with the products offered at Amazon. The reason I am saying ‘at’ Amazon is because it is just a platform where anyone can sell their products and the user are giving ratings to the product and not to Amazon. However, if the user is satisfied with the products it also means that Amazon has a lower return rate and lower fraud case (from seller side). The job of a Data Scientist relies not only on how good a model is but also on how useful it is for the business and that’s why these business insights are really important.

Data pre-processing for textual variables

Lowercasing

Before we move forward to calculate the sentiment scores for each review it is important to pre-process the textual data. Lowercasing helps in the process of normalization which is an important step to keep the words in a uniform manner (Welbers, et al., 2017, pp. 245-265).

## Change the reviews type to string
df['reviews.text'] = df['reviews.text'].astype(str)

## Before lowercasing 
df['reviews.text'][2]
'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it 
already...'

## Lowercase all reviews
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['reviews.text'][2] ## to see the difference
'inexpensive tablet for him to use and learn on, step up from the nabi. he was thrilled with it, learn how to skype on it 
already...'

Special characters

Special characters are non-alphabetic and non-numeric values such as {!,@#$%^ *()~;:/<>|+_-[]?}. Dealing with numbers is straightforward but special characters can be sometimes tricky. During tokenization, special characters create their own tokens and again not helpful for any algorithm, likewise, numbers.

## remove punctuation
df['reviews.text'] = df['reviews.text'].str.replace('[^ws]','')
df['reviews.text'][2]
'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already'

Stopwords

Stop-words being most commonly used in the English language; however, these words have no predictive power in reality. Words such as I, me, myself, he, she, they, our, mine, you, yours etc.

stop = stopwords.words('english')
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['reviews.text'][2]
'inexpensive tablet use learn step nabi thrilled learn skype already'

Stemming

Stemming algorithm is very useful in the field of text mining and helps to gain relevant information as it reduces all words with the same roots to a common form by removing suffixes such as -action, ing, -es and -ses. However, there can be problematic where there are spelling errors.

st = PorterStemmer()
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df['reviews.text'][2]
'inexpens tablet use learn step nabi thrill learn skype alreadi'

This step is extremely useful for pre-processing textual data but it also depends on your goal. Here our goal is to calculate sentiment scores and if you look closely to the above code words like ‘inexpensive’ and ‘thrilled’ became ‘inexpens’ and ‘thrill’ after applying this technique. This will help us in text classification to deal with the curse of dimensionality but to calculate the sentiment score this process is not useful.

Sentiment Score

It is now time to calculate sentiment scores of each review and check how these scores look like.

## Define a function which can be applied to calculate the score for the whole dataset

def senti(x):
    return TextBlob(x).sentiment  

df['senti_score'] = df['reviews.text'].apply(senti)

df.senti_score.head()

0                                   (0.3, 0.8)
1                                (0.65, 0.675)
2                                   (0.0, 0.0)
3    (0.29545454545454547, 0.6492424242424243)
4                    (0.5, 0.5827777777777777)
Name: senti_score, dtype: object

As it can be observed there are two scores: the first score is sentiment polarity which tells if the sentiment is positive or negative and the second score is subjectivity score to tell how subjective is the text.

In my next article, we will extend this analysis by creating labels based on these scores and finally we will train a classification model.

Sentiment Analysis using Python

One of the applications of text mining is sentiment analysis. Most of the data is getting generated in textual format and in the past few years, people are talking more about NLP. Improvement is a continuous process and many product based companies leverage these text mining techniques to examine the sentiments of the customers to find about what they can improve in the product. This information also helps them to understand the trend and demand of the end user which results in Customer satisfaction.

As text mining is a vast concept, the article is divided into two subchapters. The main focus of this article will be calculating two scores: sentiment polarity and subjectivity using python. The range of polarity is from -1 to 1(negative to positive) and will tell us if the text contains positive or negative feedback. Most companies prefer to stop their analysis here but in our second article, we will try to extend our analysis by creating some labels out of these scores. Finally, a multi-label multi-class classifier can be trained to predict future reviews.

Without any delay let’s deep dive into the code and mine some knowledge from textual data.

There are a few NLP libraries existing in Python such as Spacy, NLTK, gensim, TextBlob, etc. For this particular article, we will be using NLTK for pre-processing and TextBlob to calculate sentiment polarity and subjectivity.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

The dataset is available here for download and we will be using pandas read_csv function to import the dataset. I would like to share an additional information here which I came to know about recently. Those who have already used python and pandas before they probably know that read_csv is by far one of the most used function. However, it can take a while to upload a big file. Some folks from  RISELab at UC Berkeley created Modin or Pandas on Ray which is a library that speeds up this process by changing a single line of code.

amz_reviews = pd.read_csv("1429_1.csv")

After importing the dataset it is recommended to understand it first and study the structure of the dataset. At this point we are interested to know how many columns are there and what are these columns so I am going to check the shape of the data frame and go through each column name to see if we need them or not.

amz_reviews.shape
(34660, 21)

amz_reviews.columns
Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='object')

 

There are so many columns which are not useful for our sentiment analysis and it’s better to remove these columns. There are many ways to do that: either just select the columns which you want to keep or select the columns you want to remove and then use the drop function to remove it from the data frame. I prefer the second option as it allows me to look at each column one more time so I don’t miss any important variable for the analysis.

columns = ['id','name','keys','manufacturer','reviews.dateAdded', 'reviews.date','reviews.didPurchase',
          'reviews.userCity', 'reviews.userProvince', 'reviews.dateSeen', 'reviews.doRecommend','asins',
          'reviews.id', 'reviews.numHelpful', 'reviews.sourceURLs', 'reviews.title']

df = pd.DataFrame(amz_reviews.drop(columns,axis=1,inplace=False))

Now let’s dive deep into the data and try to mine some knowledge from the remaining columns. The first step we would want to follow here is just to look at the distribution of the variables and try to make some notes. First, let’s look at the distribution of the ratings.

df['reviews.rating'].value_counts().plot(kind='bar')

Graphs are powerful and at this point, just by looking at the above bar graph we can conclude that most people are somehow satisfied with the products offered at Amazon. The reason I am saying ‘at’ Amazon is because it is just a platform where anyone can sell their products and the user are giving ratings to the product and not to Amazon. However, if the user is satisfied with the products it also means that Amazon has a lower return rate and lower fraud case (from seller side). The job of a Data Scientist relies not only on how good a model is but also on how useful it is for the business and that’s why these business insights are really important.

Data pre-processing for textual variables

Lowercasing

Before we move forward to calculate the sentiment scores for each review it is important to pre-process the textual data. Lowercasing helps in the process of normalization which is an important step to keep the words in a uniform manner (Welbers, et al., 2017, pp. 245-265).

## Change the reviews type to string
df['reviews.text'] = df['reviews.text'].astype(str)

## Before lowercasing 
df['reviews.text'][2]
'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it 
already...'

## Lowercase all reviews
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['reviews.text'][2] ## to see the difference
'inexpensive tablet for him to use and learn on, step up from the nabi. he was thrilled with it, learn how to skype on it 
already...'

Special characters

Special characters are non-alphabetic and non-numeric values such as {!,@#$%^ *()~;:/<>|+_-[]?}. Dealing with numbers is straightforward but special characters can be sometimes tricky. During tokenization, special characters create their own tokens and again not helpful for any algorithm, likewise, numbers.

## remove punctuation
df['reviews.text'] = df['reviews.text'].str.replace('[^ws]','')
df['reviews.text'][2]
'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already'

Stopwords

Stop-words being most commonly used in the English language; however, these words have no predictive power in reality. Words such as I, me, myself, he, she, they, our, mine, you, yours etc.

stop = stopwords.words('english')
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['reviews.text'][2]
'inexpensive tablet use learn step nabi thrilled learn skype already'

Stemming

Stemming algorithm is very useful in the field of text mining and helps to gain relevant information as it reduces all words with the same roots to a common form by removing suffixes such as -action, ing, -es and -ses. However, there can be problematic where there are spelling errors.

st = PorterStemmer()
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df['reviews.text'][2]
'inexpens tablet use learn step nabi thrill learn skype alreadi'

This step is extremely useful for pre-processing textual data but it also depends on your goal. Here our goal is to calculate sentiment scores and if you look closely to the above code words like ‘inexpensive’ and ‘thrilled’ became ‘inexpens’ and ‘thrill’ after applying this technique. This will help us in text classification to deal with the curse of dimensionality but to calculate the sentiment score this process is not useful.

Sentiment Score

It is now time to calculate sentiment scores of each review and check how these scores look like.

## Define a function which can be applied to calculate the score for the whole dataset

def senti(x):
    return TextBlob(x).sentiment  

df['senti_score'] = df['reviews.text'].apply(senti)

df.senti_score.head()

0                                   (0.3, 0.8)
1                                (0.65, 0.675)
2                                   (0.0, 0.0)
3    (0.29545454545454547, 0.6492424242424243)
4                    (0.5, 0.5827777777777777)
Name: senti_score, dtype: object

As it can be observed there are two scores: the first score is sentiment polarity which tells if the sentiment is positive or negative and the second score is subjectivity score to tell how subjective is the text. The whole code is available here.

In my next article, we will extend this analysis by creating labels based on these scores and finally we will train a classification model.

The importance of domain knowledge – A healthcare data science perspective

Data scientists have (and need) many skills. They are frequently either former academic researchers or software engineers, with knowledge and skills in statistics, programming, machine learning, and many other domains of mathematics and computer science. These skills are general and allow data scientists to offer valuable services to almost any field. However, data scientists in some cases find themselves in industries they have relatively little knowledge of.

This is especially true in the healthcare field. In healthcare, there is an enormous amount of important clinical knowledge that might be relevant to a data scientist. It is unreasonable to expect a data scientist to not only have all of the skills typically required of a data scientist, but to also have all of the knowledge a medical professional may have.

Why is domain knowledge necessary?

This lack of domain knowledge, while perfectly understandable, can be a major barrier to healthcare data scientists. For one thing, it’s difficult to come up with project ideas in a domain that you don’t know much about. It can also be difficult to determine the type of data that may be helpful for a project – if you want to build a model to predict a health outcome (for example, whether a patient has or is likely to develop a gastrointestinal bleed), you need to know what types of variables might be related to this outcome so you can make sure to gather the right data.

Knowing the domain is useful not only for figuring out projects and how to approach them, but also for having rules of thumb for sanity checks on the data. Knowing how data is captured (is it hand-entered? Is it from machines that can give false readings for any number of reasons?) can help a data scientist with data cleaning and from going too far down the wrong path. It can also inform what true outliers are and which values might just be due to measurement error.

Often the most challenging part of building a machine learning model is feature engineering. Understanding clinical variables and how they relate to a health outcome is extremely important for this. Is a long history of high blood pressure important for predicting heart problems, or is only very recent history? How long a time horizon is considered ‘long’ or ‘short’ in this context? What other variables might be related to this health outcome? Knowing the domain can help direct the data exploration and greatly speed (and enhance) the feature engineering process.

Once features are generated, knowing what relationships between variables are plausible helps for basic sanity checks. If you’re finding the best predictor of hospitalization is the patient’s eye color, this might indicate an issue with your code. Being able to glance at the outcome of a model and determine if they make sense goes a long way for quality assurance of any analytical work.

Finally, one of the biggest reasons a strong understanding of the data is important is because you have to interpret the results of analyses and modeling work. Knowing what results are important and which are trivial is important for the presentation and communication of results. An analysis that determines there is a strong relationship between age and mortality is probably well-known to clinicians, while weaker but more surprising associations may be of more use. It’s also important to know what results are actionable. An analysis that finds that patients who are elderly are likely to end up hospitalized is less useful for trying to determine the best way to reduce hospitalizations (at least, without further context).

How do you get domain knowledge?

In some industries, such as tech, it’s fairly easy and straightforward to see an end-user’s prospective. By simply viewing a website or piece of software from the user’s point of view, a data scientist can gain a lot of the needed context and background knowledge needed to understand where their data is coming from and how their model output is being used. In the healthcare industry, it’s more difficult. A data scientist can’t easily choose to go through med school or the experience of being treated for a chronic illness. This means there is no easy single answer to where to gain domain knowledge. However, there are many avenues available.

Reading literature and attending presentations can boost one’s domain knowledge. However, it’s often difficult to find resources that are penetrable for someone who is not already a clinician. To gain deep knowledge, one needs to be steeped in the topic. One important avenue to doing this is through the establishment of good relationships with clinicians. Clinicians can be powerful allies that can help point you in the right direction for understanding your data, and simply by chatting with them you can gain important insights. They can also help you visit the clinics or practices to interact with the people that perform the procedures or even watch the procedures being done. At Fresenius Medical Care, where I work, members of my team regularly visit clinics. I have in the last year visited one of our dialysis clinics, a nephrology practice, and a vascular care unit. These experiences have been invaluable to me in developing my knowledge of the treatment of chronic illnesses.

In conclusion, it is crucial for data scientists to acquire basic familiarity in the field they are working in and in being part of collaborative teams that include people who are technically knowledgeable in the field they work in. This said, acquiring even an essential understanding (such as “Medicine 101”) may go a long way for the data scientists in being able to become self-sufficient in essential feature selection and design.

 

Data Science and Predictive Analytics in Healthcare

Doing data science in a healthcare company can save lives. Whether it’s by predicting which patients have a tumor on an MRI, are at risk of re-admission, or have misclassified diagnoses in electronic medical records are all examples of how predictive models can lead to better health outcomes and improve the quality of life of patients.  Nevertheless, the healthcare industry presents many unique challenges and opportunities for data scientists.

The impact of data science in healthcare

Healthcare providers have a plethora of important but sensitive data. Medical records include a diverse set of data such as basic demographics, diagnosed illnesses, and a wealth of clinical information such as lab test results. For patients with chronic diseases, there could be a long and detailed history of data available on a number of health indicators due to the frequency of visits to a healthcare provider. Information from medical records can often be combined with outside data as well. For example, a patient’s address can be combined with other publicly available information to determine the number of surgeons that practice near a patient or other relevant information about the type of area that patients reside in.

With this rich data about a patient as well as their surroundings, models can be built and trained to predict many outcomes of interest. One important area of interest is models predicting disease progression, which can be used for disease management and planning. For example, at Fresenius Medical Care (where we primarily care for patients with chronic conditions such as kidney disease), we use a Chronic Kidney Disease progression model that can predict the trajectory of a patient’s condition to help clinicians decide whether and when to proceed to the next stage in their medical care. Predictive models can also notify clinicians about patients who may require interventions to reduce risk of negative outcomes. For instance, we use models to predict which patients are at risk for hospitalization or missing a dialysis treatment. These predictions, along with the key factors driving the prediction, are presented to clinicians who can decide if certain interventions might help reduce the patient’s risk.

Challenges of data science in healthcare

One challenge is that the healthcare industry is far behind other sectors in terms of adopting the latest technology and analytics tools. This does present some challenges, and data scientists should be aware that the data infrastructure and development environment at many healthcare companies will not be at the bleeding edge of the field. However it also means there are a lot of opportunities for improvement, and even small simple models can yield vast improvements over current methods.

Another challenge in the healthcare sector arises from the sensitive nature of medical information. Due to concerns over data privacy, it can often be difficult to obtain access to data that the company has. For this reason, data scientists considering a position at a healthcare company should be aware of whether there is already an established protocol for data professionals to get access to the data. If there isn’t, be aware that simply getting access to the data may be a major effort in itself.

Finally, it is important to keep in mind the end-use of any predictive model. In many cases, there are very different costs to false-negatives and false-positives. A false-negative may be detrimental to a patient’s health, while too many false-positives may lead to many costly and unnecessary treatments (also to the detriment of patients’ health for certain treatments as well as economy overall). Education about the proper use of predictive models and their limitations is essential for end-users. Finally, making sure the output of a predictive model is actionable is important. Predicting that a patient is at high-risk is only useful if the model outputs is interpretable enough to explain what factors are putting that patient at risk. Furthermore, if the model is being used to plan interventions, the factors that can be changed need to be highlighted in some way – telling a clinician that a patient is at risk because of their age is not useful if the point of the prediction is to lower risk through intervention.

The future of data science in the healthcare sector

The future holds a lot of promise for data science in healthcare. Wearable devices that track all kinds of activity and biometric data are becoming more sophisticated and more common. Streaming data coming from either wearables or devices providing treatment (such as dialysis machines) could eventually be used to provide real-time alerts to patients or clinicians about health events outside of the hospital.

Currently, a major issue facing medical providers is that patients’ data tends to exist in silos. There is little integration across electronic medical record systems (both between and within medical providers), which can lead to fragmented care. This can lead to clinicians receiving out of date or incomplete information about a patient, or to duplication of treatments. Through a major data engineering effort, these systems could (and should) be integrated. This would vastly increase the potential of data scientists and data engineers, who could then provide analytics services that took into account the whole patients’ history to provide a level of consistency across care providers. Data workers could use such an integrated record to alert clinicians to duplications of procedures or dangerous prescription drug combinations.

Data scientists have a lot to offer in the healthcare industry. The advances of machine learning and data science can and should be adopted in a space where the health of individuals can be improved. The opportunities for data scientists in this sector are nearly endless, and the potential for good is enormous.

Artificial Intelligence and Data Science in the Automotive Industry

Data science and machine learning are the key technologies when it comes to the processes and products with automatic learning and optimization to be used in the automotive industry of the future. This article defines the terms “data science” (also referred to as “data analytics”) and “machine learning” and how they are related. In addition, it defines the term “optimizing analytics“ and illustrates the role of automatic optimization as a key technology in combination with data analytics. It also uses examples to explain the way that these technologies are currently being used in the automotive industry on the basis of the major subprocesses in the automotive value chain (development, procurement; logistics, production, marketing, sales and after-sales, connected customer). Since the industry is just starting to explore the broad range of potential uses for these technologies, visionary application examples are used to illustrate the revolutionary possibilities that they offer. Finally, the article demonstrates how these technologies can make the automotive industry more efficient and enhance its customer focus throughout all its operations and activities, extending from the product and its development process to the customers and their connection to the product.

Read this article in German:
“Künstliche Intelligenz und Data Science in der Automobilindustrie“


Table of Contents

1 Introduction

2 The Data Mining Process

3 The pillars of artificial intelligence
3.1 Maschine Learning
3.2 Computer Vision
3.3 Inference and decision-making
3.4 Language and communication
3.5 Agents and Actions

4 Data mining and artificial intelligence in the automotive industry
4.1 Development
4.2 Procurement
4.3 Logistics
4.4 Production
4.5 Marketing
4.6 Sales, After Sales, and Retail
4.7 Connected Customer

5 Vision
5.1 Autonomous vehicles
5.2 Integrated factory optimization
5.3 Companies acting autonomously

6 Conclusions

7 Authors

8 Sources

Authors:

  • Dr. Martin Hofmann (CIO – Volkswagen AG)
  • Dr. Florian Neukart (Principal Data Scientist – Volkswagen AG)
  • Prof. Dr. Thomas Bäck (Universität Leiden)

1 Introduction

Data science and machine learning are now key technologies in our everyday lives, as we can see in a multitude of applications, such as voice recognition in vehicles and on cell phones, automatic facial and traffic sign recognition, as well as chess and, more recently, Go machine algorithms[1] which humans can no longer beat. The analysis of large data volumes based on search, pattern recognition, and learning algorithms provides insights into the behavior of processes, systems, nature, and ultimately people, opening the door to a world of fundamentally new possibilities. In fact, the now already implementable idea of autonomous driving is virtually a tangible reality for many drivers today with the help of lane keeping assistance and adaptive cruise control systems in the vehicle.

The fact that this is just the tip of the iceberg, even in the automotive industry, becomes readily apparent when one considers that, at the end of 2015, Toyota and Tesla’s founder, Elon Musk, each announced investments amounting to one billion US dollars in artificial intelligence research and development almost at the same time. The trend towards connected, autonomous, and artificially intelligent systems that continuously learn from data and are able to make optimal decisions is advancing in ways that are simply revolutionary, not to mention fundamentally important to many industries. This includes the automotive industry, one of the key industries in Germany, in which international competitiveness will be influenced by a new factor in the near future – namely the new technical and service offerings that can be provided with the help of data science and machine learning.

This article provides an overview of the corresponding methods and some current application examples in the automotive industry. It also outlines the potential applications to be expected in this industry very soon. Accordingly, sections 2 and 3 begin by addressing the subdomains of data mining (also referred to as “big data analytics”) and artificial intelligence, briefly summarizing the corresponding processes, methods, and areas of application and presenting them in context. Section 4 then provides an overview of current application examples in the automotive industry based on the stages in the industry’s value chain –from development to production and logistics through to the end customer. Based on such an example, section 5 describes the vision for future applications using three examples: one in which vehicles play the role of autonomous agents that interact with each other in cities, one that covers integrated production optimization, and one that describes companies themselves as autonomous agents.

Whether these visions will become a reality in this or any other way cannot be said with certainty at present – however, we can safely predict that the rapid rate of development in this area will lead to the creation of completely new products, processes, and services, many of which we can only imagine today. This is one of the conclusions drawn in section 6, together with an outlook regarding the potential future effects of the rapid rate of development in this area.

2 The data mining process

Gartner uses the term “prescriptive analytics“ to describe the highest level of ability to make business decisions on the basis of data-based analyses. This is illustrated by the question “what should I do?” and prescriptive analytics supplies the required decision-making support, if a person is still involved, or automation if this is no longer the case.

The levels below this, in ascending order in terms of the use and usefulness of AI and data science, are defined as follows: descriptive analytics (“what has happened?”), diagnostic analytics (“why did it happen?”), and predictive analytics (“what will happen?”) (see Figure 1). The last two levels are based on data science technologies, including data mining and statistics, while descriptive analytics essentially uses traditional business intelligence concepts (data warehouse, OLAP).

In this article, we seek to replace the term “prescriptive analytics“ with the term “optimizing analytics.“ The reason for this is that a technology can “prescribe” many things, while, in terms of implementation within a company, the goal is always to make something “better” with regard to target criteria or quality criteria. This optimization can be supported by search algorithms, such as evolutionary algorithms in nonlinear cases and operation research (OR) methods in – much rarer – linear cases. It can also be supported by application experts who take the results from the data mining process and use them to draw conclusions regarding process improvement. One good example are the decision trees learned from data, which application experts can understand, reconcile with their own expert knowledge, and then implement in an appropriate manner. Here too, the application is used for optimizing purposes, admittedly with an intermediate human step.

Within this context, another important aspect is the fact that multiple criteria required for the relevant application often need to be optimized at the same time, meaning that multi-criteria optimization methods – or, more generally, multi-criteria decision-making support methods – are necessary. These methods can then be used in order to find the best possible compromises between conflicting goals. The examples mentioned include the frequently occurring conflicts between cost and quality, risk and profit, and, in a more technical example, between the weight and passive occupant safety of a body.

Figure 1: The four levels of data analysis usage within a company

These four levels form a framework, within which it is possible to categorize data analysis competence and potential benefits for a company in general. This framework is depicted in Figure 1 and shows the four layers which build upon each other, together with the respective technology category required for implementation.

The traditional Cross-Industry Standard Process for Data Mining (CRISP-DM)[2] includes no optimization or decision-making support whatsoever. Instead, based on the business understanding, data understanding, data preparation, modeling, and evaluation sub-steps, CRISP proceeds directly to the deployment of results in business processes. Here too, we propose an additional optimization step that in turn comprises multi-criteria optimization and decision-making support. This approach is depicted schematically in Figure 2.

Figure 2: Traditional CRISP-DM process with an additional optimization step

It is important to note that the original CRISP model deals with a largely iterative approach used by data scientists to analyze data manually, which is reflected in the iterations between business understanding and data understanding as well as data preparation and modeling. However, evaluating the modeling results with the relevant application experts in the evaluation step can also result in having to start the process all over again from the business understanding sub-step, making it necessary to go through all the sub-steps again partially or completely (e.g., if additional data needs to be incorporated).

The manual, iterative procedure is also due to the fact that the basic idea behind this approach  – as up-to-date as it may be for the majority of applications – is now almost 20 years old and certainly only partially compatible with a big data strategy. The fact is that, in addition to the use of nonlinear modeling methods (in contrast to the usual generalized linear models derived from statistical modeling) and knowledge extraction from data, data mining rests on the fundamental idea that models can be derived from data with the help of algorithms and that this modeling process can run automatically for the most part – because the algorithm “does the work.”

In applications where a large number of models need to be created, for example for use in making forecasts (e.g., sales forecasts for individual vehicle models and markets based on historical data), automatic modeling plays an important role. The same applies to the use of online data mining, in which, for example, forecast models (e.g., for forecasting product quality) are not only constantly used for a production process, but also adapted (i.e., retrained) continuously whenever individual process aspects change (e.g., when a new raw material batch is used). This type of application requires the technical ability to automatically generate data, and integrate and process it in such a way that data mining algorithms can be applied to it. In addition, automatic modeling and automatic optimization are necessary in order to update models and use them as a basis for generating optimal proposed actions in online applications. These actions can then be communicated to the process expert as a suggestion or – especially in the case of continuous production processes – be used directly to control the respective process. If sensor systems are also integrated directly into the production process – to collect data in real time – this results in a self-learning cyber-physical system [3] that facilitates implementation of the Industry 4.0[4] vision in the field of production engineering.

Figure 3: Architecture of an Industry 4.0 model for optimizing analytics

This approach is depicted schematically in Figure 3. Data from the system is acquired with the help of sensors and integrated into the data management system. Using this as a basis, forecast models for the system’s relevant outputs (quality, deviation from target value, process variance, etc.) are used continuously in order to forecast the system’s output. Other machine learning options can be used within this context in order, for example, to predict maintenance results (predictive maintenance) or to identify anomalies in the process. The corresponding models are monitored continuously and, if necessary, automatically retrained if any process drift is observed. Finally, the multi-criteria optimization uses the models to continuously compute optimum setpoints for the system control.  Human process experts can also be integrated here by using the system as a suggestion generator so that a process expert can evaluate the generated suggestions before they are implemented in the original system.

In order to differentiate it from “traditional” data mining, the term “big data” is frequently defined now with three (sometimes even four or five) essential characteristics: volume, velocity, and variety, which refer to the large volume of data, the speed at which data is generated, and the heterogeneity of the data to be analyzed, which can no longer be categorized into the conventional relational database schema. Veracity, i.e., the fact that large uncertainties may also be hidden in the data (e.g., measurement inaccuracies), and finally value, i.e., the value that the data and its analysis represents for a company’s business processes, are often cited as additional characteristics. So it is not just the pure data volume that distinguishes previous data analytics methods from big data, but also other technical factors that require the use of new methods– such as Hadoop and MapReduce – with appropriately adapted data analysis algorithms in order to allow the data to be saved and processed. In addition, so-called “in-memory databases” now also make it possible to apply traditional learning and modeling algorithms in main memory to large data volumes.

This means that if one were to establish a hierarchy of data analysis and modeling methods and techniques, then, in very simplistic terms, statistics would be a subset of data mining, which in turn would be a subset of big data. Not every application requires the use of data mining or big data technologies. However, a clear trend can be observed, which indicates that the necessities and possibilities involved in the use of data mining and big data are growing at a very rapid pace as increasingly large data volumes are being collected and linked across all processes and departments of a company. Nevertheless, conventional hardware architecture with additional main memory is often more than sufficient for analyzing large data volumes in the gigabyte range.

Although optimizing analytics is of tremendous importance, it is also crucial to always be open to the broad variety of applications when using artificial intelligence and machine learning algorithms. The wide range of learning and search methods, with potential use in applications such as image and language recognition, knowledge learning, control and planning in areas such as production and logistics, among many others, can only be touched upon within the scope of this article.

3 The pillars of artificial intelligence

An early definition of artificial intelligence from the IEEE Neural Networks Council was “the study of how to make computers do things at which, at the moment, people are better.”[5] Although this still applies, current research is also focused on improving the way that software does things at which computers have always been better, such as analyzing large amounts of data. Data is also the basis for developing artificially intelligent software systems not only to collect information, but also to:

  • Learn
  • Understand and interpret information
  • Behave adaptively
  • Plan
  • Make inferences
  • Solve problems
  • Think abstractly
  • Understand and interpret ideas and language

3.1 Machine learning

At the most general level, machine learning (ML) algorithms can be subdivided into two categories: supervised and unsupervised, depending on whether or not the respective algorithm requires a target variable to be specified.

Supervised learning algorithms

Apart from the input variables (predictors), supervised learning algorithms also require the known target values (labels) for a problem. In order to train an ML model to identify traffic signs using cameras, images of traffic signs – preferably with a variety of configurations – are required as input variables. In this case, light conditions, angles, soiling, etc. are compiled as noise or blurring in the data; nonetheless, it must be possible to recognize a traffic sign in rainy conditions with the same accuracy as when the sun is shining. The labels, i.e., the correct designations, for such data are normally assigned manually. This correct set of input variables and their correct classification constitute a training data set. Although we only have one image per training data set in this case, we still speak of multiple input variables, since ML algorithms find relevant features in training data and learn how these features and the class assignment for the classification task indicated in the example are associated. Supervised learning is used primarily to predict numerical values (regression) and for classification purposes (predicting the appropriate class), and the corresponding data is not limited to a specific format – ML algorithms are more than capable of processing images, audio files, videos, numerical data, and text. Classification examples include object recognition (traffic signs, objects in front of a vehicle, etc.), face recognition, credit risk assessment, voice recognition, and customer churn, to name but a few.

Regression examples include determining continuous numerical values on the basis of multiple (sometimes hundreds or thousands) input variables, such as a self-driving car calculating its ideal speed on the basis of road and ambient conditions, determining a financial indicator such as gross domestic product based on a changing number of input variables (use of arable land, population education levels, industrial production, etc.), and determining potential market shares with the introduction of new models. Each of these problems is highly complex and cannot be represented by simple, linear relationships in simple equations. Or, to put it another way that more accurately represents the enormous challenge involved: the necessary expertise does not even exist.

Unsupervised learning algorithms

Unsupervised learning algorithms do not focus on individual target variables, but instead have the goal of characterizing a data set in general. Unsupervised ML algorithms are often used to group (cluster) data sets, i.e., to identify relationships between individual data points (that can consist of any number of attributes) and group them into clusters. In certain cases, the output from unsupervised ML algorithms can in turn be used as an input for supervised methods. Examples of unsupervised learning include forming customer groups based on their buying behavior or demographic data, or clustering time series in order to group millions of time series from sensors into groups that were previously not obvious.

In other words, machine learning is the area of artificial intelligence (AI) that enables computers to learn without being programmed explicitly. Machine learning focuses on developing programs that grow and change by themselves as soon as new data is provided. Accordingly, processes that can be represented in a flowchart are not suitable candidates for machine learning – in contrast, everything that requires dynamic and changing solution strategies and cannot be constrained to static rules is potentially suitable for solution with ML. For example, ML is used when:

  • No relevant human expertise exists
  • People are unable to express their expertise
  • The solution changes over time
  • The solution needs to be adapted to specific cases

In contrast to statistics, which follows the approach of making inferences based on samples, computer science is interested in developing efficient algorithms for solving optimization problems, as well as in developing a representation of the model for evaluating inferences. Methods frequently used for optimization in this context include so-called “evolutionary algorithms” (genetic algorithms, evolution strategies), the basic principles of which emulate natural evolution[6]. These methods are very efficient when applied to complex, nonlinear optimization problems.

Even though ML is used in certain data mining applications, and both look for patterns in data, ML and data mining are not the same thing. Instead of extracting data that people can understand, as is the case with data mining, ML methods are used by programs to improve their own understanding of the data provided. Software that implements ML methods recognizes patterns in data and can dynamically adjust the behavior based on them. If, for example, a self-driving car (or the software that interprets the visual signal from the corresponding camera) has been trained to initiate a braking maneuver if a pedestrian appears in front it, this must work with all pedestrians regardless of whether they are short, tall, fat, thin, clothed, coming from the left, coming from the right, etc. In turn, the vehicle must not brake if there is a stationary garbage bin on the side of the road.

The level of complexity in the real world is often greater than the level of complexity of an ML model, which is why, in most cases, an attempt is made to subdivide problems into subproblems and then apply ML models to these subproblems.  The output from these models is then integrated in order to permit complex tasks, such as autonomous vehicle operation, in structured and unstructured environments.

3.2 Computer vision

Computer vision (CV) is a very wide field of research that merges scientific theories from various fields (as is often the case with AI), starting from biology, neuroscience, and psychology and extending all the way to computer science, mathematics, and physics. First, it is important to know how an image is produced physically. Before light hits sensors in a two-dimensional array, it is refracted, absorbed, scattered, or reflected, and an image is produced by measuring the intensity of the light beams through each element in the image (pixel). The three primary focuses of CV are:

  • Reconstructing a scene and the point from which the scene is observed based on an image, an image sequence, or a video.
  • Emulating biological visual perception in order to better understand which physical and biological processes are involved, how the wetware works, and how the corresponding interpretation and understanding work.
  • Technical research and development focuses on efficient, algorithmic solutions – when it comes to CV software, problem-specific solutions that only have limited commonalities with the visual perception of biological organisms are often developed.

All three areas overlap and influence each other. If, for example, the focus in an application is on obstacle recognition in order to initiate an automated braking maneuver in the event of a pedestrian appearing in front of the vehicle, the most important thing is to identify the pedestrian as an obstacle. Interpreting the entire scene – e.g., understanding that the vehicle is moving towards a family having a picnic in a field – is not necessary in this case. In contrast, understanding a scene is an essential prerequisite if context is a relevant input, such as is the case when developing domestic robots that need to understand that an occupant who is lying on the floor not only represents an obstacle that needs to be evaded, but is also probably not sleeping and a medical emergency is occurring.

Vision in biological organisms is regarded as an active process that includes controlling the sensor and is tightly linked to successful performance of an action[7]. Consequently,[8] CV systems are not passive either. In other words, the system must:

  • Be continuously provided with data via sensors (streaming)
  • Act based on this data stream

Having said that, the goal of CV systems is not to understand scenes in images – first and foremost, the systems must extract the relevant information for a specific task from the scene. This means that they must identify a “region of interest” that will be used for processing. Moreover, these systems must feature short response times, since it is probable that scenes will change over time and that a heavily delayed action will not achieve the desired effect. Many different methods have been proposed for object recognition purposes (“what” is located “where” in a scene), including:

  • Object detectors, in which case a window moves over the image and a filter response is determined for each position by comparing a template and the sub-image (window content), with each new object parameterization requiring a separate scan. More sophisticated algorithms simultaneously make calculations based on various scales and apply filters that have been learned from a large number of images.
  • Segment-based techniques extract a geometrical description of an object by grouping pixels that define the dimensions of an object in an image. Based on this, a fixed feature set is computed, i.e., the features in the set retain the same values even when subjected to various image transformations, such as changes in light conditions, scaling, or rotation. These features are used to clearly identify objects or object classes, one example being the aforementioned identification of traffic signs.
  • Alignment-based methods use parametric object models that are trained on data[9],[10]. Algorithms search for parameters, such as scaling, translation, or rotation, that adapt a model optimally to the corresponding features in the image, whereby an approximated solution can be found by means of a reciprocal process, i.e., by features, such as contours, corners, or others, “selecting” characteristic points in the image for parameter solutions that are compatible with the found feature.

With object recognition, it is necessary to decide whether algorithms need to process 2-D or 3-D representations of objects – 2-D representations are very frequently a good compromise between accuracy and availability. Current research (deep learning) shows that even distances between two points based on two 2-D images captured from different points can be accurately determined as an input. In daylight conditions and with reasonably good visibility, this input can be used in addition to data acquired with laser and radar equipment in order to increase accuracy – moreover, a single camera is sufficient to generate the required data. In contrast to 3-D objects, no shape, depth, or orientation information is directly encoded in 2-D images. Depth can be encoded in a variety of ways, such as with the use of laser or stereo cameras (emulating human vision) and structured light approaches (such as Kinect). At present, the most intensively pursued research direction involves the use of superquadrics – geometric shapes defined with formulas, which use any number of exponents to identify structures such as cylinders, cubes, and cones with round or sharp edges. This allows a large variety of different basic shapes to be described with a small set of parameters. If 3-D images are acquired using stereo cameras, statistical methods (such as generating a stereo point cloud) are used instead of the aforementioned shape-based methods, because the data quality achieved with stereo cameras is poorer than that achieved with laser scans.

Other research directions include tracking[11],[12], contextual scene understanding,[13],[14] and monitoring[15], although these aspects are currently of secondary importance to the automotive industry.

3.3 Inference and decision-making

This field of research, referred to in the literature as “knowledge representation & reasoning” (KRR), focuses on designing and developing data structures and inference algorithms. Problems solved by making inferences are very often found in applications that require interaction with the physical world (humans, for example), such as generating diagnostics, planning, processing natural languages, answering questions, etc. KRR forms the basis for AI at the human level.

Making inferences is the area of KRR in which data-based answers need to be found without human intervention or assistance, and for which data is normally presented in a formal system with distinct and clear semantics. Since 1980, it has been assumed that the data involved is a mixture of simple and complex structures, with the former having a low degree of computational complexity and forming the basis for research involving large databases. The latter are presented in a language with more expressive power, which requires less space for representation, and they correspond to generalizations and fine-grained information.

Decision-making is a type of inference that revolves primarily around answering questions regarding preferences between activities, for example when an autonomous agent attempts to fulfill a task for a person. Such decisions are very frequently made in a dynamic domain which changes over the course of time and when actions are executed. An example of this is a self-driving car that needs to react to changes in traffic.

Logic and combinatorics

Mathematical logic is the formal basis for many applications in the real world, including calculation theory, our legal system and corresponding arguments, and theoretical developments and evidence in the field of research and development. The initial vision was to represent every type of knowledge in the form of logic and use universal algorithms to make inferences from it, but a number of challenges arose – for example, not all types of knowledge can be represented simply. Moreover, compiling the knowledge required for complex applications can become very complex, and it is not easy to learn this type of knowledge in a logical, highly expressive language.[16] In addition, it is not easy to make inferences with the required highly expressive language – in extreme cases, such scenarios cannot be implemented computationally, even if the first two challenges are overcome. Currently, there are three ongoing debates on this subject, with the first one focusing on the argument that logic is unable to represent many concepts, such as space, analogy, shape, uncertainty, etc., and consequently cannot be included as an active part in developing AI to a human level. The counterargument states that logic is simply one of many tools. At present, the combination of representative expressiveness, flexibility, and clarity cannot be achieved with any other method or system. The second debate revolves around the argument that logic is too slow for making inferences and will therefore never play a role in a productive system. The counterargument here is that ways exist to approximate the inference process with logic, so processing is drawing close to remaining within the required time limits, and progress is being made with regard to logical inference. Finally, the third debate revolves around the argument that it is extremely difficult, or even impossible, to develop systems based on logical axioms into applications for the real world. The counterarguments in this debate are primarily based on the research of individuals currently researching techniques for learning logical axioms from natural-language texts.

In principle, a distinction is made between four different types of logic[17] which are not discussed any further in this article:

  • Propositional logic
  • First-order predicate logic
  • Modal logic
  • Non-monotonic logic

Automated decision-making, such as that found in autonomous robots (vehicles), WWW agents, and communications agents, is also worth mentioning at this point. This type of decision-making is particularly relevant when it comes to representing expert decision-making processes with logic and automating them. Very frequently, this type of decision-making process takes account of the dynamics of the surroundings, for example when a transport robot in a production plant needs to evade another transport robot. However, this is not a basic prerequisite, for example, if a decision-making process without a clearly defined direction is undertaken in future, e.g., the decision to rent a warehouse at a specific price at a specific location. Decision-making as a field of research encompasses multiple domains, such as computer science, psychology, economics, and all engineering disciplines. Several fundamental questions need to be answered to enable development of automated decision-making systems:

  • Is the domain dynamic to the extent that a sequence of decisions is required or static in the sense that a single decision or multiple simultaneous decisions need to be made?
  • Is the domain deterministic, non-deterministic, or stochastic?
  • Is the objective to optimize benefits or to achieve a goal?
  • Is the domain known to its full extent at all times? Or is it only partially known?

Logical decision-making problems are non-stochastic in nature as far as planning and conflicting behavior are concerned. Both require that the available information regarding the initial and intermediate states be complete, that actions have exclusively deterministic, known effects, and that a specific defined goal exists. These problem types are often applied in the real world, for example in robot control, logistics, complex behavior in the WWW, and in computer and network security.

In general, planning problems consist of an initial (known) situation, a defined goal, and a set of permitted actions or transitions between steps. The result of a planning process is a sequence or set of actions that, when executed correctly, change the executing entity from an initial state to a state that meets the target conditions. Computationally speaking, planning is a difficult problem, even if simple problem specification languages are used. Even when relatively simple problems are involved, the search for a plan cannot run through all state-space representations, as these are exponentially large in the number of states that define the domains. Consequently, the aim is to develop efficient algorithms that represent sub-representations in order to search through these with the hope of achieving the relevant goal. Current research is focused on developing new search methods and new representations for actions and states, which will make planning easier. Particularly when one or more agents acting against each other are taken into account, it is crucial to find a balance between learning and decision-making – exploration for the sake of learning while decisions are being made can lead to undesirable results.

Many problems in the real world are problems with dynamics of a stochastic nature. One example of this is buying a vehicle with features that affect its value, of which we are unaware. These dependencies influence the buying decision, so it is necessary to allow risks and uncertainties to be considered. For all intents and purposes, stochastic domains are more challenging when it comes to making decisions, but they are also more flexible than deterministic domains with regard to approximations – in other words, simplifying practical assumptions makes automated decision-making possible in practice. A great number of problem formulations exist, which can be used to represent various aspects and decision-making processes in stochastic domains, with the best-known being decision networks and Markov decision processes.

Many applications require a combination of logical (non-stochastic) and stochastic elements, for example when the control of robots requires high-level specifications in logic and low-level representations for a probabilistic sensor model. Processing natural languages is another area in which this assumption applies, since high-level knowledge in logic needs to be combined with low-level models of text and spoken signals.

3.4 Language and communication

In the field of artificial intelligence, processing language is considered to be of fundamental importance, with a distinction being made here between two fields: computational linguistics (CL) and natural language processing (NLP). In short, the difference is that CL research focuses on using computers for language processing purposes, while NLP consists of all applications, including machine translation (MT), Q&A, document summarization, information extraction, to name but a few. In other words, NLP requires a specific task and is not a research discipline per se. NLP comprises:

  • Part-of-speech tagging
  • Natural language understanding
  • Natural language generation
  • Automatic summarization
  • Named-entity recognition
  • Parsing
  • Voice recognition
  • Sentiment analysis
  • Language, topic, and word segmentation
  • Co-reference resolution
  • Discourse analysis
  • Machine translation
  • Word sense disambiguation
  • Morphological segmentation
  • Answers to questions
  • Relationship extraction
  • Sentence splitting

The core vision of AI says that a version of first-order predicate logic (“first-order predicate calculus” or “FOPC”) supported by the necessary mechanisms for the respective problem is sufficient for representing language and knowledge. This thesis says that logic can and should supply the semantics underlying natural language. Although attempts to use a form of logical semantics as the key to representing contents have made progress in the field of AI and linguistics, they have had little success with regard to a program that can translate English into formal logic. To date, the field of psychology has also failed to provide proof that this type of translation into logic corresponds to the way in which people store and manipulate “meaning.” Consequently, the ability to translate a language into FOPC continues to be an elusive goal. Without a doubt, there are NLP applications that need to establish logical inferences between sentence representations, but if these are only one part of an application, it is not clear that they have anything to do with the underlying meaning of the corresponding natural language (and consequently with CL/NLP), since the original task for logical structures was inference. These and other considerations have crystallized into three different positions:

  • Position 1: Logical inferences are tightly linked to the meaning of sentences, because knowing their meaning is equivalent to deriving inferences and logic is the best way to do this.
  • Position 2: A meaning exists outside logic, which postulates a number of semantic markers or primes that are appended to words in order to express their meaning – this is prevalent today in the form of annotations.
  • Position 3: In general, the predicates of logic and formal systems only appear to be different from human language, but their terms are in actuality the words as which they appear

The introduction of statistical and AI methods into the field is the latest trend within this context. The general strategy is to learn how language is processed – ideally in the way that humans do this, although this is not a basic prerequisite. In terms of ML, this means learning based on extremely large corpora that have been translated manually by humans. This often means that it is necessary to learn (algorithmically) how annotations are assigned or how part-of-speech categories (the classification of words and punctuation marks in a text into word types) or semantic markers or primes are added to corpora, all based on corpora that have been prepared by humans (and are therefore correct). In the case of supervised learning, and with reference to ML, it is possible to learn potential associations of part-of-speech tags with words that have been annotated by humans in the text, so that the algorithms are also able to annotate new, previously unknown texts. [18] This works the same way for lightly supervised and unsupervised learning, such as when no annotations have been made by humans and the only data presented is a text in a language with texts with identical contents in other languages or when relevant clusters are found in thesaurus data without there being a defined goal. [19] With regard to AI and language, information retrieval (IR) and information extraction (IE) play a major role and correlate very strongly with each other. One of the main tasks of IR is grouping texts based on their content, whereas IE extracts similarly factual elements from texts or is used to be able to answer questions concerning text contents. These fields therefore correlate very strongly with each other, since individual sentences (not only long texts) can also be regarded as documents. These methods are used, for example, in interactions between users and systems, such as when a driver asks the on-board computer a question regarding the owner’s manual during a journey – once the language input has been converted into text, the question’s semantic content is used as the basis for finding the answer in the manual, and then for extracting the answer and returning it to the driver.

3.5 Agents and actions

In traditional AI, people focused primarily on individual, isolated software systems that acted relatively inflexibly to predefined rules. However, new technologies and applications have established a need for artificial entities that are more flexible, adaptive, and autonomous, and that act as social units in multi-agent systems. In traditional AI (see also “physical symbol system hypothesis”[20] that has been embedded into so-called “deliberative” systems), an action theory that establishes how systems make decisions and act is represented logically in individual systems that must execute actions. Based on these rules, the system must prove a theorem – the prerequisite here being that the system must receive a description of the world in which it currently finds itself, the desired target state, and a set of actions, together with the prerequisites for executing these actions and a list of the results for each action. It turned out that the computational complexity involved rendered any system with time limits useless even when dealing with simple problems, which had an enormous impact on symbolic AI, resulting in the development of reactive architectures. These architectures follow if-then rules that translate inputs directly into tasks. Such systems are extremely simple, although they can solve very complex tasks. The problem is that such systems learn procedures rather than declarative knowledge, i.e., they learn attributes that cannot easily be generalized for similar situations. Many attempts have been made to combine deliberative and reactive systems, but it appears that it is necessary to focus either on impractical deliberative systems or on very loosely developed reactive systems – focusing on both is not optimal.

Principles of the new, agent-centered approach

The agent-oriented approach is characterized by the following principles:

  • Autonomous behavior:

“Autonomy” describes the ability of systems to make their own decisions and execute tasks on behalf of the system designer. The goal is to allow systems to act autonomously in scenarios where controlling them directly is difficult. Traditional software systems execute methods after these methods have been called, i.e., they have no choice, whereas agents make decisions based on their beliefs, desires, and intentions (BDI)[21].

  • Adaptive behavior:

Since it is impossible to predict all the situations that agents will encounter, these agents must be able to act flexibly. They must be able to learn from and about their environment and adapt accordingly. This task is all the more difficult if not only nature is a source of uncertainty, but the agent is also part of a multi-agent system. Only environments that are not static and self-contained allow for an effective use of BDI agents – for example, reinforcement learning can be used to compensate for a lack of knowledge of the world. Within this context, agents are located in an environment that is described by a set of possible states. Every time an agent executes an action, it is “rewarded” with a numerical value that expresses how good or bad the action was. This results in a series of states, actions, and rewards, and the agent is compelled to determine a course of action that entails maximization of the reward.

  • Social behavior:

In an environment where various entities act, it is necessary for agents to recognize their adversaries and form groups if this is required by a common goal. Agent-oriented systems are used for personalizing user interfaces, as middleware, and in competitions such as the RoboCup. In a scenario where there are only self-driving cars on roads, the individual agent’s autonomy is not the only indispensable component – car2car communications, i.e., the exchange of information between vehicles and acting as a group on this basis, are just as important. Coordination between the agents results in an optimized flow of traffic, rendering traffic jams and accidents virtually impossible (see also section 5.1, “Vehicles as autonomous, adaptive, and social agents & cities as super-agents”).

In summary, this agent-oriented approach is accepted within the AI community as the direction of the future.

Multi-agent behavior

Various approaches are being pursued for implementing multi-agent behavior, with the primary difference being in the degree of control that designers have over individual agents.[22],[23],[24] A distinction is made here between:

  • Distributed problem-solving systems (DPS)
  • Multi-agent systems (MAS)

DPS systems allow the designer to control each individual agent in the domain, with the solution to the task being distributed among multiple agents. In contrast, MAS systems have multiple designers, each of whom can only influence their own agents with no access to the design of any other agent. In this case, the design of the interaction protocols is extremely important. In DPS systems, agents jointly attempt to achieve a goal or solve a problem, whereas, in MAS systems, each agent is individually motivated and wants to achieve its own goal and maximize its own benefit. The goal of DPS research is to find collaboration strategies for problem-solving, while minimizing the level of communication required for this purpose. Meanwhile, MAS research is looking at coordinated interaction, i.e., how autonomous agents can be brought to find a common basis for communication and undertake consistent actions.[25] Ideally, a world in which only self-driving cars use the road would be a DPS world. However, the current competition between OEMs means that a MAS world will come into being first. In other words, communication and negotiation between agents will take center stage (see also Nash equilibrium).

Multi-agent learning

Multi-agent learning (MAL) has only relatively recently been bestowed a certain degree of attention.[26],[27],[28],[29] The key problems in this area include determining which techniques should be used and what exactly “multi-agent learning” means. Current ML approaches were developed in order to train individual agents, whereas MAL focuses first and foremost on distributed learning. “Distributed” does not necessarily mean that a neural network is used, in which many identical operations run during training and can accordingly be parallelized, but instead that:

  • A problem is split into subproblems and individual agents learn these subproblems in order to solve the main problem using their combined knowledge OR
  • Many agents try to solve the same problem independently of each other by competing with each other

Reinforcement learning is one of the approaches being used in this context.[30]

4 Data mining and artificial intelligence in the automotive industry

At a high level of abstraction, the value chain in the automotive industry can broadly be described with the following subprocesses:

  1. Development
  2. Procurement
  3. Logistics
  4. Production
  5. Marketing
  6. Sales, after-sales, and retail
  7. Connected customer

Each of these areas already features a significant level of complexity, so the following description of data mining and artificial intelligence applications has necessarily been restricted to an overview.

4.1 Development

Vehicle development has become a largely virtual process that is now the accepted state of the art for all manufacturers. CAD models and simulations (typically of physical processes, such as mechanics, flow, acoustics, vibration, etc., on the basis of finite element models) are used extensively in all stages of the development process.

The subject of optimization (often with the use of evolution strategies[31] or genetic algorithms and related methods) is usually less well covered, even though it is precisely here in the development process that it can frequently yield impressive results. Multi-disciplinary optimization, in which multiple development disciplines (such as occupant safety and noise, vibration, and harshness (NVH)) are combined and optimized simultaneously, is still rarely used in many cases due to supposedly excessive computation time requirements. However, precisely this approach offers enormous potential when it comes to agreeing more quickly and efficiently across the departments involved on a common design that is optimal in terms of the requirements of multiple departments.

In terms of the analysis and further use of simulation results, data mining is already being used frequently to generate co-called “response surfaces.” In this application, data mining methods (the entire spectrum, ranging from linear models to Gaussian processes, support vector machines, and random forests) are used in order to learn a nonlinear regression model as an approximation of the representation of the input vectors for the simulation based on the relevant (numerical) simulation results[32]. Since this model needs to have good interpolation characteristics, cross-validation methods that allow the model’s prediction quality for new input vectors to be estimated are typically used for training the algorithms. The goal behind this use of supervised learning methods is frequently to replace computation-time-consuming simulations with a fast approximation model that, for example, represents a specific component and can be used in another application. In addition, this allows time-consuming adjustment processes to be carried out faster and with greater transparency during development.

One example: It is desirable to be able to immediately evaluate the forming feasibility[33] of geometric variations in components during the course of an interdepartmental meeting instead of having to run complex simulations and wait one or two days for the results. A response surface model that has been previously trained using simulations can immediately provide a very good approximation of the risk of excessive thinning or cracks in this type of meeting, which can then be used immediately for evaluating the corresponding geometry.

These applications are frequently focused on or limited to specific development areas, which, among other reasons, is due to the fact that simulation data management, in its role as a central interface between data generation and data usage and analysis, constitutes a bottleneck. This applies especially when simulation data is intended for use across multiple departments, variants, and model series, as is essential for real use of data in the sense of a continuously learning development organization. The current situation in practice is that department-specific simulation data is often organized in the form of file trees in the respective file system within a department, which makes it difficult to access for an evaluation based on machine learning methods. In addition, simulation data may already be very voluminous for an individual simulation (in the range of terabytes for the latest CFD simulations), so efficient storage solutions are urgently required for machine-learning-based analyses.

While simulation and the use of nonlinear regression models limited to individual applications have become the standard, the opportunities offered by optimizing analytics are rarely being exploited. Particularly with regard to such important issues as multi-disciplinary (as well as cross-departmental) machine learning, learning based on historical data (in other words, learning from current development projects for future projects), and cross-model learning, there is an enormous and completely untapped potential for increasing efficiency.

4.2 Procurement

The procurement process uses a wide variety of data concerning suppliers, purchase prices, discounts, delivery reliability, hourly rates, raw material specifications, and other variables. Consequently, computing KPIs for the purpose of evaluating and ranking suppliers poses no problem whatsoever today. Data mining methods allow the available data to be used, for example, to generate forecasts, to identify important supplier characteristics with the greatest impact on performance criteria, or to predict delivery reliability. In terms of optimizing analytics, the specific parameters that an automotive manufacturer can influence in order to achieve optimum conditions are also important.

Overall, the finance business area is a very good field for optimizing analytics, because the available data contains information about the company’s main success factors. Continuous monitoring[34] is worth a brief mention as an example, here with reference to controlling. This monitoring is based on finance and controlling data, which is continuously prepared and reported. This data can also be used in the sense of predictive analytics in order to automatically generate forecasts for the upcoming week or month. In terms of optimizing analytics, analyses of the key influencing parameters, together with suggested optimizing actions, can also be added to the aforementioned forecasts.

These subject areas are more of a vision than a reality at present, but they do convey an idea of what could be possible in the fields of procurement, finance, and controlling.

4.3 Logistics

In the field of logistics, a distinction can be made between procurement logistics, production logistics, distribution logistics, and spare parts logistics.

Procurement logistics considers the process chain extending from the purchasing of goods through to shipment of the material to the receiving warehouse. When it comes to the purchasing of goods, a large amount of historical price information is available for data mining purposes, which can be used to generate price forecasts and, in combination with delivery reliability data, to analyze supplier performance. As for shipment, optimizing analytics can be used to identify and optimize the key cost factors.

A similar situation applies to production logistics, which deals with planning, controlling, and monitoring internal transportation, handling, and storage processes. Depending on the granularity of the available data, it is possible to identify bottlenecks, optimize stock levels, and minimize the time required, for example here.

Distribution logistics deals with all aspects involved in transporting products to customers, and can refer to both new and used vehicles for OEMs. Since the primary considerations here are the relevant costs and delivery reliability, all the subcomponents of the multimodal supply chain need to be taken into account – from rail to ship and truck transportation through to subaspects such as the optimal combination of individual vehicles on a truck. In terms of used-vehicle logistics, optimizing analytics can be used to assign vehicles to individual distribution channels (e.g., auctions, Internet) on the basis of a suitable, vehicle-specific resale value forecast in order to maximize total sale proceeds. GM implemented this approach as long ago as 2003 in combination with a forecast of expected vehicle-specific sales revenue[35].

In spare parts logistics, i.e., the provision of spare parts and their storage, data-driven forecasts of the number of spare parts needing to be held in stock depending on model age and model (or sold volumes) are one important potential application area for data mining, because it can significantly decrease the storage costs.

As the preceding examples show, data analytics and optimization must frequently be coupled with simulations in the field of logistics, because specific aspects of the logistics chain need to be simulated in order to evaluate and optimize scenarios. Another example is the supplier network, which, when understood in greater depth, can be used to identify and avoid critical paths in the logistics chain, if possible. This is particularly important, as the failure of a supplier to make a delivery on the critical path would result in a production stoppage for the automaker. Simulating the supplier network not only allows this type of bottleneck to be identified, but also countermeasures to be optimized. In order to facilitate a simulation that is as detailed and accurate as possible, experience has shown that mapping all subprocesses and interactions between suppliers in detail becomes too complex, as well as nontransparent for the automobile manufacturer, as soon as attempts are made to include Tier 2 and Tier 3 suppliers as well.

This is why data-driven modeling should be considered as an alternative. When this approach is used, a model is learned from the available data about the supplier network (suppliers, products, dates, delivery periods, etc.) and the logistics (stock levels, delivery frequencies, production sequences) by means of data mining methods. The model can then be used as a forecast model in order, for example, to predict the effects of a delivery delay for specific parts on the production process. Furthermore, the use of optimizing analytics in this case makes it possible to perform a worst-case analysis, i.e., to identify the parts and suppliers that would bring about production stoppages the fastest if their delivery were to be delayed. This example very clearly shows that optimization, in the sense of scenario analysis, can also be used to determine the worst-case scenario for an automaker (and then to optimize countermeasures in future).

4.4 Production

Every sub-step of the production process will benefit from the consistent use of data mining. It is therefore essential for all manufacturing process parameters to be continuously recorded and stored. Since the main goal of optimization is usually to improve quality or reduce the incidence of defects, data concerning the defects that occur and the type of defect is required, and it must be possible to clearly assign this data to the process parameters. This approach can be used to achieve significant improvements, particularly in new types of production process – one example being CFRP[36]. Other potential optimization areas include energy consumption and the throughput of a production process per time unit. Optimizing analytics can be applied both offline and online in this context.

When used in offline applications, the analysis identifies variables that have a significant influence on the process. Furthermore, correlations are derived between these influencing variables and their targets (quality, etc.) and, if applicable, actions are also derived from this, which can improve the targets. Frequently, such analyses focus on a specific problem or an urgent issue with the process and can deliver a solution very efficiently – however, they are not geared towards continuous process optimization. Conducting the analyses and interpreting and implementing the results consistently requires manual sub-steps that can be carried out by data scientists or statisticians – usually in consultation with the respective process experts.

In the case of online applications, there is a very significant difference in the fact that the procedure is automated, resulting in completely new challenges for data acquisition and integration, data pre-processing, modeling, and optimization. In these applications, even the provision of process and quality data needs to be automated, as this provides integrated data that can be used as a basis for modeling at any time. This is crucial given that modeling always needs to be performed when changes to the process (including drift) are detected. The resulting forecast models are then used automatically for optimization purposes and are capable of, e.g., forecasting the quality and suggesting (or directly implementing) actions for optimizing the relevant target variable (quality in this case) even further. This implementation of optimizing analytics, with automatic modeling and optimization, is technically available, although it is more a vision than a reality for most users today.

The potential applications include forming technology (conventional as well as for new materials), car body manufacture, corrosion protection, painting, drive trains, and final assembly, and can be adapted to all sub-steps. An integrated analysis of all process steps, including an analysis of all potential influencing factors and their impact on overall quality, is also conceivable in future – in this case, it would be necessary to integrate the data from all subprocesses.

4.5 Marketing

The focus in marketing is to reach the end customer as efficiently as possible and to convince people either to become customers of the company or to remain customers. The success of marketing activities can be measured in sales figures, whereby it is important to differentiate marketing effects from other effects, such as the general financial situation of customers. Measuring the success of marketing activities can therefore be a complex endeavor, since multivariate influencing factors can be involved.

It would also be ideal if optimizing analytics could always be used in marketing, because optimization goals, such as maximizing return business from a marketing activity, maximizing sales figures while minimizing the budget employed, optimizing the marketing mix, and optimizing the order in which things are done, are all vital concerns. Forecast models, such as those for predicting additional sales figures over time as a result of a specific marketing campaign, are only one part of the required data mining results – multi-criteria decision-making support also plays a decisive role in this context.

Two excellent examples of the use of data mining in marketing are the issues of churn (customer turnover) and customer loyalty. In a saturated market, the top priority for automakers is to prevent loss of custom, i.e., to plan and implement optimal countermeasures. This requires information that is as individualized as possible concerning the customer, the customer segment to which the customer belongs, the customer’s satisfaction and experience with their current vehicle, and data concerning competitors, their models, and prices. Due to the subjectivity of some of this data (e.g., satisfaction surveys, individual satisfaction values), individualized churn predictions and optimal countermeasures (e.g., personalized discounts, refueling or cash rewards, incentives based on additional features) are a complex subject that is always relevant.

Since maximum data confidentiality is guaranteed and no personal data is recorded – unless the customer gives their explicit consent in order to receive offers as individually tailored as possible – such analyses and optimizations are only possible at the level of customer segments that represent the characteristics of an anonymous customer subset.

Customer loyalty is closely related to this subject, and takes on board the question of how to retain and optimize, i.e., increase the loyalty of existing customers. Likewise, the topic of “upselling,” i.e., the idea of offering existing customers a higher-value vehicle as their next one and being successful with this offer, is always associated with this. It is obvious that such issues are complex, as they require information about customer segments, marketing campaigns, and correlated sales successes in order to facilitate analysis. However, this data is mostly not available, difficult to collect systematically, and characterized by varying levels of veracity, i.e., uncertainty in the data.

Similar considerations apply to optimizing the marketing mix, including the issue of trade fair participation. In this case, data needs to be collected over longer periods of time, so that it can be evaluated and conclusions can be drawn. For individual marketing campaigns such as mailing campaigns, evaluating the return business rate with regard to the characteristics of the selected target group is a much more likely objective of a data analysis and corresponding campaign optimization.

In principle, very promising potential applications for optimizing analytics can also be found in the marketing field. However, the complexity involved in data collection and data protection, as well as the partial inaccuracy of data collected, means that a long-term approach with careful planning of the data collection strategy is required. The issue becomes even more complex if “soft” factors such as brand image also need to be taken into account in the data mining process – in this case, all data has a certain level of uncertainty, and the corresponding analyses (“What are the most important brand image drivers?” “How can the brand image be improved?”) are more suitable for determining trends than drawing quantitative conclusions. Nevertheless, within the scope of optimization, it is possible to determine whether an action will have a positive or negative impact, thereby allowing the direction to be determined, in which actions should go.

4.6 Sales, after-sales, and retail

The diversity of potential applications and existing applications in this area is significant. Since the “human factor,” embodied by the end customer, plays a crucial role within this context, it is not only necessary to take into account objective data such as sales figures, individual price discounts, and dealer campaigns; subjective customer data such as customer satisfaction analyses based on surveys or third-party market studies covering such subjects as brand image, breakdown rates, brand loyalty, and many others may also be required. At the same time, it is often necessary to procure and integrate a variety of data sources, make them accessible for analysis, and finally analyze them correctly in terms of the potential subjectivity of the evaluations[37] – a process that currently depends to a large extent on the expertise of the data scientists conducting the analysis.

The field of sales itself is closely intermeshed with marketing. After all, the ultimate objective is to measure the success of marketing activities in terms of turnover based on sales figures. A combined analysis of marketing activities (including distribution among individual media, placement frequency, costs of the respective marketing activities, etc.) and sales can be used to optimize market activities in terms of cost and effectiveness, in which case a portfolio-based approach is always used. This means that the optimum selection of a portfolio of marketing activities and their scheduling – and not just focusing on a single marketing activity – is the main priority. Accordingly, the problem here comes from the field of multi-criteria decision-making support, in which decisive breakthroughs have been made in recent years thanks to the use of evolutionary algorithms and new, portfolio-based optimization criteria. However, applications in the automotive industry are still restricted to a very limited scope.

Similarly, customer feedback, warranty repairs, and production are potentially intermeshed as well, since customer satisfaction can be used to derive soft factors and warranty repairs can be used to derive hard factors, which can then be coupled with vehicle-specific production data and analyzed. In this way, factors that affect the occurrence of quality defects not present or foreseeable at the factory can be determined. This makes it possible to forecast such quality defects and use optimizing analytics to reduce their occurrence. Nevertheless, it is also necessary to combine data from completely different areas – production, warranty, and after-sales – in order to make it accessible to the analysis.

In the case of used vehicles, residual value plays a vital role in a company’s fleet or rental car business, as the corresponding volumes of tens of thousands of vehicles are entered into the balance sheet as assets with the corresponding residual value. Today, OEMs typically transfer this risk to banks or leasing companies, although these companies may in turn be part of the OEM’s corporate group. Data mining and, above all, predictive analytics can play a decisive role here in the correct evaluation of assets, as shown by an American OEM as long as ten years ago[38]. Nonlinear forecasting models can be used with the company’s own sales data to generate individualized, equipment-specific residual value forecasts at the vehicle level, which are much more accurate than the models currently available as a market standard. This also makes it possible to optimize distribution channels – even as far as geographically assigning used vehicles to individual auction sites at the vehicle level – in such a way as to maximize a company’s overall sales success on a global basis.

Considering sales operations in greater detail, it is obvious that knowledge regarding each individual customer’s interests and preferences when buying a vehicle or, in future, temporarily using available vehicles, is an important factor. The more individualized the knowledge concerning the sociodemographic factors for a customer, their buying behavior, or even their clicking behavior on the OEM’s website, as well as their driving behavior and individual use of a vehicle, the more accurately it will be possible to address their needs and provide them with an optimal offer for a vehicle (suitable model with appropriate equipment features) and its financing.

4.7 Connected customer

While this term is not yet established as such at present, it does describe a future in which both the customer and their vehicle are fully integrated with state-of-the-art information technology. This aspect is closely linked to marketing and sales issues, such as customer loyalty, personalized user interfaces, vehicle behavior in general, and other visionary aspects (see also section 5). With a connection to the Internet and by using intelligent algorithms, a vehicle can react to spoken commands and search for answers that, for example, can communicate directly with the navigation system and change the destination. Communication between vehicles makes it possible to collect and exchange information on road and traffic conditions, which is much more precise and up-to-date than that which can be obtained via centralized systems. One example is the formation of black ice, which is often very localized and temporary, and which can be detected and communicated in the form of a warning to other vehicles very easily today.

5 Vision

Vehicle development already makes use of “modular systems” that allow components to be used across multiple model series. At the same time, development cycles are becoming increasingly shorter. Nevertheless, the field of virtual vehicle development has not yet seen any effective attempts to use machine learning methods in order to facilitate automatic learning that extracts both knowledge that is built upon other historical knowledge and knowledge that applies to more than one model series so as to assist with future development projects and organizing them more efficiently. This topic is tightly intermeshed with that of data management, the complexity of data mining in simulation and optimization data, and the difficulty in defining a suitable representation of knowledge concerning vehicle development aspects. Furthermore, this approach is restricted by the organizational limitations of the vehicle development process, which is often still exclusively oriented towards the model being developed. Moreover, due to the heterogeneity of data (often numerical data, but also images and videos, e.g., from flow fields) and the volume of data (now in the terabyte range for a single simulation), the issue of “data mining in simulation data” is extremely complex and, at best, the object of tentative research approaches at this time[39].

New services are becoming possible due to the use of predictive maintenance. Automatically learned knowledge regarding individual driving behavior – i.e., annual, seasonal, or even monthly mileages, as well as the type of driving – can be used to forecast intervals for required maintenance work (brake pads, filters, oil, etc.) with great accuracy. Drivers can use this information to schedule garage appointments in a timely manner, and the vision of a vehicle that can schedule garage appointments by itself in coordination with the driver’s calendar – which is accessible to the on-board computer via appropriate protocols – is currently more realistic than the often cited refrigerator that automatically reorders groceries.

In combination with automatic optimization, the local authorized repair shop, in its role as a central coordination point where individual vehicle service requests arrive via an appropriate telematics interface, can optimally schedule service appointments in real time – keeping workloads as evenly distributed as possible while taking staff availability into account, for example.

With regard to the vehicle’s learning and adaptation abilities, there is virtually limitless potential. Vehicles can identify and classify their drivers’ driving behavior – i.e., assign them to a specific driver type. Based on this, the vehicles themselves can make adjustments to systems ranging from handling to the electronic user interface – in other words, they can offer drivers individualization and adaptation options that extend far beyond mere equipment features. The learned knowledge about the driver can then be transferred to a new vehicle when one is purchased, ensuring that the driver’s familiar environment is immediately available again.

5.1 Vision – Vehicles as autonomous, adaptive, and social agents & cities as super-agents

Research into self-driving cars is here to stay in the automotive industry, and the “mobile living room” is no longer an implausible scenario, but is instead finding a more and more positive response. Today, the focus of development is on autonomy, and for good reason: In most parts of the world, self-driving cars are not permitted on roads, and if they are, they are not widespread. This means that the vehicle as an agent cannot communicate with all other vehicles, and that vehicles driven by humans adjust their behavior based on the events in their drivers’ field of view. Navigation systems offer support by indicating traffic congestion and suggesting alternative routes. However, we now assume that every vehicle is a fully connected agent, with the two primary goals of:

  • Contributing to optimizing the flow of traffic
  • Preventing accidents

In this scenario, agents communicate with each other and negotiate routes with the goal of minimizing total travel time (obvious parameters being, for example, the route distance, the possible speed, roadworks, etc.). Unforeseeable events are minimized, although not eliminated completely – for example, storm damage would still result in a road being blocked. This type of information would then need to be communicated immediately to all vehicles in the relevant action area, after which a new optimization cycle would be required. Moreover, predictive maintenance minimizes damage to vehicles. Historical data is analyzed and used to predict when a defect would be highly likely to occur, and the vehicle (the software in the vehicle, i.e., the agent) makes a service appointment without requiring any input from the passenger and then drives itself to the repair shop –all made possible with access to the passenger’s calendar. In the event of damage making it impossible to continue a journey, this would also be communicated as quickly as possible – either with a “breakdown” broadcast or to a control center, and a self-driving tow truck would be immediately available to provide assistance, ideally followed by a (likewise self-driving) replacement vehicle. In other words, vehicles act:

  • Autonomously in the sense that they automatically follow a route to a destination
  • Adaptively in the sense that they can react to unforeseen events, such as road closures and breakdowns
  • Socially in the sense that they work together to achieve the common goals of optimizing the flow of traffic and preventing accidents (although the actual situation is naturally more complex and many subgoals need to be defined in order for this to be achieved).

In combination with taxi services that would also have a self-driving fleet, it would be possible to use travel data and information on the past use of taxi services (provided that the respective user gives their consent) in order to send taxis to potential customers at specific locations without the need for these customers to actively request the taxis. In a simplified form, it would also be possible to implement this in an anonymized manner, for example, by using data to identify locations where taxis are frequently needed (as identified with the use of clusters; see also section 3.1, “Machine learning”) at specific times or for specific events (rush hour, soccer matches, etc.).

If roads become digital as well, i.e., if asphalt roads are replaced with glass and supplemented with OLED technology, dynamic changes to traffic management would also be possible. From a materials engineering perspective, this is feasible:

  • The surface structure of glass can be developed in such a way as to be skid resistant, even in the rain.
  • Glass can be designed to be so flexible and sturdy that it will not break, even when trucks drive over it.
  • The waste heat emitted by displays can be used to heat roads and prevent ice from forming during winter.

In this way, cities themselves can be embedded as agents in the multi-agent environment and help achieve the defined goals.

5.2 Vision – integrated factory optimization

By using software to analyze customer and repair shop reports and repair data concerning defects occurring in the field, we can already automatically analyze whether an increase in defects can be expected for specific vehicle models or installed parts. The goal here is to identify and avoid potential problems at an early stage, before large-scale recall actions need to be initiated. Causes of defects in the field can be manifold, including deficient quality of the parts being used or errors during production, which, together with the fact that thousands of vehicles leave Volkswagen production plants every day, makes it clear that acting quickly is of utmost importance. Assuming that at present (November 2015), the linguistic analysis of customer statements and repair shop reports shows that a significant increase in right-hand-side parking light failures can be expected for model x, platform C vehicles delivered from July 2015 onwards. In this case, “significant” means that a statistically verifiable upwards trend (increase in reported malfunctions) can be extracted based on vehicle sales between January 2015 and November 2015. By analyzing fault chains and repair chains, it is possible to determine which events will result in a fault or defect or which other models are or will be affected. If the error in production is caused by a production robot, for example, this can be traced back to a hardware fault and/or software error or to an incorrect or incomplete configuration. In the worst-case scenario, it may even be necessary to update the control system in order to eliminate the error. In addition, it is not possible to update software immediately, because work on patches can only begin after the manufacturer has received and reviewed the problem report. Likewise, reconfiguring the robot can be a highly complex task due to the degrees of freedom (axes of rotation) that multi-axis manipulators have at their disposal. In short, making such corrections is time-consuming, demanding, and, in all but ideal scenarios, results in subsequent issues.

Artificial intelligence (AI) approaches can be used to optimize this process at several points.

One of the areas being addressed by AI research is enabling systems (where the term “system” is a synonym for “a whole consisting of multiple individual parts,” especially in the case of software or combinations of hardware and software, such as those found in industrial robots) to automatically extract and interpret knowledge from data, although the extent of this is still limited at present.

Figure 4 – Data, knowledge, action

In contrast to data, knowledge can form the basis for an action, and the result of an action can be fed back into data, which then forms the basis for new knowledge, and so on.

If an agent with the ability to learn and interpret data is supplied with the results (state of the world before the action, state of the world after the action; see also section 3.5) of its own actions or of the actions of other agents, the agent, provided it has a goal and the freedom to adapt as necessary, will attempt to achieve its goal autonomously. Biological agents, such as humans and animals, do this intuitively without needing to actively control or monitor the process of transforming data into knowledge. If, for instance, wood in a DIY project splits because we hammered in a nail too hard at an excessively acute angle, our brain subconsciously transforms the angle, the material’s characteristics, and the force of the hammer blow into knowledge and experience, minimizing the likelihood of us repeating the same mistake.

In the previously discussed, specific area of artificial intelligence referred to as “machine learning,” research is focused on emulating such behavior. Using ML to enable software to learn from data in a specific problem domain and to infer how to solve new events on the basis of past events opens up a world of new possibilities. ML is nothing new in the field of data analysis, where it has been used for many years now. What is new is the possibility to compute highly complex models with data volumes in the petabyte range within a specific time limit. If one thinks of a production plant as an organism pursuing the objective of producing defect-free vehicles, it is clear that granting this organism access to relevant data would help the organism with its own development and improvement, provided, of course, that this organism has the aforementioned capabilities.

Two stages of development are relevant in this case:

Stage 1 – Learning from data and applying experiences

In order to learn from data, a robot must not just operate according to static programming, it must also be able to use ML methods to work autonomously towards defined learning goals. With regard to any production errors that may occur, this means, first and foremost, that the actions being carried out that result in these errors will have been learned, and not programmed based on a flowchart and an event diagram. Assume, for example, that the aforementioned parking light problem has not only been identified, but that its cause can also been traced back to an issue in production, e.g., a robot that is pushing a headlamp into its socket too hard. All that’s now required is to define the learning goal for the corrective measure. Let us also assume that the production error is not occurring with robots in other production plants, and that left-hand headlamps are being installed correctly in general.  In the best-case scenario, we, as humans, would be able to visually recognize and interpret the difference between robots that are working correctly and robots that are not – and the robot making the mistake should be able to learn in a similar way. The difference here is in the type of perception involved – digital systems can “see” much better than us in such cases. Even though the internal workings of ML methods implemented by means of software are rarely completely transparent during the learning process – even for the developer of the learning system – due to the stochastic components and complexity involved, the action itself is transparent, i.e., not how a system does something, but what it does. These signals need to be used in order to initiate the learning process anew and to adapt the control system of the problematic robot. In the aforementioned case, these would be the manipulator and effector motion signals of a robot that is working correctly, which can be measured and defined with any desired level of accuracy. This does not require any human intervention, as the system’s complete transparency is ensured by continuously securing and analyzing the data accrued in the production process. Neither is any human analysis required in the identification and transmission of defects from the field. Based on linguistic analyses of repair shop and customer reports, together with repair data, we can already swiftly identify which problems are attributable to production. The corresponding delivery of this data to the relevant agents (the data situation makes it possible to determine exactly which machine needs to correct itself) allows these agents to learn from the defects and correct themselves.

Stage 2 – Overcoming the limitations of programming – smart factories as individuals

What if the production plant needs to learn things for which even the flexibility of one or more ML methods used by individual agents (such as production or handling robots) is insufficient? Just like a biological organism, a production plant could act as a separate entity composed of subcomponents, similarly to a human, who can be addressed using natural language, understands context, and is capable of interpreting this. The understanding and interpretation of context have always been a challenge in the field of AI research. AI theory views context as a shared (or common) interpretation of a situation, with the context of a situation and the context of an entity relative to a situation being relevant here. Contexts relevant to a production plant include everything that is relevant to production when expressed in natural language or any other way. The following simplified scenario helps in understanding the concept: Let us assume that the final design for a car body is agreed upon by a committee during a meeting.

“We decided on the following body for the Golf 15 facelift. Please build a prototype accordingly based on the Golf 15,” says Mr. Müller while looking at the 3-D model that seems to be floating in front of everyone at the meeting and can only be seen with augmented reality glasses.

In this scenario, use of evolutionary algorithms for simulation is conceivable, limited to the possible combinations that can actually be built. Provided that the required computing power is available and the parameters involved have been reduced, this can cut simulation times from several hours to minutes, making dynamic morphing of components or component combinations possible during a meeting.

Factory : “Based on the model input, I determined that it will take 26 minutes to adjust the programming of my robots. In order to assemble the floor assembly, tools x1, y1 must be replaced with tools x2, y2 on robots x, y. Production of the prototype will be completed in 6 hours and 37 minutes.”

Of course, this scenario is greatly simplified, but it should still show what the future may hold. In order to understand what needs to be done, the production plant must understand what a car body is, what a facelift is, etc. and interpret the parameters and output from a simulation in such a way that they can be converted into production steps. This conversion into production steps requires the training of individual ML components on the robots or the adaptation/enhancement of their programs based on the simulation data, so that all steps can be carried out, from cutting sheet metal to assembling and integrating the (still fictitious) Golf 15 basic variant.  And while this encompasses a wide range of methods, extending from natural language understanding and language generation through to planning, optimization, and autonomous model generation, it is by no means mere science fiction.

5.3 Vision – companies acting autonomously

When planning marketing activities or customer requirements, for example, it is imperative for companies of all types to monitor how sales change over time, to predict how markets will develop and which customers will potentially be lost, to respond to financial crises, and to quickly interpret the potential impact of catastrophes or political structures. We already do all this today, and what we need for it is data. We are not interested in the personal data of individuals, but in what can be derived from many individual components. For example, by analyzing over 1,600 indicators, we can predict how certain financial indicators for markets will move and respond accordingly or we can predict, with a high probability of being correct, which customer groups find models currently in pre-production development appealing and then derive marketing actions accordingly. In fact, we can go so far as to determine fully configured models to suit the tastes of specific customer groups. With this knowledge, we make decisions, adjust our production levels, prepare marketing plans, and propose fully configured models appropriate for specific customer groups in the configurator.

Preparing a marketing plan sometimes follows a static process (what needs to be done), but how something is done remains variable. As soon as it is possible to explain to another person how and why something is being done, this information can also be made available to algorithms. Breaking it down into an example, we can predict that one of our competitors opening a new production plant in a country where we already have manufacturing operations would result in us having to expect a drop in our sales. We can even predict (within a certain fluctuation range) how large this expected drop in sales would be. In this case, “we” refers to the algorithms that we have developed for this specific use case. Since this situation occurs more than once and requires (virtually) identical input parameters every time, we can use the same algorithms to predict events in other countries. This makes it possible to use knowledge from past marketing campaigns in order to conduct future campaigns. In short, algorithms would prepare the marketing plans.

When provided with a goal, such as maximizing the benefit for our customers while taking account of cost-effectiveness, algorithm sub-classes would be able to take internal data (such as sales figures and configurator data) and external data (such as stock market trends, financial indicators, political structures) to autonomously generate output for a “marketing program” or “GDP program.” If a company were allowed to use its resources and act autonomously, then it would be able to react autonomously to fluctuations in markets, subsidize vulnerable suppliers, and much more. The possibilities are wide-ranging, and such scenarios can already be technically conceived today. Continuously monitoring stock prices, understanding and interpreting news items, and taking into account demographic changes are only a few of the areas that are relevant and that, among other things, require a combination of natural language understanding, knowledge-based systems, and the ability to make logical inferences. In the field of AI research, language and visual information are very frequently used as the basis for understanding things, because we humans also learn and understand a great deal using language and visual stimuli.

6 Conclusions

Artificial intelligence has already found its way into our daily lives, and is no longer solely the subject of science fiction novels. At present, AI is used primarily in the following areas:

  • Analytical data processing
  • Domains in which qualified decisions need to be made quickly on the basis of a large amount of (often heterogeneous) data
  • Monotonous activities that still require constant alertness

In the field of analytical data processing, the next few years will see us transition from exclusive use of decision-support systems to additional use of systems that make decisions on our behalf. Particularly in the field of data analysis, we are currently developing individual analytical solutions for specific problems, although these solutions cannot be used across different contexts – for example, a solution developed to detect anomalies in stock price movements cannot be used to understand the contents of images. This will remain the case in the future, although AI systems will integrate individual interacting components and consequently be able to take care of increasingly complex tasks that are currently reserved exclusively for humans – a clear trend that we can already observe today. A system that not only processes current data regarding stock markets, but that also follows and analyzes the development of political structures based on news texts or videos, extracts sentiments from texts in blogs or social networks, monitors and predicts relevant financial indicators, etc. requires the integration of many different subcomponents – getting these to interact and cooperate is the subject of current research, and new advances in this field are being published every week. In a world where AI systems are able to improve themselves continuously and, for example, manage companies more effectively than humans, what would be left for humans? Time for expanding one’s knowledge, improving society, eradicating hunger, eliminating diseases, and spreading our species beyond our own solar system.[40] Some theories say that quantum computers are required in order to develop powerful AI systems[41], and only a very careless person would suggest than an effective quantum computer will be available within the next 10 years. Then again, only a careless person would suggest that it will not. Regardless of this, and as history has taught us time and time again with the majority of relevant scientific accomplishments, caution will also have to be exercised when implementing artificial intelligence – systems capable of making an exponentially larger number of decisions in extremely short times as hardware performance improves can achieve many positive things, but they can also be misused.

Authors

Dr. Martin Hofmann

Dr. Martin Hofmann is Executive Vice President of the Volkswagen Group and Group CIO at Volkswagen AG. One of his most prominent initiatives has been the creation of Information Technology Labs in San Francisco, Berlin and Munich, which are considered leading in the field of data science, applied AI and Machine Learning in the automotive industry. He has a long history at the Volkswagen Group, joining in 2001, where he took charge of Group Procurement Process and Information Management.

Previously, he worked at the international IT service provider Electronic Data Systems Corporation (EDS) where he held several senior management positions and served as Executive Director Digital Supply Chain in the United States.

Dr. Hofmann graduated Harvard Business School AMP, has a PhD in engineering from the ETH Zurich and a degree in business computer science and business administration from the University of Mannheim.

Dr. Florian Neukart

Dr. Florian Neukart is Principal Data Scientist at Volkswagen Group of America. He is lecturer and scientist in the fields of quantum computers and artificial intelligence at University of Leiden. Furthermore, he is the author of “Reverse Engineering the Mind – Consciously Acting Machines and Accelerated Evolution”.

Prof. Dr. Thomas Bäck

Prof. Dr. Thomas Bäck is scientist for global optimization, predictive analytics and Industry 4.0. As an entrepreneur, he has more than 20 years experience in industrial applications with companies such as 3M, Air Liquide, BMW, Daimler, Unilever and Volkswagen. Furthermore, he is an author of more than 300 scientific publications, e.g. “Evolutionary Algorithms in Theory and Practice” and co-inventor of 4 patents.

Sources

[1] D. Silver et. al.: Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 529, 484-489 (January 28, 2016).

[2] https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

[3] Systems “in which information and software components are connected to mechanical and electronic components and in which data is transferred and exchanged, and monitoring and control tasks are carried out, in real-time using infrastructures such as the Internet.” (Translation of the following article in Gabler Wirtschaftslexikon, Springer:  http://wirtschaftslexikon.gabler.de/Definition/cyber-physische-systeme.html).

[4] Industry 4.0 is defined therein as “a marketing term that is also used in science communication and refers to a ‘future project’ of the German federal government. The so-called ‘Fourth Industrial Revolution’ is characterized by the customization and hybridization of products and the integration of customers and business partners into business processes.” (Translation of the following article in Gabler Wirtschaftslexikon, Springer: http://wirtschaftslexikon.gabler.de/Definition/industrie-4-0.html).

[5] E. Rich, K. Knight: Artificial Intelligence, 5, 1990

[6] Th. Bäck, D.B. Fogel, Z. Michalewicz: Handbook of Evolutionary Computation, Institute of Physics Publishing, New York, 1997.

[7] R. Bajcsy: Active perception, Proceedings of the IEEE, 76:996-1005, 1988

[8] J. L. Crowley, H. I. Christensen: Vision as a Process: Basic Research on Computer Vision Systems, Berlin: Springer, 1995

[9] D. P. Huttenlocher, S. Ulman: Recognizing Solid Objects by Alignment with an Image, International Journal of Computer Vision, 5: 195-212, 1990

[10] K. Frankish, W. M. Ramsey: The Cambridge Handbook of Artificial Intelligence, Cambridge: Cambridge University Press, 2014

[11] F. Chaumette, S. Hutchinson: Visual Servo Control I: Basic Approaches, IEEE Robotics and Automation Magazine, 13(4): 82-90, 2006

[12] E. D. Dickmanns: Dynamic Vision for Perception and Control of Motion, London: Springer, 2007

[13] T. M. Straat, M. A. Fischler: Context-Based Vision: Recognizing Objects Using Information from Both 2D and 3D Imagery, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13: 1050-65, 1991

[14] D. Hoiem, A. A. Efros, M. Hebert: Putting Objects in Perspective, Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2137-44, 2006

[15] H. Buxton: Learning and Understanding Dynamic Scene Activity: A Review, Vision Computing, 21: 125-36, 2003

[16] N. Lavarac, S. Dzeroski: Inductive Logic Programming, Vol. 3: Non-Monotonic Reasoning and Uncertain Reasoning, Oxford University Press: Oxford, 1994

[17] K. Frankish, W. M. Ramsey: The Cambridge Handbook of Artificial Intelligence, Cambridge: Cambridge University Press, 2014

[18] G. Leech, R. Garside, M. Bryant: . CLAWS4: The Tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94) Kyoto, Japan, pp. 622-628, 1994

[19] K. Spärck Jones: Information Retrieval and Artificial Intelligence, Artificial Intelligence 141: 257-81, 1999

[20] A. Newell, H. A. Simon: Computer Science as Empirical Enquiry: Symbols and Search, Communications of the ACM 19:113-26

[21] M. Bratman, D. J. Israel, M. E. Pollack: Plans and Resource-Bounded Practical Reasoning, Computational Intelligence, 4: 156-72, 1988

[22] H. Ah. Bond, L. Gasser: Readings in Distributed Artificial Intelligence, San Mateo, CA: Morgan Kaufmann, 1988

[23] E. H. Durfee: Coordination for Distributed Problem Solvers, Boston, MA: Kluwer Academic, 1988

[24] G. Weiss: Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, Cambridge, MA: MIT Press, 1999

[25] K. Frankish, W. M. Ramsey: The Cambridge Handbook of Artificial Intelligence, Cambridge: Cambridge University Press, 2014

[26] E. Alonso: Multi-Agent Learning, Special Issue of Autonomous Agents and Multi-Agent Systems 15(1), 2007

[27] E. Alonso: M. d’Inverno, D. Kudenko, M. Luck, J. Noble: Learning in Multi-Agent Systems, Knowledge Engineering Review 16: 277-84, 2001

[28] M. Veloso, P. Stone: Multiagent Systems: A Survey from a Machine Learning Perspective, Autonomous Robots 8: 345-83, 2000

[29] R. Vohra, M. Wellmann: Foundations of Multi-Agent Learning, Artificial Intelligence 171:363-4, 2007

[30] L. Busoniu, R. Babuska, B. De Schutter: A Comprehensive Survey of Multi-Agent Reinforcement Learning, IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews 38: 156-72, 2008

[31] “Evolution strategies” are a variant of “evolutionary algorithms,” which has been developed in Germany. They offer significantly better performance than genetic algorithms for this type of task. See also Th. Bäck: Evolutionary Algorithms in Theory and Practice, Oxford University Press, NY, 1996.

[32] For example: Th. Bäck, C. Foussette, P. Krause: Automatische Metamodellierung von CAE-Simulationsmodellen (Automatic Meta-Modeling of CAE Simulation Models), ATZ – Automobiltechnische Zeitschrift 117(5), 64-69, 2015.

[33] For details: http://www.divis-gmbh.de/fileadmin/download/fallbeispiele/140311_Fallbeispiel_BMW_Machbarkeitsbewertung_Umformsimulation_DE.pdf

[34] This term is used in a great many ways and is actually very general in nature. We use it here in the sense of continuous monitoring of company KPIs that are automatically generated and analyzed on a weekly basis, for example.

[35] http://www.automotive-fleet.com/news/story/2003/02/gm-installs-nutechs-vehicle-distribution-system.aspx

[36] C. Sorg: Data Mining als Methode zur Industrialisierung und Qualifizierung neuer Fertigungsprozesse für CFK-Bauteile in automobiler Großserienproduktion (Data Mining as a Method for the Industrialization and Qualification of New Production Processes for CFRP Components in Large-Scale Automotive Production). Dissertation, Technical University of Munich, 2014. Dr. Hut Verlag.

[37] One example can be found in this article: http://www.enbis.org/activities/events/current/214_ENBIS_12_in_Ljubljana/programmeitem/1183_Ask_the_Right_Questions__or_Apply_Involved_Statistics__Thoughts_on_the_Analysis_of_Customer_Satisfaction_Data.

[38] http://www.syntragy.com/doc/q3-05%5B1%5D.pdf

[39] L. Gräning, B. Sendhoff: Shape Mining: A Holistic Data Mining Approach to Engineering Design. Advanced Engineering Informatics 28(2), 166-185, 2014.

[40] F. Neukart: Reverse-Engineering the Mind; see https://www.scribd.com/mobile/doc/195264056/Reverse-Engineering-the-Mind, 2014

[41] F. Neukart: On Quantum Computers and Artificial Neural Networks, Signal Processing Research 2(1), 2013