How to generate data for machine learning

In recent columns, I’ve been sharing my view on the quality of the data that many companies have in their data warehouses, lakes or swamps. In my experience, most of the data that companies have stored so carefully is useless and will never generate any value for the company. The data that is potentially useful tends to require vast amounts of preprocessing before it can be used, for instance for machine learning. As a consequence, in most data science teams, more than 90 percent of all time is spent on preprocessing the data before it can even be used for analytics or machine learning.

In a paper that we recently submitted, we studied this problem for system logs. Virtually any software-intensive system generates data capturing the state of the system and significant events at important points in time. The challenge is that, on the one hand, the data captured in logs is intended for human consumption and, consequently, exhibits high variability in the structure, content and type of information per log entry. On the other hand, the amount of data stored in logs is often phenomenally large. It’s not unusual for systems to generate gigabytes of data for even a single day of operation.

The obvious answer to this conundrum is to use machine learning to derive the relevant information from the system logs. However, this approach faces a number of significant challenges due to the way logs are generated. Based on our review of the literature and on company cases, we identified the following challenges.

Due to the way the data is generated, the logs require extensive preprocessing, which reduces their value. It’s also quite common that multiple system processes write to the same log file, complicating time series analysis and other machine learning techniques that assume sequential data. In addition, many systems generate multiple types of log files, and establishing a reliable ground truth requires combining data from several of them. These log files tend to contain data at fundamentally different levels of abstraction, which complicates the training of machine learning models. Even once we’re able to apply machine learning models to the preprocessed data, interpreting the results often requires extensive domain knowledge. Furthermore, developers are free to add new code that generates log entries in ad-hoc formats, and this changing format complicates the use of multiple logs for training as the logs aren’t necessarily comparable. Finally, any tools built to process log files, such as automated parsers, are very brittle, fail unpredictably and require constant maintenance.
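To make that last point concrete, here’s a minimal, hypothetical sketch of the kind of hand-written parsing that ad-hoc log formats force on teams. The log line, format and field names are invented purely for illustration; the point is that any change a developer makes to the wording silently breaks the parser.

```python
import re

# Hypothetical ad-hoc log line, formatted for human reading.
line = "2019-12-03 14:22:07 [worker-3] WARN temp=81.5C retry count: 2 (sensor OK)"

# A typical hand-written parser: one regex per known format.
# If a developer later rewords the entry or adds a field, the match fails silently.
PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<process>[^\]]+)\] (?P<level>\w+) "
    r"temp=(?P<temp>[\d.]+)C retry count: (?P<retries>\d+)"
)

def parse(line):
    match = PATTERN.match(line)
    if match is None:
        return None  # unparseable entry: silently dropped or left for manual inspection
    fields = match.groupdict()
    return {
        "timestamp": fields["ts"],
        "process": fields["process"],
        "level": fields["level"],
        "temperature": float(fields["temp"]),
        "retries": int(fields["retries"]),
    }

print(parse(line))
```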

We studied the problem specifically for system logs, but my experience is that our findings are quite typical for virtually any type of automated data generation. This is a huge problem for almost all companies that I work with, and enormous amounts of resources are spent on preprocessing data to get value out of it, but it’s a losing battle. The amount of data generated in any product, by customers, across the company, and so on, will only continue to go up. If we don’t address this problem, every data scientist, engineer and mathematician will soon be doing little else than preprocessing data.

'Data should be generated in such a way that preprocessing isn’t required at all'

The solution, as we propose in the paper, is quite simple: rather than first generating the data and then preprocessing it, we need to build software that generates data in such a format that preprocessing isn’t required at all. Any data should be generated in such a way that it can immediately and automatically be used for machine learning, preferably without any human intervention.
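As a minimal sketch of what generating ML-ready data at the source could look like, the system could emit fully structured, typed records instead of free-form text. The JSON-lines format and the field names below are my own assumptions for illustration, not a prescription from the paper.

```python
import json
import time

def log_event(path, process_id, level, temperature_c, retries):
    """Append one structured, machine-readable record (JSON lines) instead of free-form text."""
    record = {
        "timestamp": time.time(),   # numerical epoch time, no string parsing needed later
        "process_id": process_id,   # numerical identifier instead of a free-form name
        "level": level,             # e.g. 0=debug, 1=info, 2=warn, 3=error
        "temperature_c": temperature_c,
        "retries": retries,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_event("system_log.jsonl", process_id=3, level=2, temperature_c=81.5, retries=2)
```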

Accomplishing this goal is a bit more involved than what I can outline in this post, but there are a number of key elements that I believe are common to any approach aiming to achieve it. First, all data should be numerical. Second, all data of the nominal type (where the elements have no order nor relationship to each other) should be one-hot encoded, meaning that each element is mapped to a binary vector with one position per element type. Third, data of the ordinal type can use the same approach or, in the case of non-dichotomous data, use a variety of other encodings. Fourth, interval and ratio data needs to be normalized (mapped to a value between 0 and 1) for optimal use by machine and deep-learning algorithms. Fifth, where necessary, the statistical distribution of the data needs to be mapped to a standard Gaussian distribution for better training results.
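A minimal sketch of these steps, using scikit-learn and made-up features (the subsystem, temperature and response-time columns are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, QuantileTransformer

# Made-up raw records: a nominal feature (subsystem), an interval/ratio feature
# (temperature) and a heavily skewed ratio feature (response time).
subsystem = np.array([["motor"], ["sensor"], ["motor"], ["controller"]])
temperature = np.array([[35.0], [81.5], [47.2], [60.3]])
response_ms = np.array([[12.0], [950.0], [33.0], [120.0]])

# Second: nominal data is one-hot encoded into a binary vector per category.
onehot = OneHotEncoder().fit_transform(subsystem).toarray()

# Fourth: interval/ratio data is normalized to [0, 1].
temp_norm = MinMaxScaler().fit_transform(temperature)

# Fifth: where needed, a skewed distribution is mapped onto a standard Gaussian.
resp_gauss = QuantileTransformer(
    output_distribution="normal", n_quantiles=4
).fit_transform(response_ms)

# First: the result is purely numerical and can feed a model directly.
features = np.hstack([onehot, temp_norm, resp_gauss])
print(features)
```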

Accomplishing this at the point of data generation may require engineers and developers to interact with data scientists. In addition, it calls for alignment across the organization, which hasn’t been necessary up to now. However, doing so allows companies to build systems that can fully autonomously collect, train and retrain machine learning models and deploy these without any human involvement (see the figure).

Figure: System logging for machine learning
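As a deliberately toy sketch of what such a closed loop could look like once the data arrives ML-ready: all functions below are simplistic stand-ins I made up to show the shape of the cycle, not the architecture from the paper or its figure.

```python
import random
import statistics

def collect_new_data(n=100):
    # Stand-in for pulling the latest batch of already ML-ready measurements.
    return [random.gauss(50.0, 5.0) for _ in range(n)]

def train_model(data):
    # Toy "model": the mean of the batch stands in for real training.
    return statistics.mean(data)

def evaluate(model, data):
    # Toy score: negative mean absolute error on the batch.
    return -statistics.mean(abs(x - model) for x in data)

deployed_model = None
for cycle in range(3):  # a real system would run this continuously
    batch = collect_new_data()
    candidate = train_model(batch)
    if evaluate(candidate, batch) > -10.0:  # deploy only if the candidate is good enough
        deployed_model = candidate
print("deployed model:", deployed_model)
```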

Concluding, most data in most companies is useless because it was generated in the wrong way and without proper structure, encoding and standardization. This is especially problematic when the data is used for training machine learning models, as it requires extensive amounts of preprocessing. Rather than improving our data preprocessing activities, we need to generate data in a way that removes the need for any preprocessing at all. Data scientists and engineers would benefit from focusing on how data should be generated. Rather than trying to clean up the mess afterwards, let’s try to not create any mess to begin with.

AI is NOT big data analytics

During the big data era, one of the key tenets of successfully realizing your big data strategy was to create a central data warehouse or data lake where all data was stored. The data analysts could then run their analyses to their hearts’ content and find relevant correlations, outliers, predictive patterns and the like. In this scenario, everyone contributes their data to the data lake, after which a central data science department uses it to provide, typically executive, decision support (Figure 1).

Figure 1: Everyone contributes their data to the data lake, after which a central data science department uses it to provide, typically executive, decision support.

 

Although this looks great in theory, the reality in many companies is, of course, quite a bit different. We see at least four challenges. First, analyzing data from products and customers in the field often requires significant domain knowledge that data scientists in a central department typically lack. This easily results in incorrect interpretations of data and, consequently, inaccurate results.

Second, different departments and groups that collect data often do so in different ways, resulting in similar-looking data with different semantics. These can be minor differences, such as the frequency of data generation, e.g. seconds, minutes, hours or days, but also much larger ones, such as data concerning individual products in the field versus similar data concerning an entire product family in a specific category. As data scientists in a central department often seek to relate data from different sources, this easily causes incorrect conclusions to be drawn.

Third, especially with the increased adoption of DevOps, even the same source will, over time, generate different data. As the software evolves, the way data is generated typically changes with it, leading to similar challenges as outlined above. The result is that the promise of the big data era doesn’t always pan out in companies and almost never to the full extent that was expected at the start of the project.

Finally, gaining value from big data analytics requires a strong data science skillset and there simply aren’t that many people around who have it. Training your existing staff to become proficient in data science is quite challenging and most certainly harder than providing machine learning education to engineers and developers.

'Every team, business unit or product organization can start with AI'

Many in the industry believe that artificial intelligence applications, and especially machine and deep-learning models, suffer from the same challenges. However, even though both data analytics and ML/DL models are heavily based on data, the main difference is that for ML/DL, there’s no need to create a centralized data warehouse. Instead, every team, business unit or product organization can start with AI without any elaborate coordination with the rest of the company.

Each business unit can build its own ML/DL models and deploy these in the system or solution for which it’s responsible (Figure 2). The data can come from the data lake or from local data storage solutions, so you don’t even need to have adopted the centralized data storage approach before you start using ML/DL.
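As a minimal, hypothetical sketch of how small this can start: a single team trains a model directly on a CSV export it already produces for its own subsystem. The file name and columns below are made up for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical local data: a CSV the team already produces for its own subsystem,
# e.g. with columns temperature, load, vibration and a failed/not-failed label.
df = pd.read_csv("subsystem_metrics.csv")

X = df[["temperature", "load", "vibration"]]
y = df["failed"]

# Hold out part of the data to check that the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
```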

Figure 2: Each business unit can build its own ML/DL models and deploy these in the system or solution for which they’re responsible.

Concluding, AI is **not** data analytics and doesn’t require the same preconditions. Instead, you can start today, just using the data that you have available, even if you and your team are just working on a single function or subsystem. Artificial intelligence and especially deep learning offer amazing potential for reducing cost, as well as for creating new business opportunities. It’s the most exciting technology that has reached maturity in perhaps decades. Rather than waiting for the rest of the world to overtake you, start using AI and DL today.

Why your data is useless

Virtually all organizations I work with have terabytes or even petabytes of data stored in different databases and file systems. However, there’s a very interesting pattern I’ve started to recognize over recent months. On the one hand, the data that gets generated is almost always intended for human interpretation. Consequently, these files and databases contain lots of alphanumeric fields, comments and other unstructured data. On the other hand, the size of the stored data is so phenomenally large that it’s impossible for any human to make heads or tails of it.

The consequence is that enormous amounts of time are required to preprocess the data in order to make it usable for training machine learning models or for inference using already trained models. Data scientists at a number of companies have told me that they and their colleagues spend well over 90 percent of their time and energy on this.

'Most of the data is mud pretending to be oil'

For most organizations, therefore, the only way to generate any value from the vast amounts of data that are stored on their servers is to throw lots and lots of human resources at it. Since, oftentimes, the business case for doing so is unclear or insufficient, the only logical conclusion is that the vast majority of data that’s stored at companies is simply useless. It’s dead weight and will never generate any relevant business value. Although the saying is that “data is the new oil”, the reality is that most of it is mud pretending to be oil.

Even if the data is relevant, there are several challenges associated with using it in analytics or machine learning. The first is timeliness: if you have a data set of, say, customer behavior that’s 24, 12 or even only 6 months old, it’s highly likely that your customer base has evolved and that preferences and behaviors have changed, invalidating your data set.

Second, particularly in companies that release new software frequently, such as when using DevOps, the problem is that with every software version, the way data is generated may change. Especially when the data is generated for human consumption, e.g. engineers debugging systems in operation, it’s time-consuming to merge data sets that were produced by different versions of the software.

Third, in many organizations, multiple data sets are generated continuously, even by the same system. Deriving the information that’s actually relevant for the company frequently requires combining data from different sets. The challenge is that different data sets may not use the same way of timestamping entries, may store data at very different levels of abstraction and frequency and may evolve in very unpredictable ways. This makes combining the data labor-intensive and any automation developed for the purpose very brittle and likely to fail unpredictably.
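To make this concrete, here’s a minimal sketch, with made-up data, of what even the simplest combination of two sources with different timestamp conventions and frequencies already requires before any analysis can start:

```python
import pandas as pd

# Source A: per-second sensor readings with epoch timestamps.
sensors = pd.DataFrame({
    "ts": pd.to_datetime([1575380527, 1575380528, 1575380529], unit="s"),
    "temperature": [47.2, 47.5, 48.1],
})

# Source B: per-minute status entries with ISO-formatted timestamps.
status = pd.DataFrame({
    "ts": pd.to_datetime(["2019-12-03T13:42:00", "2019-12-03T13:43:00"]),
    "state": ["OK", "DEGRADED"],
})

# Align each high-frequency reading with the most recent status entry.
combined = pd.merge_asof(
    sensors.sort_values("ts"),
    status.sort_values("ts"),
    on="ts",
    direction="backward",
)
print(combined)
```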

My main message is that, rather than focusing on preprocessing data, we need to spend much more time and focus on how the data is produced in the first place. The goal should be to generate data such that it doesn’t require any preprocessing at all. This opens up a host of use cases and opportunities that I’ll discuss in future articles.

Concluding, for all the focus on data, the fact of the matter is that in most companies, most data is useless or requires prohibitive amounts of human effort to unlock the value that it contains. Instead, we should focus on how we generate data in the first place. The goal should be to do that in such a way that the data can be used for analytics and machine learning without any preprocessing. So, clean up the mess, get rid of the useless data and generate data in ways that actually make sense.

 

Get-together for group leads, department managers and HR managers: 10 February 2020

We are hereby pleased to invite team leads, department managers and HR managers to a get-together with our trainers on 10 February, 2020 at BCN in Eindhoven.

In a presentation of roughly half an hour, we will tell you all about our ‘Training Highlights’. Afterwards, our content partners and their trainers will be available for questions, over nibbles and drinks.

4:45 PM – Doors open
5:15 PM – Presentation ‘Training Highlights’ (by our content partners)
5:45 PM – Nibbles & drinks
6:15 PM – Announcement ‘Trainer of the year’
7:00 PM – Gifts + End of get-together

Registration is required, so please inform us about your presence by e-mail.

We look forward to welcoming you at this get-together.

The game plan for 2020

In reinforcement learning (a field within AI), algorithms need to learn about an unexplored space. These algorithms need to balance exploration (learning about new options and possibilities) with exploitation (using the acquired knowledge to generate a good outcome). The general rule of thumb is that the less is known about the problem domain, the more the algorithm should focus on exploration. Similarly, the better the problem domain is understood, the more the algorithm should focus on exploitation.
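A classic, minimal illustration of this balance is an epsilon-greedy multi-armed bandit. The reward probabilities below are made up purely to show the mechanics: a small fraction of choices is spent exploring, the rest exploiting the best option found so far.

```python
import random

# Made-up problem: three options ("arms") with unknown payout probabilities.
true_reward_prob = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1  # fraction of choices spent exploring; higher when less is known

total_reward = 0
for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                        # explore: try a random option
    else:
        arm = max(range(3), key=lambda a: estimates[a])  # exploit: pick the best known option
    reward = 1 if random.random() < true_reward_prob[arm] else 0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental average
    total_reward += reward

print("estimated payouts:", [round(e, 2) for e in estimates])
print("average reward:", total_reward / 10_000)
```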

The exploration/exploitation balance applies to companies too. Most companies have, for a long time, been operating in a business ecosystem that was stable and well understood. There were competitors, of course, but everyone basically behaved the same way, got access to new technologies at about the same time, responded to customers the same way, and so on. In such a context, a company naturally focuses more and more on exploitation as the reward for exploration is low. This is exactly what I see in many of the organizations I work with: for all the talk about innovation and business development, the result is almost always sustaining innovations that make the existing product or solution portfolio a bit better.

With digitalization and its constituent technologies – software, data and AI – taking a stronger and stronger hold of industry after industry, the stable business ecosystem is being disrupted in novel and unpredictable ways. Many companies find out the hard way that their customers never cared about their product. Instead, the customer has a need and your product happened to be the best way to meet that need. When a new entrant provides a new solution that meets the need better, your product is replaced with this new solution.

'Companies need to significantly increase the amount of exploration'

The only way to address this challenge is to significantly increase the amount of exploration your company conducts – we’re talking real exploration, where the outcome of efforts is unknown and where everyone understands that the majority of initiatives will fail. To achieve this, though, you need a game plan. This game plan needs to contain at least four elements: strategic resource allocation, reduced effort in commodity functionality, exploration of novel business ecosystems and/or new positions in the existing business ecosystem, and exploration of disruptive innovation efforts that are enabled by data and AI.

Many companies allocate the vast majority of their resources to their largest businesses. This makes intuitive sense, but fails to take a longitudinal perspective on the challenge of resource allocation. A model that can be very helpful in this context is the three horizons model, which structures the businesses the company is in into three buckets. Horizon one comprises the large, established businesses that pay the bills today. Horizon two comprises the new, rapidly growing businesses that are still much smaller than the horizon one businesses; these are intended to become the future horizon one businesses. Horizon three contains all the new, unproven innovation initiatives and businesses where it’s uncertain that things will work out, but that are the breeding ground for future horizon two businesses. Resource allocation should restrict horizon one to at most 70 percent of the total, horizon two should get up to 20 percent and at least 10 percent of the total company resources should be allocated to horizon three.
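For illustration only, with a made-up R&D headcount, the rule of thumb translates into a simple split:

```python
total_rd_staff = 1000  # hypothetical organization size

horizon_one = int(total_rd_staff * 0.70)    # at most 70% to established businesses
horizon_two = int(total_rd_staff * 0.20)    # up to 20% to rapidly growing businesses
horizon_three = total_rd_staff - horizon_one - horizon_two  # at least 10% to unproven initiatives

print(horizon_one, horizon_two, horizon_three)  # 700 200 100
```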

Within horizon one, each business should grow its resource usage slower than its revenue. That might even mean that a horizon one business growing at 5 percent per year cuts its resource usage by 5 percent per year, as this business is supposed to act as a cash cow for funding the development of future horizon one businesses.

In most companies, revenue and resource allocation are closely aligned with each other, but this is a mistake from a longitudinal perspective. A new business will require years of investment before it can achieve horizon one status and this new business can’t fund itself. Of course, you can have it bootstrap itself, but the result will typically be that competitors with a more strategic resource allocation will become the market leaders in these new businesses.

'Once you’ve defined the commodity, **stop** virtually all investment in it'

Second, reduce investment in commodity functionality. Our research shows that companies spend 80-90 percent of their resources on functionality and capabilities that customers consider to be commodity. I’ve discussed this in earlier blog posts and columns, but I keep getting surprised at the lack of willingness of companies to look into novel ways of reducing investment in places where it doesn’t pay off. Don’t be stupid and, instead, do a strategic review of your entire product portfolio and the functionality in your products and, together with customers and others, define what’s commodity and what’s differentiating. Once you’ve defined the commodity, **stop** virtually all investment in it. You need those resources for sustaining innovations that drive differentiation for your products.

Third, many companies consider their existing business ecosystem as the one and only way to serve customers. In practice, however, ecosystems get disrupted and it’s far better to be the disruptor than the disruptee. This requires a constant exploration of opportunities to reposition yourself in your existing ecosystem, as well as an exploration of novel ecosystems where your capabilities might also be relevant.

Finally, digital technologies – especially data and AI – offer new ways of meeting customer needs that you must explore in order to avoid being disrupted by, especially, new entrants. Accept that the value in almost every industry is shifting from atoms to bits, that data can be used to subsidize product sales in multi-sided markets, that AI allows for automation of tasks that were impossible to automate even some years ago and, in general, proactively explore the value that digital technologies can provide for you and your customers. This is where the majority of the resources that you freed up through horizon planning and reducing investment in commodity functionality should go.

Concluding, at the beginning of 2020, you need a game plan to significantly increase exploration at the expense of exploitation, in order to identify new opportunities, detect disruption risks and invest sufficiently in areas that offer an opportunity for growth. This requires strategic resource allocation, identifying and removing commodity, a careful review of your position in existing and new business ecosystems and major exploration initiatives in the data and AI space. It’s risky, it’s scary, most initiatives won’t pan out and customers, your shareholders and your own people will scream bloody murder. And yet, the biggest risk is to do nothing at all, as that will surely lead to your company’s demise. Will you allow that to happen on your watch?