It has been said many times and in many ways that "Data is at the heart of it all", but what if there is no data, or it is scattered around the organization in different formats and in people's heads?
You would be forgiven for asking, "Am I the only one, and what can I do about it?"
At the outset, I can reassure you that you are by no means alone, and that this can be remedied. The availability and quality of data is an essential consideration in any analytics project. We often face related issues, and optimization projects almost always address these challenges in one way or another.
How important is the data?
Optimization is clearly part of the advanced analytics toolkit, in particular prescriptive analytics, and data is of course of great importance there. In our experience, an optimization project delivers its best results when three things are in place: sufficient communication to ensure the problem to be optimized is understood, the ability to develop a corresponding mathematical model, and an optimization algorithm that receives high-quality data as input. If any one of these is missing, there is a risk that the end result of the project will not meet the goals set for it.
Indeed, the quality of the results generated by an optimization solution is often directly proportional to the quality of the data: poor or inaccurate data is likely to yield out-of-date results, or even output that drives wrong business decisions. That is why it is worth investing in data quality, pre-validation and comprehensive testing in any optimization project.
On the other hand, if decisions in your organization are already being made with incomplete data, optimization is unlikely to make the situation worse. Once an optimization project starts, the importance of data quality, and of developing that quality, becomes concrete and topical, and the value of the data itself is better understood. Data is not the new 'oil', i.e. something that has value in itself; it is the analytics project that creates the business purpose and a pull for high-quality data, which in turn brings more focus to data collection and quality.
What data does the optimization need?
Once an optimization project has started, the business need has been defined and a mathematical model has been formulated, the question soon arises: what data does the optimization need, and what data already exists?
Experts from both business and IT are usually needed to answer these questions. Discussions with the end users clarify what information they use in their work and what else should be taken into account. In practice, the people involved are usually those who currently manage the process to be optimized manually, and the discussion focuses on the concrete details behind the goals, constraints and decisions of the mathematical model. Once this information has been written down, business and IT experts work together to locate where the data can be found.
Different file formats, both possible and impossible
After locating the data, the usual finding is that not all of it is available in a beautifully structured form over an API. It is here that the right advice becomes highly valuable.
At least in theory, every file format can be read. In practice, some of the less common file formats have few or poorly maintained packages for parsing the data. Files stored in a purely binary format (such as old Excel files) are also laborious, requiring either a library that can open the file outside of Excel or a file format conversion. A third tedious category is file types that represent non-text data (e.g., image files).
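As an illustration of the binary Excel case, here is a minimal Python sketch that opens a legacy .xls file outside of Excel using the xlrd library. The file name and sheet layout are hypothetical examples, not from any actual project.

```python
# Minimal sketch: reading a legacy binary .xls file outside Excel.
# Requires the xlrd package (pip install xlrd), which supports the
# old binary .xls format. "orders.xls" is a hypothetical file name.
import xlrd

book = xlrd.open_workbook("orders.xls")
sheet = book.sheet_by_index(0)

# Collect each row as a plain list of cell values.
rows = [sheet.row_values(i) for i in range(sheet.nrows)]
for row in rows[:5]:
    print(row)
```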
Another common challenge is that the data is not in a machine-readable form. A typical example is an Excel sheet with hand-compiled data scattered all over the place without any sensible logic. Or deriving the data requires complex reasoning, which rarely results in completely error-free data. Incorrect or unknown character encodings also cause headaches.
The easiest file formats are text files of manageable size (e.g., CSV), which can also be inspected visually for error checking.
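To give an idea of how light this can be, below is a minimal Python sketch that guesses an unknown encoding and then reads a CSV file. The file name, the semicolon separator and the use of the chardet library are assumptions for illustration, not a prescription.

```python
# Minimal sketch: reading a CSV with an unknown encoding.
# chardet (pip install chardet) is one common way to guess the
# encoding; "orders.csv" and the ";" separator are hypothetical.
import chardet
import pandas as pd

# Guess the encoding from a sample of raw bytes before parsing.
with open("orders.csv", "rb") as f:
    guess = chardet.detect(f.read(100_000))

df = pd.read_csv("orders.csv", sep=";", encoding=guess["encoding"])

# A quick look at the first rows and inferred types doubles as a
# lightweight visual error check.
print(df.head())
print(df.dtypes)
```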
There is, however, no reason to give up even if the data proves difficult to acquire. Weoptit's senior developers have dug up data for projects from, for example, Google Maps street imagery, and have let their imagination run wild to find relevant data sources. One really old-school way of acquiring data involved order data printed on strips by an old dot matrix printer: the strips were first read in with a scanner and then processed with a text recognition application.
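For the curious, here is a hedged sketch of that scanner-plus-text-recognition idea in Python, using pytesseract as one possible OCR tool; the original project may well have used different software, and the file name is made up.

```python
# Hedged sketch: extracting text from a scanned image with OCR.
# Uses pytesseract (pip install pytesseract), a Python wrapper for
# the Tesseract OCR engine, which must be installed separately.
# "scanned_order_strip.png" is a hypothetical file name.
from PIL import Image
import pytesseract

# Read a scanned image of a printed order strip and extract its text.
image = Image.open("scanned_order_strip.png")
text = pytesseract.image_to_string(image)
print(text)
```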
What concrete options are there to improve the data?
There are several options for improving data quality; the most prominent are Master Data Management projects and centralized data warehousing. If these have already been done, or you need to get moving on a smaller scale, take a look at the practical tips below. I compiled them together with Weoptit's data analytics & optimization troopers, and I hope they will help you. However, keep in mind what you are doing and how each action will change your data: you can try everything, as long as you remember to evaluate and test the results of the experiment.
Invalid data is not critical as long as it is identified in a timely manner. That is why we often include various data pre-validations and checks in our optimization solutions, so that the data that ends up in the optimization, and thus the results, can be trusted.
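To give a flavor of what such pre-validation can look like, here is a minimal Python sketch. The column names and limits are hypothetical examples; the real checks always come from the business rules behind the model.

```python
# Minimal sketch: pre-validating input data before optimization.
# Column names ("order_id", "volume", "deadline") and the limits
# below are hypothetical examples.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in the input data."""
    problems = []
    # Required fields must be present and non-empty.
    for col in ("order_id", "volume", "deadline"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isna().any():
            problems.append(f"empty values in column: {col}")
    # Identifiers must be unique.
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    # Numeric values must fall within a plausible range.
    if "volume" in df.columns and (df["volume"] <= 0).any():
        problems.append("non-positive volumes")
    return problems

problems = validate_orders(pd.read_csv("orders.csv"))
if problems:
    # Stop before invalid data reaches the optimization.
    raise ValueError("; ".join(problems))
```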
Prediction, simulation and data on “human hard drives”
Prediction and simulation are also methods that are sometimes applied during optimization projects. Simulation can be used to create, for example, map-based data, while prediction can supply numerical data. For numerical data to be relevant, it is a good idea to focus on producing data whose range and distribution can be understood and evaluated in advance. A classic example of simulation is a randomly generated set, such as a set of random orders with random time windows for a specific area. However, it is good to remember that fine-tuning the optimization model should not be done with such a set; that requires real data, either during the development project or, at the latest, as part of the deployment project.
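To make this concrete, below is a minimal Python sketch that generates such a random order set. The coordinate bounds (a rough box around Helsinki) and the time window lengths are hypothetical examples chosen so that the range and distribution are easy to evaluate in advance.

```python
# Minimal sketch: a random set of orders with random time windows
# for a specific area. Bounds and window lengths are hypothetical.
import random

def random_orders(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    orders = []
    for i in range(n):
        # Random location inside a rough bounding box around Helsinki.
        lat = rng.uniform(60.15, 60.30)
        lon = rng.uniform(24.80, 25.10)
        # Random one-to-three-hour delivery window during the working day.
        start = rng.uniform(8.0, 15.0)   # hours since midnight
        length = rng.uniform(1.0, 3.0)
        orders.append({
            "id": i,
            "lat": round(lat, 5),
            "lon": round(lon, 5),
            "window": (round(start, 2), round(start + length, 2)),
        })
    return orders

print(random_orders(3))
```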
In the case of incomplete data, the reality is sometimes that important information lives on "human hard drives", i.e. in the heads of skilled workers. In that case, the project cannot proceed until the tacit knowledge has been transferred into a structured form. There are many ways to do this: experts can be interviewed, or a ready-made framework can be created into which they enter the necessary information. In optimization and prescriptive analytics projects, the more common outcome is that manual tasks are partially delegated to the computer. The role of the experts changes, and they can spend their time on more productive tasks while the machine handles the routine work. It is important to communicate openly about the future way of working so that experts are motivated to share their experience with the project.
In case you encounter insurmountable data quality issues, please remember that we can also help you with those.
The author of the blog warmly thanks Weoptit’s senior developers Pekka, Jukka and Patrik for their content input and concrete examples.