Working with data
Data analyses are only truly informative if they rest on three basic pillars, each of which is necessary for efficient data processing. What are they?
The first pillar is a list of questions we want to answer.
When working with data, we are essentially having a live conversation. We ask the data questions, and it gives us answers. A person can only answer a question if they have the necessary information. Likewise, a dataset can only answer if it contains the right records and variables. This means that we need to consider carefully which questions we want answered before we even start collecting data. In effect, we work backwards.
First, we write down which reports, supported by real data, we want to obtain. Then we decide which records and variables we need to capture and analyse to produce each report or output.
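The backwards approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report names, the variable names and the helper functions are all hypothetical and would be replaced by whatever your own reports require.

```python
# Hypothetical reports and the variables each one needs (illustrative names only).
REPORT_REQUIREMENTS = {
    "occupancy_by_month": {"arrival_date", "departure_date", "room_id"},
    "revenue_by_country": {"country", "total_price"},
}


def required_variables(reports):
    """Return the union of all variables needed to produce the given reports."""
    needed = set()
    for name in reports:
        needed |= REPORT_REQUIREMENTS[name]
    return needed


def missing_variables(reports, available_columns):
    """Return the variables the reports need that the dataset does not capture."""
    return required_variables(reports) - set(available_columns)


# Columns the (assumed) booking system currently records.
available = ["arrival_date", "departure_date", "room_id", "total_price"]
gaps = missing_variables(["occupancy_by_month", "revenue_by_country"], available)
print(sorted(gaps))  # ['country']
```

Starting from the reports makes the gap visible before collection begins: here the revenue report could not be produced because the guest's country is never recorded.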
In short, it is always better to request and store all available data in the system (database). Collecting partial information ad hoc for one specific output tends to prompt additional questions, and answering them would force us to document and collect further data. A partial report can always be produced from the fullest possible set of data, so collecting “all” data ultimately pays off in both time and money. On the other hand, a partial database is definitely better than no database, provided you know the limits of what it can tell you.
The second pillar is that data tends to be unsorted and needs cleaning.
This is usually the largest and most demanding part of the overall job. A simple example shows what it involves. Creating a directory of clients in your hotel booking system usually includes filling in the “title” box (leaving academic degrees aside). Typical entries are no title (blank), “Mr” and “Ms”. If there is no drop-down menu from which to select an option, we depend on who entered the titles and how. “Mr” then turns into abbreviations and misspellings, and a whole range of variants opens up, e.g. “mr.” or “M”, or the statistical designation “0”. Therefore, we first have to standardise data collection, that is, assign one uniform entry variant to each value. The existing data then only needs to be converted to this form.
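Converting existing entries to one uniform variant might look like the following minimal sketch. The mapping table and the function name are assumptions made for illustration. In particular, treating “M” and “0” as variants of “Mr” follows the example above and may not match how your own system codes titles.

```python
# Assumed mapping from cleaned raw entries to one canonical form per title.
# "m" and "0" are treated as variants of "Mr", following the example in the text.
CANONICAL_TITLES = {
    "": "",
    "mr": "Mr",
    "m": "Mr",
    "0": "Mr",
    "ms": "Ms",
}


def normalise_title(raw):
    """Return the canonical title, or None if the entry needs manual review."""
    # Treat a missing value as blank, then ignore case, spaces and a trailing dot.
    key = (raw or "").strip().rstrip(".").lower()
    return CANONICAL_TITLES.get(key)


for entry in ["Mr", " mr. ", "M", "0", "", "Dr"]:
    print(repr(entry), "->", repr(normalise_title(entry)))
```

Returning `None` for anything outside the mapping, rather than guessing, keeps unknown variants visible so that a person can decide how to handle them.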
You can quickly find out how inconsistent the database is by exporting it to a spreadsheet. Even for a simple field such as the title, the result is immediately visible.
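The same check can be run on an export with a short script instead of by eye. The sketch below assumes the export is a CSV file with a `title` column. The sample rows are made up for illustration, and `title_variants` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up export for illustration; in practice this would be the file
# extracted from the booking system.
SAMPLE_EXPORT = """client_id,title
1,Mr
2,mr.
3,M
4,
5,Ms
6,Mr
7,0
"""


def title_variants(csv_file, column="title"):
    """Count how often each distinct raw value occurs in one column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


counts = title_variants(io.StringIO(SAMPLE_EXPORT))
for value, n in counts.most_common():
    print(repr(value), n)
```

A field that should hold three possible values but shows six distinct ones, as in this sample, is a quick signal of how much cleaning lies ahead.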
For data that is transferred automatically to the hotel directory from the online booking system, it is a good idea to use drop-down menus wherever possible, so that the client only selects and clicks the given option. For names and addresses, it is advisable to include a legend showing the form the data should take. For example, ask for the first name and surname exactly as written in the travel document or ID card, without abbreviating middle and other names. The same applies to the address. If a piece of information is not known, the option “unknown” should be chosen or the field left blank. This prevents duplicate and multiplied client cards and other misinformation. If any data from the booking does not match at check-in, it must always be corrected.
The third pillar is that data can have unsubstantiated features.
Therefore, before processing the data, we must make sure, for example, that no new code that is not specified in the directive was created during entry, that the system operator did not fill in an “unknown” field with some other data “just so there is something there”, that there was no functional change in the system itself, and so on. For the same reason, we must first look at the result of the analysis through the filter of common sense: “Does it make sense? Does this output sound probable?”
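Two of these checks lend themselves to a small script: flagging values that the directive does not allow, and testing a result for plausibility. The sketch below is illustrative only. The allowed codes, the sample records and the occupancy-rate example are assumptions, not part of any particular booking system.

```python
# Assumed list of codes permitted by the directive (illustrative values).
ALLOWED_TITLES = {"", "Mr", "Ms", "unknown"}


def unexpected_values(records, field, allowed):
    """Return the records whose value for `field` is not an allowed code."""
    return [r for r in records if r.get(field) not in allowed]


def plausible_occupancy(rate):
    """Common-sense check: an occupancy rate must lie between 0 and 1."""
    return 0.0 <= rate <= 1.0


# Made-up client records; "Mx" stands in for a code created during entry.
records = [
    {"client_id": 1, "title": "Mr"},
    {"client_id": 2, "title": "Mx"},
    {"client_id": 3, "title": "unknown"},
]

flagged = unexpected_values(records, "title", ALLOWED_TITLES)
print([r["client_id"] for r in flagged])  # [2]
print(plausible_occupancy(1.4))  # False
```

Neither check replaces human judgement. They only surface the entries and results that deserve a second look before the analysis is trusted.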