Preparing Organisational Data for use with AI

Artificial Intelligence Solutions Require High-Quality Data

Anthony Quattrone, PhD 1 May 2022

Organisational data is captured and stored in many formats, from spreadsheets and Word documents to relational databases and text files. Leveraging that data involves a series of pre-processing steps to make it suitable for use in business intelligence systems for reporting and analytics. AI systems go a step further: they require carefully curated training datasets tailored to the problem at hand.

Preparing organisational data for use in artificial intelligence systems typically requires a series of Extract-Transform-Load (ETL) processes to produce a training dataset before it can be fed into an AI. In most organisations, privacy laws and regulations must be satisfied before extraction can occur, and once extraction is complete, strict storage and handling rules must be followed to ensure the data is stored and used securely.
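
To make the idea concrete, the following is a minimal ETL sketch in Python using pandas; the file names, column names and the personal-information column are hypothetical placeholders, not a prescribed pipeline.

    # A minimal extract-transform-load (ETL) sketch using pandas.
    # File names, column names and the filtering rules are hypothetical.
    import pandas as pd

    # Extract: read raw records exported from a source system.
    raw = pd.read_csv("exports/sales_raw.csv")

    # Transform: drop incomplete rows, normalise a date column and
    # remove a column containing personal information before the data
    # leaves the controlled environment.
    clean = (
        raw.dropna(subset=["order_id", "amount"])
           .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
           .drop(columns=["customer_email"])
    )

    # Load: write the prepared records to a staging area that the
    # training pipeline reads from.
    clean.to_parquet("staging/sales_training.parquet", index=False)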

There are vast quantities of data in the modern organisational environment, much of it unstructured and not in a format that is easy to work with, which presents a real technical challenge. Data preparation becomes even more complex when the data is not static but changes continuously in real time, requiring dynamic, automated processes.

We will explore key data considerations in the following sections.

Common Organisational Data Sources

Data is stored in various formats and covers a multitude of dimensions, from financial figures to spatial information. Data captured in office productivity suites such as Microsoft Office, or in internal fit-for-purpose source systems, does not lend itself well to direct use in artificial intelligence systems.

The following are familiar data sources; the list is by no means exhaustive:

  • Financial Data from ERP Accounting Systems (Oracle, SAP)
  • Spatial Data from GIS Systems (ESRI ArcGIS)
  • Spreadsheets from Office Productivity Tools (Microsoft Excel, Microsoft Access)
  • Custom SQL databases used behind source systems (Microsoft SQL, MySQL, Oracle, SAP)
  • Flat file databases captured in legacy systems (IBM Mainframes, Indexed files)

Different systems store data in different formats, and datasets often need to be joined, which presents challenges when multiple systems are involved. It is also common for data analysts to enter information manually using spreadsheets. The current trend is to land data in a data lake so that data engineers can work with it without interfacing directly with critical systems, applying whatever transformations are needed to achieve the organisation's goals.
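
As a simple illustration, the sketch below joins extracts from two hypothetical source systems once they have landed in the lake; the file names, key column and tooling are assumptions for the example only.

    # A minimal sketch of joining extracts from two separate systems.
    # The file names and the shared business key are hypothetical.
    import pandas as pd

    # Extracts landed in the data lake from two different source systems.
    finance = pd.read_csv("lake/erp_invoices.csv")   # e.g. from an ERP system
    crm = pd.read_csv("lake/crm_customers.csv")      # e.g. from a CRM system

    # Join on a shared business key so the combined view can be analysed
    # without touching the critical source systems directly.
    combined = finance.merge(crm, on="customer_id", how="left")

    combined.to_parquet("lake/curated/invoices_by_customer.parquet", index=False)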

Artificial intelligence systems can make great use of this data, but only once it has been processed into a suitable format; this is where data lakes and data warehouses are crucial in producing high-quality datasets.

Artificial Intelligence and its relation to Extract-Transform-Load (ETL) processes

Traditional ETL processes are unlikely to change fundamentally as artificial intelligence becomes more prominent. It is more likely that these techniques will be retargeted to produce datasets that are conducive to learning and work well with AI systems. An example is taking photos of objects and labelling each with the class it belongs to so that an AI system can learn from them.
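
A minimal sketch of that labelling step is shown below, assuming the photos are already grouped into one folder per object class; the folder layout and output file are hypothetical.

    # Turn a folder of photos into a labelled training manifest,
    # assuming a layout of photos/<label>/<image>.jpg.
    from pathlib import Path
    import csv

    image_root = Path("photos")

    with open("labels.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label"])
        for image_path in image_root.glob("*/*.jpg"):
            # The parent folder name serves as the label for the image.
            writer.writerow([str(image_path), image_path.parent.name])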

There are great opportunities for data scientists and data engineers to use their data preparation skill sets to build datasets for artificial intelligence systems. To gain the most efficiency out of real-time artificial intelligence systems, it will be important that ETL processes are automated and do not rely on manual steps.

Data Lakes and Data Warehouses as a Single Source of Truth for use in Artificial Intelligence

Raw data stored across different systems results in fragmentation. To overcome this, it is desirable to pipe all data into one location, such as a relational database that allows queries and data manipulation. Once all the data is stored in one area, it can be more easily accessed and worked with to produce datasets that yield valuable information. It is essential to define a single source of truth.

Data warehouses can then be designed using methodologies such as Kimball or Inmon, organising the data into dimensions and facts or measures. In general terms, a dimension holds categorical data used for slicing and grouping, while a fact or measure is typically numerical. Modelling data using such standards offers significant benefits in efficiency and accuracy.
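
As a small illustration of the idea, the sketch below builds a toy fact table and dimension table in pandas and aggregates a numeric measure by a categorical attribute; all table and column names, and the figures, are hypothetical.

    # A minimal star-schema sketch: a fact table of numeric measures
    # joined to a categorical dimension table.
    import pandas as pd

    dim_product = pd.DataFrame({
        "product_key": [1, 2, 3],
        "category": ["Hardware", "Software", "Software"],
    })

    fact_sales = pd.DataFrame({
        "product_key": [1, 2, 2, 3],
        "sales_amount": [120.0, 75.0, 30.0, 210.0],   # the measure
    })

    # Join the fact to its dimension and aggregate the measure by the
    # dimension's categorical attribute.
    report = (
        fact_sales.merge(dim_product, on="product_key")
                  .groupby("category", as_index=False)["sales_amount"].sum()
    )
    print(report)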

Perhaps the most significant advantage for organisations that have invested in a good data warehouse is that it opens up the organisational dataset to the broader organisation. In large organisations, most employees do not have access to the critical source systems that run the business; they can, however, be given access to the data warehouse, typically read-only. This allows employees to identify insights that the organisation's management structure may not otherwise be aware of.

Building a data warehouse also forces privacy and regulatory considerations to be defined up front, and helps ensure that data is transferred safely between stakeholders. Access to data lakes and warehouses can also improve the transparency and accountability of how organisational functions are carried out, allowing for more stable operating procedures.

The Visualisation of Organisational Big Data

A challenge of big data is how best to view it and convey the story it tells. Earlier approaches included reporting services that aggregate the data from lower to higher levels for display in standard charts such as bar charts, line charts, and scatter plots. These approaches are suitable for management reports (e.g. sales reports, account reports) that are part of daily business-as-usual operations. Microsoft SSRS is among the most widely used tools for enterprise-wide reporting of this kind.
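
For routine management reporting, a standard chart is often all that is required; the sketch below uses matplotlib with purely illustrative figures.

    # A minimal management-report chart; the figures are illustrative only.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 135, 128, 150]   # hypothetical monthly sales totals ($k)

    plt.bar(months, sales)
    plt.title("Monthly Sales")
    plt.ylabel("Sales ($k)")
    plt.savefig("monthly_sales.png")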

Advanced visualisation programs emerged to address the gap left by traditional reporting, with Tableau and QlikView leading the market for some time. Tableau focused heavily on striking visualisations, while QlikView struck a balance between traditional reporting services such as Microsoft SSRS and the more visual approach of Tableau. More recently, Microsoft Power BI has captured a large share of the market and is rated highly by analysts such as Gartner. These programs create dashboards that are extremely useful for monitoring multiple key metrics and for integrating such monitoring into broader organisational processes. Strategic decision-makers increasingly use dashboards to make data-driven decisions, while operations managers can respond faster to achieve corporate objectives.

With the advent of AI, visualisation will play a vital role. The insights that artificial intelligence systems produce are complex and need to be communicated in a visual representation that people can easily understand. An excellent example is presenting a self-organising map (SOM) to view multivariate data.
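
A minimal SOM sketch is shown below, assuming the open-source MiniSom package (pip install minisom) and purely illustrative random data; in practice the input would come from the prepared datasets described above.

    # Visualise multivariate data with a self-organising map (SOM).
    import numpy as np
    import matplotlib.pyplot as plt
    from minisom import MiniSom

    data = np.random.rand(200, 4)          # 200 records, 4 variables (illustrative)

    som = MiniSom(10, 10, 4, sigma=1.0, learning_rate=0.5)
    som.train_random(data, 1000)

    # The distance map (U-matrix) shows cluster structure as a heat map
    # that non-specialists can read at a glance.
    plt.imshow(som.distance_map(), cmap="bone_r")
    plt.colorbar()
    plt.savefig("som_u_matrix.png")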

Coalescing Data to Feed into Artificial Intelligence Systems

Given access to datasets, it is then possible to take data from a relational database and provide a connector to an artificial intelligence system. Most modern AI systems are built using Python and rely on modules typically implemented in C/C++ to ensure efficiency.

As Python is currently the primary tool for interfacing with AI, a rich set of data connectors is available for many different types of databases. Python also lends itself well to data manipulation, and libraries such as NumPy and Pandas extend its native functionality to help pre-process data before it is fed into a specific AI system. Current frameworks are particular about the data formats they accept, and statically typed data structures can help here. GPU processing also requires specific data types, which are unlikely to change, so data type considerations must be made ahead of time.
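
The sketch below illustrates the idea: data is read from a relational database through a Python connector and cast to the 32-bit floats most GPU frameworks expect. The connection string, table and column names are hypothetical.

    # Read training data from a warehouse and fix data types up front.
    import numpy as np
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@warehouse-host/analytics")

    frame = pd.read_sql("SELECT feature_a, feature_b, label FROM training_view", engine)

    # Most GPU-based frameworks expect contiguous float32 arrays, so the
    # data types are set here rather than inside the model code.
    features = frame[["feature_a", "feature_b"]].to_numpy(dtype=np.float32)
    labels = frame["label"].to_numpy(dtype=np.float32)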

Artificial Intelligence systems in the narrow-AI field have specific data requirements, and it is worth taking the time to consider this during the planning stages of creating datasets to be used in said systems.

Capturing, Storing and Interpreting Artificial Intelligence Results

Given their inputs, artificial intelligence systems will produce outputs that themselves need to be stored. More excitingly, the results can be fed back into the data lakes and data warehouses, continuing the process of delivering insights, as insights can yield further insights. How outputs are stored will need to be considered carefully within a larger data governance framework.
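
A minimal sketch of that feedback step is shown below: model outputs are written back into the warehouse so downstream reporting and further analysis can pick them up. The connection string, table name and scores are hypothetical.

    # Write model outputs back into the warehouse for governed reuse.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@warehouse-host/analytics")

    predictions = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "churn_score": [0.12, 0.87, 0.45],   # illustrative model outputs
    })

    # Append the scored records to a results table governed alongside the
    # rest of the warehouse.
    predictions.to_sql("ai_churn_scores", engine, if_exists="append", index=False)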

Given that artificial intelligence systems process vast quantities of information, they are likely to surface counter-intuitive insights that a human would typically miss. It is usually these insights that produce the most significant competitive advantage, so organisations will have little choice but to engage with these systems in order to remain competitive.

Interpretation of artificial intelligence results will require careful consideration. Just as research findings can be misread today, AI outputs can be misinterpreted. Data analysts must therefore be able to trace through the underlying data points and explain why an AI system has yielded a specific finding, or risk acting incorrectly on an insight. The data visualisation tools described above can equally be applied to results generated by AI.

Moving into the coming decades, organisations will increasingly rely on information produced by AI systems, and how the data underpinning these systems is governed and implemented will be of the utmost importance.

Contact us today for a free consultation on how the Telemus AI™ can be integrated into your organisation.
