Cognitive data mapping: machine learning to save you from tedious, time-consuming work

Faced with a mass of heterogeneous data sources, financial institutions increasingly depend on collecting the relevant data efficiently and accurately. This involves setting up a global schema in advance, corresponding to the field of interest, and then repeatedly mapping local schemas into it.



Xavier Zaegel - [Sponsoring] Partner - Financial Industry Solutions - Deloitte Tax & Consulting Sàrl

Fabian De Keyn - Director - Financial Industry Solutions - Deloitte Tax & Consulting Sàrl

Wen Qian - Senior Consultant - Financial Industry Solutions - Deloitte Tax & Consulting Sàrl

Published on 1 October 2019



Manual processing with spreadsheets consumes a significant amount of time and labor and is susceptible to human error. It also diverts resources from higher value-added activities.

One can always resort to rules-based tools for automation and risk mitigation. However, rules about patterns and exceptions pile up as the data volume grows, and at some point further maintenance becomes infeasible. What’s more, each new data source tends to come with its own dedicated tool, so inefficiencies and costs abound.

The use of machine learning and Cognitive Data Mapping can help with the heavy lifting.

Zooming in on a concrete example

As a provider of regulatory reporting services for funds, we have no control over how our clients prepare their data. “Fund Name” and “Issuer Name”, for instance, are terms used in our templates, but not necessarily in clients’ files. The goal is to locate these two columns regardless of the column headers. More specifically, trained models should be capable of categorizing a given column as “Fund Name”, “Issuer Name”, or “unknown”, relying solely on its contents. For those who are familiar with machine learning, this is simply a multiclass classification problem.

Step 1 - Feature extraction

A human would perform the task by capturing the presence of keywords. Machine learning algorithms do the same. However, instead of using knowledge and experience, predictions are inferred from labeled samples in the training dataset. Here we consider a small training dataset of 10 instances for each of the two classes. In practice, the size should be far greater.

Textual data must be converted into numerical inputs to be readable by machine learning algorithms. To quantify the data and identify the words that are critical for making classification decisions (known as features), we compute a score for every single word, after removal of non-alphabetic characters and semantic duplicates, that represents its importance.
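The cleaning step described above could be sketched as follows. This is a minimal illustration, not the production pipeline: the function name tokenize and the CANONICAL_FORMS table are hypothetical, and treating "semantic duplicates" as a simple lookup of word variants is an assumption made for the example.

```python
# Hypothetical lookup that collapses variants of a word into one canonical
# token (an assumed, simplified reading of "semantic duplicates").
CANONICAL_FORMS = {"incorporated": "inc", "funds": "fund"}


def tokenize(text):
    """Turn a raw cell value into a list of clean, lowercase word tokens."""
    # Replace every character that is neither a letter nor whitespace by a space.
    letters_only = "".join(
        ch if ch.isalpha() or ch.isspace() else " " for ch in text
    )
    tokens = letters_only.lower().split()
    # Map known variants onto their canonical form; keep other tokens unchanged.
    return [CANONICAL_FORMS.get(token, token) for token in tokens]


example_tokens = tokenize("DTT Inc. Group (2019)")
```

Using str.isalpha rather than an ASCII-only pattern keeps accented letters intact, which matters for names such as "Sàrl".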

On one hand, words that appear more frequently are supposed to be stronger candidates for features. On the other hand, the more common a term is across instances, the less informative it is. For example, one will not be able to judge if an instance is a “Fund Name” or an “Issuer Name” just by the occurrence of the term “DTT”. In other words, it is not a valid predictor. Inverse document frequency, IDF_t = log(Total number of instances / Number of instances containing t), measures the uniqueness of terms.

Combining term frequency (TF_t,i = Count of t in i / Total number of terms in i) with IDF makes an effective metric. The zero IDF score for term “DTT” drags it down to the bottom of the list despite its high TF score, and consequently disqualifies it as a feature.
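The two formulas can be checked on a toy example. The sketch below assumes each instance has already been split into lowercase tokens; the four-instance corpus and the function names are made up for illustration and are not the training data used in the article.

```python
import math


def term_frequency(term, tokens):
    """TF: count of the term in the instance / total number of terms in it."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def inverse_document_frequency(term, corpus):
    """IDF: log(total instances / instances containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    # A term never seen in the corpus carries no information here.
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, tokens, corpus):
    """Product of TF and IDF for one term in one instance."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Illustrative corpus: two fund names and two issuer names, all containing "dtt".
corpus = [
    ["dtt", "sicav", "fund"],
    ["dtt", "equity", "fund"],
    ["dtt", "inc", "group"],
    ["dtt", "holding", "group"],
]

dtt_score = tf_idf("dtt", corpus[0], corpus)    # log(4/4) = 0, so the product is 0
fund_score = tf_idf("fund", corpus[0], corpus)  # (1/3) * log(4/2)
```

Because "dtt" occurs in every instance, its IDF is log(1) = 0 and its TF×IDF score vanishes no matter how often it appears.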

Only a small selection of terms should be retained as features – the top 5 in our case – to avoid overfitting, a situation in which the model fits the training dataset very well but generalizes poorly. By calculating the TF×IDF score of each feature, we can represent any instance as a feature vector.
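A rough sketch of feature selection and vectorization is given below. The article does not say how per-instance scores are aggregated into one ranking, so summing TF×IDF over all training instances is an assumption here, as are the function names and the toy corpus.

```python
import math


def tf_idf(term, tokens, corpus):
    """TF x IDF of a term in one tokenized instance, relative to the corpus."""
    if not tokens:
        return 0.0
    tf = tokens.count(term) / len(tokens)
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf


def select_features(corpus, k):
    """Keep the k terms with the highest total TF x IDF (assumed aggregation)."""
    vocabulary = sorted({term for doc in corpus for term in doc})
    totals = {t: sum(tf_idf(t, doc, corpus) for doc in corpus) for t in vocabulary}
    return sorted(vocabulary, key=lambda t: totals[t], reverse=True)[:k]


def vectorize(tokens, features, corpus):
    """Represent an instance as one TF x IDF value per retained feature."""
    return [tf_idf(feature, tokens, corpus) for feature in features]


# Illustrative corpus only; not the training data from the article.
corpus = [
    ["dtt", "sicav", "fund"],
    ["dtt", "equity", "fund"],
    ["dtt", "inc", "group"],
    ["dtt", "holding", "group"],
]
features = select_features(corpus, k=6)
vector = vectorize(["dtt", "inc", "group"], features, corpus)
```

Terms that occur in every instance, such as "dtt" in this toy corpus, end up with a total score of zero and fall out of the retained set.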

Step 2 - Model training

An intuitive way to resolve a multiclass classification problem is to decompose it into a series of independent binary classification problems, one per category. Such a strategy is known as One-vs-All, where every classifier returns either 1 or 0, to indicate whether or not an unseen instance belongs to the class in question. In case multiple labels are assigned to a single instance, the one with the highest confidence rating will dominate.
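The decision rule can be sketched as below, assuming each binary classifier exposes a confidence between 0 and 1. The 0.5 cut-off and the fallback to “unknown” when no classifier answers "yes" are assumptions for this illustration, and the confidence values in the usage lines are placeholders rather than results.

```python
from typing import Dict


def resolve_label(scores: Dict[str, float], threshold: float = 0.5) -> str:
    """Combine one-vs-all confidences into a single predicted label.

    scores maps each class name to the confidence of its binary classifier.
    """
    # A classifier answers "yes" (1) when its confidence reaches the threshold.
    accepted = {label: p for label, p in scores.items() if p >= threshold}
    if not accepted:
        # Assumed fallback: no classifier claims the instance.
        return "unknown"
    # Several classifiers may say "yes"; the most confident one dominates.
    return max(accepted, key=accepted.get)


# Placeholder confidences, for illustration only.
single_yes = resolve_label({"Fund Name": 0.10, "Issuer Name": 0.82})
both_yes = resolve_label({"Fund Name": 0.90, "Issuer Name": 0.60})
no_yes = resolve_label({"Fund Name": 0.20, "Issuer Name": 0.28})
```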

In our research, we adopted Logistic Regression for the implementation of the unit classifiers, considering its probabilistically interpretable nature and simplicity. Feature vectors extracted previously now play the role of independent variables. Model parameters are acquired through Maximum Likelihood Estimation.
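For readers who want to see the mechanics, here is a minimal logistic regression fitted by gradient ascent on the log-likelihood, which is one common way to carry out Maximum Likelihood Estimation. The article does not specify the optimizer, so the learning rate, epoch count, function names, and the three-feature toy vectors are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient ascent on the log-likelihood."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            # Gradient of the log-likelihood: (label - predicted probability) * x.
            error = target - predict_proba(weights, bias, x)
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w + learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b / len(X)
    return weights, bias


# Toy feature vectors (illustrative): label 1 = "Issuer Name", 0 = anything else.
X = [[0.0, 0.7, 0.6], [0.0, 0.5, 0.8], [0.6, 0.0, 0.0], [0.8, 0.0, 0.0]]
y = [1, 1, 0, 0]
weights, bias = fit_logistic(X, y)
p_issuer_like = predict_proba(weights, bias, [0.0, 0.7, 0.6])
p_fund_like = predict_proba(weights, bias, [0.6, 0.0, 0.0])
```

One such classifier would be trained per class, each with its own labels, to obtain the One-vs-All ensemble.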

Back to our example: when queried with two instances, “DTT Inc Group” and “DTT Sicav fund”, the “Issuer Name” classifier answers “yes” and “no” respectively, based on the probabilities of them being issuer names, 82% and 28%, which are derived from their feature vectors, [0% 73% 68% 0% 0%] and [60% 0% 0% 80% 0%].

Step 3 - Prediction making

This methodology can be applied to concrete solutions, by storing the trained models in binary format and using them to produce reports based on structured user data. Cognitive Data Mapping, once applied to this data, can generate information including confidence levels of mapped fields, unrecognized columns from the source data, and unmapped fields in the template. Human input can then be added to refine the findings by correcting any mismatches. Once the mappings have been verified, desired reports can be generated automatically via downstream functionalities.
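A simplified sketch of this step follows. It assumes Python's pickle module as the binary format and a plain dictionary as the report structure; neither is confirmed by the article, and the function name, threshold, column names, and confidence values are hypothetical.

```python
import pickle


def build_mapping_report(column_scores, template_fields, threshold=0.5):
    """Summarize a mapping run.

    column_scores: {source column: {template field: confidence}}.
    Returns mapped fields with confidence levels, unrecognized source
    columns, and template fields left unmapped.
    """
    mapped = {}
    unrecognized = []
    for column, scores in column_scores.items():
        if not scores:
            unrecognized.append(column)
            continue
        field, confidence = max(scores.items(), key=lambda item: item[1])
        if confidence >= threshold:
            mapped[column] = (field, confidence)
        else:
            unrecognized.append(column)
    mapped_fields = {field for field, _ in mapped.values()}
    unmapped = [f for f in template_fields if f not in mapped_fields]
    return {
        "mapped": mapped,
        "unrecognized_columns": unrecognized,
        "unmapped_fields": unmapped,
    }


# Storing and restoring a trained model (placeholder parameters) in binary form.
model = {"label": "Issuer Name", "weights": [0.0, 1.2, 1.1], "bias": -0.4}
restored = pickle.loads(pickle.dumps(model))

# Placeholder confidences for two client columns.
report = build_mapping_report(
    {"Col A": {"Fund Name": 0.9, "Issuer Name": 0.1},
     "Col B": {"Fund Name": 0.2, "Issuer Name": 0.3}},
    template_fields=["Fund Name", "Issuer Name"],
)
```

A reviewer would then correct any mismatches in the report before the downstream report generation runs.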


Using machine learning engines, Cognitive Data Mapping overcomes the limitations of traditional approaches and offers a multitude of benefits.

  • Trimmed down operating costs & reduced operational risk: After the one-off deployment stage, which typically takes two to three weeks, obtaining each mapping structure only takes a few clicks. There is no compromise on accuracy, as domain experts are involved during model training, and they validate the outputs.

  • Optimized resource allocation: Data mapping generally ties up a lot of expertise. Outsourcing it to AI frees up talent to focus on the core business.

  • Sustained data capacity: In stark contrast to conventional methodologies, machine learning works better with scale. It only gets smarter and more agile as it is fed with more data. This enables organizations to keep up with explosive data growth.

  • Boosted business expansion: Adaptability to all kinds of data formats, column distributions and naming conventions ensures timely client onboarding, thereby unlocking a huge upside potential.

  • Enhanced data normalization & centralization: Disparate data provides limited value until it is connected, aggregated and analyzed as a whole. Database migration could be overwhelming without a flexible system to handle the various data silos.

