The importance of dataset evaluation for AI development

The author is the founder and CEO of Gravy Analytics, a leading provider of business location information.

One of the biggest challenges in developing and scaling artificial intelligence is ensuring that the data used to train AI algorithms is accurate and timely. Using bad data hinders an organization’s ability to predict future trends and make meaningful business decisions.

Only when high-quality datasets drive AI algorithms can businesses gain valuable insights that help them make critical decisions, improving customer experience, brand loyalty, supply chain flow and much more.

At its core, an accurate data set provides a representative sample of the population. With 97% of the population owning some sort of mobile device and 85% owning a smartphone, location data is well placed to provide an accurate record of a given population’s activities, especially when used to supplement other data sources.

Location data can be used to gather information about many different parts of a business, such as in-store traffic, consumer purchasing behaviors, or supply chain performance. Insights derived from location data can provide businesses with up-to-date, real-world context for how their customers interact with their company in the physical world.

Additionally, businesses can use this information to track trends and back decisions with data, from monitoring their supply chain operations to choosing the location of their next store. Ensuring that the data sets used to train these algorithms do not introduce bias is of the utmost importance, not only for the accuracy of the results but also for the reputation of the business.

How to evaluate data sets

The overall AI category is projected to grow more than 25% by 2026. Industries that want to adopt these predictive systems must ensure that data-savvy teams are implementing them. Unfortunately, not every company has data engineers or data scientists to help assess the quality of data selected for machine learning applications. When evaluating data, businesses should consider four characteristics: where the data comes from, how accurate it is, how much of it there is, and how current it is.


Do you know the source of the data? Can your data provider ensure it is not fraudulent? Can you slice the data set for your analysis? What features does the data set include? These questions help determine whether a particular data set is reliable and can yield useful insights.


Has the data been verified for accuracy? How was it qualified for inclusion in this particular data set? Does the dataset include tags and metadata to aid analysis? These questions are key to ensuring your algorithm returns accurate and relevant results.


Is the data set large enough to authentically represent the desired population and customer base? This question helps verify that the source will include sufficient data to be scientifically reliable.


When was the data collected? How often is it updated or refreshed? What checks are in place to remove old or outdated data? These questions are key to ensuring that insights, and the decisions based on them, are still relevant, especially when you consider how many times consumer behavior has shifted in the last couple of years.

Near-real-time data is vital, so you don’t train AI on old, outdated data. In fact, a McKinsey survey found that nearly a third (32%) of sales and marketing executives who adopted AI during the Covid pandemic said their machine learning models failed because they were based on pre-pandemic data.
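The four sets of questions above can be turned into simple automated checks. The following is a minimal sketch in Python, assuming each record is a dictionary with hypothetical `device_id`, `source` and `collected_at` fields. The field names, the 90-day freshness window and the minimum record count are illustrative assumptions, not values from any particular data provider.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; tune them to your own use case.
MAX_AGE = timedelta(days=90)
MIN_RECORDS = 3
REQUIRED_FIELDS = ("device_id", "source", "collected_at")


def evaluate_dataset(records, now):
    """Return a small report covering source, completeness, size and freshness."""
    total = len(records)
    # Source: how many records say where they came from?
    with_source = sum(1 for r in records if r.get("source"))
    # Accuracy proxy: how many records have every expected field populated?
    complete = sum(
        1 for r in records
        if all(r.get(k) is not None for k in REQUIRED_FIELDS)
    )
    # Timeliness: how many records fall inside the freshness window?
    fresh = sum(
        1 for r in records
        if r.get("collected_at") is not None
        and now - r["collected_at"] <= MAX_AGE
    )
    return {
        "total": total,
        "large_enough": total >= MIN_RECORDS,
        "share_with_source": with_source / total if total else 0.0,
        "share_complete": complete / total if total else 0.0,
        "share_fresh": fresh / total if total else 0.0,
    }


# Made-up sample records, for illustration only.
utc = timezone.utc
now = datetime(2022, 6, 1, tzinfo=utc)
sample = [
    {"device_id": "a", "source": "app_sdk", "collected_at": datetime(2022, 5, 20, tzinfo=utc)},
    {"device_id": "b", "source": "app_sdk", "collected_at": datetime(2019, 11, 2, tzinfo=utc)},
    {"device_id": "c", "source": None, "collected_at": datetime(2022, 5, 30, tzinfo=utc)},
    {"device_id": "d", "source": "app_sdk", "collected_at": datetime(2022, 4, 15, tzinfo=utc)},
]
report = evaluate_dataset(sample, now)
print(report)
```

A report like this does not replace a conversation with the data provider, but it gives a team without dedicated data scientists a repeatable first pass: a low share of fresh or complete records is a signal to ask more questions before training on the data.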

Avoiding bias

Once a third-party data set has been sourced and merged with an organization’s proprietary data, insights can flow.

When used in aggregate, location data can be instrumental in a wide range of business scenarios. For example, responsible location data providers are building datasets based on important consumer personas to help brands train AI on real consumer interest and purchase intent, ranging from groups such as foodies and frequent shoppers to in-market car shoppers, retirees and new homeowners. Using insights derived from this data, a coffee chain may decide to open a new location near a train station in the city center after analysis shows a resurgence of foot traffic during peak hours in that city.

When analyzing the actual activities of your target audience, the data set should be fully representative of the population. Amazon, for example, was forced to scrap an AI-enabled recruiting platform when it was found to be biased against women because the datasets used to train it came from the company’s own hiring records. Amazon’s employment data was predominantly male-skewed, as more men than women have traditionally applied for and accepted positions in the tech industry, leading the algorithm to wrongly reject female applicants as unsuitable candidates. To avoid biases in these scenarios, it is best to keep improving and adapting the technology so that it does not create these biases as it learns from your data.
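One way to catch this kind of skew before training is to compare each group's share of the training data with its share of the population the model is meant to serve. The sketch below is illustrative only: the group labels, the reference shares and the 10-percentage-point tolerance are assumptions, and it checks representation in the data rather than bias in a trained model.

```python
from collections import Counter


def representation_gaps(labels, reference_shares, tolerance=0.10):
    """Return groups whose share in `labels` differs from the reference
    share by more than `tolerance` (observed minus expected)."""
    counts = Counter(labels)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps


# Hypothetical training set: 80 records from one group, 20 from another.
training_labels = ["group_a"] * 80 + ["group_b"] * 20
# Hypothetical shares in the population the model should serve.
reference = {"group_a": 0.5, "group_b": 0.5}

gaps = representation_gaps(training_labels, reference)
print(gaps)
```

A positive value means the group is over-represented in the training data and a negative value means it is under-represented. An empty result only says the shares are within tolerance for the groups listed, so the choice of groups and reference shares matters as much as the check itself.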

Other considerations

Fully representative datasets are the holy grail for machine learning. It is important to note, however, that care must be taken to ensure that information about specific individuals remains anonymous. Aggregate data is key here. Stripping personally identifiable information from data that fully represents a population before using it to train AI helps keep unconscious bias out of the algorithm and protects personal data.
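As a rough illustration of what aggregating data and removing personally identifiable information can look like, the sketch below reports only visit counts per place and suppresses places seen by too few distinct devices. The field names (`device_id`, `email`, `place`) and the minimum of two devices are hypothetical, and real anonymization usually requires more than this.

```python
from collections import defaultdict

MIN_DEVICES = 2  # hypothetical suppression threshold


def aggregate_visits(records, min_devices=MIN_DEVICES):
    """Count visits per place, keeping no per-person fields in the output."""
    visits = defaultdict(int)
    devices = defaultdict(set)
    for r in records:
        visits[r["place"]] += 1
        # Device IDs are used only to apply the threshold, never returned.
        devices[r["place"]].add(r["device_id"])
    # Suppress places seen by too few distinct devices, since small
    # groups are easier to tie back to an individual.
    return {
        place: count
        for place, count in visits.items()
        if len(devices[place]) >= min_devices
    }


# Made-up raw records, for illustration only.
raw = [
    {"device_id": "a", "email": "a@example.com", "place": "station_cafe"},
    {"device_id": "b", "email": "b@example.com", "place": "station_cafe"},
    {"device_id": "a", "email": "a@example.com", "place": "station_cafe"},
    {"device_id": "c", "email": "c@example.com", "place": "suburb_store"},
]
summary = aggregate_visits(raw)
print(summary)
```

The output contains place names and counts only, so the email addresses and device identifiers never reach the training data, and the single-device `suburb_store` entry is dropped rather than published.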

It is important for organizations to consider the accuracy of the location data they source, including whether the data is fully analyzed, validated and categorized. This includes investing time up front to ask the right questions about the source and quality of the data set and ensuring that it can be sliced and diced in a way that will generate the necessary information.

When the data is thoroughly evaluated and verified for accuracy, location data combined with artificial intelligence can give companies the ability to better understand their operations and customers in near real-time. With this information, organizations can identify their target audience and determine how best to interact with them. They can generate ideas for new products, services and locations, monitor their supply chain operations and more. With location data feeding AI algorithms, there are many insights and benefits companies can gain.

The Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
