Why More Data Isn't Always Better
Diminishing Returns of Big Data Volumes and the Curse of Dimensionality.
1. The Diminishing Returns of Big Data Volumes
Big data is “big” in several respects, including the volume of data points, the speed of data collection, and the variety of data sources. However, businesses can become caught up in amassing BIG volumes of data, and this is unwise. Indeed, data volume does not affect innovation performance[1] and has a negative effect on data veracity[2] and, in most circumstances, firm performance[3]. One reason for this is the difficulty of managing vast quantities of data and sorting out the noise[4]. Another is an inherent paradox of analysis: for every problem, there is a point at which bigger volumes of data stop producing better insights. Beyond this point, analyzing more of the same kind of data yields improvements so small as to be negligible. See the figure below for a visual representation.

Figure 2: The Diminishing Returns of Big Data: Balancing Data Volume, Insight Quality, and Cost
Volume-hungry firms are, therefore, paying to acquire, store, and maintain unproductive data. Hence, most companies do not need more data; they need better data.
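To make this concrete, the short Python sketch below fits the same simple model on progressively larger slices of a synthetic dataset and reports the test error at each size. The dataset, model, and sample sizes are illustrative assumptions rather than a prescription, but the flattening pattern it produces is typical.

    # Sketch: diminishing returns from adding more rows of the same kind of data.
    # Synthetic data and model choices are illustrative assumptions, not the source's.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_absolute_error

    X, y = make_regression(n_samples=60_000, n_features=100, noise=10.0, random_state=0)
    X_train, X_test = X[:50_000], X[50_000:]
    y_train, y_test = y[:50_000], y[50_000:]

    for n in (200, 2_000, 20_000, 50_000):
        model = Ridge().fit(X_train[:n], y_train[:n])
        err = mean_absolute_error(y_test, model.predict(X_test))
        print(f"{n:>6} training rows -> test MAE {err:.2f}")
    # Typically the error falls sharply at first, then flattens near the noise floor:
    # each additional block of data buys a smaller improvement than the last.
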
2. The Curse of Dimensionality
Understanding Dimensions in Data
In data analysis, the term "dimensions" refers to the variables or features that characterize data
points within a dataset. For instance, in a customer dataset, dimensions may encompass
attributes such as age, income, location, purchasing history, and product preferences. As the
number of dimensions increases, the complexity of the data space escalates, leading to
challenges in analysis and interpretation.
Imagine you're analyzing data to determine what makes the perfect pizza. Each characteristic of the pizza — such as crust type, sauce type, cheese amount, topping variety, topping quantity, bake time, and oven temperature — represents a separate dimension in your data. As you add more dimensions (like crust thickness, cheese blend, sauce acidity, and individual spices), the number of possible combinations increases exponentially. While you may think that including every minute detail will lead to better insights, too many dimensions can make it harder to identify the best combination for taste. The data space becomes sparse, and finding meaningful patterns across so many variables can lead to overfitting and noise rather than clear answers. Reducing the dimensions to the most critical factors, like crust type, cheese amount, and topping variety, can help simplify the analysis and reveal the combinations that customers prefer most efficiently.
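A quick back-of-the-envelope count shows how fast the space of pizza "recipes" explodes as dimensions are added; the number of levels assumed for each dimension below is made up purely for illustration.

    # Sketch: the number of pizza "recipes" to evaluate grows multiplicatively
    # with every dimension added. Level counts per dimension are assumptions.
    from math import prod

    dimensions = {
        "crust type": 4, "sauce type": 3, "cheese amount": 5,
        "topping variety": 10, "topping quantity": 4,
        "bake time": 6, "oven temperature": 5,
    }
    print(f"{len(dimensions)} dimensions -> {prod(dimensions.values()):,} possible combinations")

    # Adding just three more dimensions multiplies the search space again.
    dimensions.update({"crust thickness": 3, "cheese blend": 4, "sauce acidity": 3})
    print(f"{len(dimensions)} dimensions -> {prod(dimensions.values()):,} possible combinations")
    # With a fixed number of taste tests, each combination is observed rarely
    # (if at all), which is exactly the sparsity problem described here.
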
While it may seem that adding more dimensions (or features) should always improve the accuracy of models, this is not necessarily the case. In fact, when the number of dimensions becomes very high, it leads to what is known as the curse of dimensionality[5].
Why More Dimensions Can be Problematic
At first glance, collecting as many features as possible might seem beneficial; after all, more data should mean better insights, right? However, as the number of dimensions increases, the data becomes sparse. The volume of a high-dimensional space grows exponentially with each added dimension, so a fixed number of data points covers it ever more thinly, and the points end up far apart and nearly equidistant from one another, making it harder to identify meaningful patterns. As a result, algorithms that work well in low-dimensional spaces struggle to distinguish between noise and actual trends in high-dimensional datasets.
This sparsity means that models require exponentially more data to achieve the same level of confidence and accuracy. Without sufficient data, the model may overfit, capturing noise rather than the underlying patterns, leading to poor generalization to new data. Additionally, high-dimensional datasets require significantly more computational power, making the process of training models slower and more expensive.
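This loss of contrast between "near" and "far" is easy to verify. The sketch below assumes nothing more than random uniform points and measures how the ratio between the nearest and farthest neighbor distances behaves as the number of dimensions grows.

    # Sketch: distance concentration in high dimensions, using random uniform points.
    # As dimensionality grows, a point's nearest and farthest neighbours become
    # almost equally distant, so "closeness" loses its meaning.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1_000):
        X = rng.uniform(size=(500, d))
        dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
        ratio = dists.min() / dists.max()
        print(f"{d:>5} dims: nearest/farthest distance ratio = {ratio:.2f}")
    # Typical output climbs from near 0 in 2 dimensions toward 0.9 or more in
    # 1,000 dimensions, which is what makes pattern finding so much harder.
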
The Netflix Prize: When Less Can be More
One famous example illustrating the challenges of high-dimensional data is the Netflix Prize competition. In 2006, Netflix offered a $1 million prize to anyone who could improve the prediction accuracy of its movie recommendation algorithm by at least 10%. Competitors had access to a massive dataset of roughly 100 million ratings given by nearly half a million users across some 17,000 movies[6][7]. While the competition saw many innovative approaches, one key takeaway was that adding more features (dimensions) did not always result in better predictions[8][9].
Some teams tried to include every possible feature they could find—genres, actors, directors, release years, and even the day of the week when a rating was given. However, as they increased the number of features, their models became overly complex and prone to overfitting. In the end, teams that focused on selecting a smaller subset of the most relevant features, rather than using all the available data, often achieved better performance. The lesson from the Netflix Prize was clear: more data isn't always better, and sometimes reducing dimensional complexity leads to faster, more efficient, and even more accurate models[10].
Reducing Dimensional Complexity: A Practical Approach
To overcome the curse of dimensionality, data scientists often use techniques to reduce the
number of dimensions while retaining as much useful information as possible. One common
method is Principal Component Analysis (PCA), which transforms the original variables into a
smaller set of uncorrelated components, capturing the most variance in the data[11]. By reducing the dimensionality, PCA can make models faster and more
efficient, without sacrificing much predictive power.
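As a minimal sketch of how this looks in practice, the example below applies scikit-learn's PCA to synthetic data; the dataset and the 95% explained-variance threshold are illustrative assumptions.

    # Sketch: compressing many correlated features into a few principal components.
    # Synthetic data and the 95% variance threshold are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # 60 raw features that are actually driven by about 10 underlying factors.
    X, _ = make_classification(n_samples=2_000, n_features=60,
                               n_informative=10, n_redundant=50, random_state=0)

    # Keep as many components as needed to explain 95% of the variance.
    pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
    X_reduced = pca.fit_transform(X)

    print(f"original features: {X.shape[1]}, components kept: {X_reduced.shape[1]}")
    # Downstream models are then trained on X_reduced instead of X, which is
    # typically faster while retaining most of the structure in the data.
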
Another example comes from marketing analytics, where businesses collect data on customer demographics, purchase history, browsing behavior, social media interactions, and more[12][13][14]. With dozens or even hundreds of variables, the analysis can become overwhelming. By using dimensionality reduction techniques, businesses can condense those variables into a few impactful summaries, such as customer segments based on spending behavior or product preferences. This enables more efficient targeting of marketing campaigns, saving time and resources.
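A hypothetical version of that workflow might compress a wide customer table into a handful of composite factors before forming segments. The feature count, component count, and number of segments below are assumptions for illustration only, and the random data stands in for a real customer table.

    # Sketch: compress a wide (synthetic) customer table, then segment it.
    # Feature count, component count, and segment count are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    customers = rng.normal(size=(5_000, 80))   # stand-in for demographics, purchases, clicks, ...

    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=10),                          # 10 composite behavioral factors
        KMeans(n_clusters=4, n_init=10, random_state=0),  # 4 segments to target
    )
    segments = pipeline.fit_predict(customers)
    print(np.bincount(segments))                       # customers per segment
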
Balancing Data and Model Complexity
In some cases, focusing on fewer but more meaningful features can not only reduce
computational costs but also improve model interpretability[15]. For instance,
in fraud detection, adding too many irrelevant features can dilute the signal needed to detect
fraudulent behavior[16]. By carefully selecting only
the most relevant variables, businesses can build faster models that are easier to interpret, which
is crucial when dealing with regulatory scrutiny.
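One simple way to perform that selection, sketched here on synthetic fraud-like data, is to score every candidate feature against the label and keep only the strongest ones. The feature counts and the mutual-information criterion are illustrative choices, not a method prescribed by the cited work.

    # Sketch: keep only the features that carry signal about the fraud label.
    # Synthetic data; feature counts and the scoring criterion are assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import cross_val_score

    # 200 candidate features, but only a dozen actually relate to "fraud";
    # weights=[0.97] makes the positive class rare, as fraud usually is.
    X, y = make_classification(n_samples=3_000, n_features=200, n_informative=8,
                               n_redundant=4, weights=[0.97], random_state=0)

    # Score each feature against the label and keep the 15 strongest.
    selector = SelectKBest(score_func=mutual_info_classif, k=15)
    X_small = selector.fit_transform(X, y)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print(f"ROC AUC, all 200 features: {cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean():.3f}")
    print(f"ROC AUC, top 15 features : {cross_val_score(clf, X_small, y, cv=3, scoring='roc_auc').mean():.3f}")
    # The smaller model trains faster and is easier to explain to a regulator;
    # on data like this its accuracy is often as good or better. (In a real
    # system, do the selection inside each CV fold to avoid information leakage.)
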
When Less is More
The curse of dimensionality reminds us that more data isn't always better, especially if it
means introducing irrelevant or redundant features that slow down model training, increase costs,
and reduce predictive accuracy. Sometimes, reducing dimensional complexity is not only more
efficient but also more effective. By focusing on the most valuable features, businesses can
achieve faster, cheaper, and more accurate data-driven insights. In an era of big data, the ability
to filter out noise and focus on what truly matters is more important than ever.