Why More Data Isn't Always Better
Diminishing Returns of Big Data Volumes and the Curse of Dimensionality.
1. The Diminishing Returns of Big Data Volumes
Big data is “big” in several respects, including the volume of data points, the speed of data collection, and the variety of data sources. However, businesses can become caught up in amassing BIG volumes of data, and this is unwise. Indeed, data volume does not affect innovation performance[1] and has a negative effect on data veracity[2] and, in most circumstances, firm performance[3]. One reason for this is the difficulty of managing vast quantities of data and sorting out the noise[4]. Another is an inherent paradox of analysis: for every problem, there is a point at which bigger volumes of data stop producing better insights. Beyond this point, analyzing more of the same kind of data yields improvements so small as to be negligible. See the figure below for a visual representation.

Figure 2: The Diminishing Returns of Big Data: Balancing Data Volume, Insight Quality, and Cost
Volume-hungry firms are, therefore, paying to acquire, store, and maintain unproductive data. Hence, most companies do not need more data; they need better data.
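To make this concrete, the short Python sketch below fits the same simple model on progressively larger slices of a synthetic dataset and reports the test error at each size. The dataset, model, and sample sizes are illustrative assumptions rather than a prescription, but the flattening pattern it produces is typical.

    # Sketch: diminishing returns from adding more rows of the same kind of data.
    # Synthetic data and model choices are illustrative assumptions, not the source's.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_absolute_error

    X, y = make_regression(n_samples=60_000, n_features=100, noise=10.0, random_state=0)
    X_train, X_test = X[:50_000], X[50_000:]
    y_train, y_test = y[:50_000], y[50_000:]

    for n in (200, 2_000, 20_000, 50_000):
        model = Ridge().fit(X_train[:n], y_train[:n])
        err = mean_absolute_error(y_test, model.predict(X_test))
        print(f"{n:>6} training rows -> test MAE {err:.2f}")
    # Typically the error falls sharply at first, then flattens near the noise floor:
    # each additional block of data buys a smaller improvement than the last.
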
2. The Curse of Dimensionality
Understanding Dimensions in Data
In data analysis, the term "dimensions" refers to the variables or features that characterize data
points within a dataset. For instance, in a customer dataset, dimensions may encompass
attributes such as age, income, location, purchasing history, and product preferences. As the
number of dimensions increases, the complexity of the data space escalates, leading to
challenges in analysis and interpretation.
Imagine you're analyzing data to determine what makes the perfect pizza. Each characteristic of the pizza — such as crust type, sauce type, cheese amount, topping variety, topping quantity, bake time, and oven temperature — represents a separate dimension in your data. As you add more dimensions (like crust thickness, cheese blend, sauce acidity, and individual spices), the number of possible combinations increases exponentially. While you may think that including every minute detail will lead to better insights, too many dimensions can make it harder to identify the best combination for taste. The data space becomes sparse, and finding meaningful patterns across so many variables can lead to overfitting and noise rather than clear answers. Reducing the dimensions to the most critical factors, like crust type, cheese amount, and topping variety, can help simplify the analysis and reveal the combinations that customers prefer most efficiently.
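A quick back-of-the-envelope count shows how fast the space of pizza "recipes" explodes as dimensions are added; the number of levels assumed for each dimension below is made up purely for illustration.

    # Sketch: the number of pizza "recipes" to evaluate grows multiplicatively
    # with every dimension added. Level counts per dimension are assumptions.
    from math import prod

    dimensions = {
        "crust type": 4, "sauce type": 3, "cheese amount": 5,
        "topping variety": 10, "topping quantity": 4,
        "bake time": 6, "oven temperature": 5,
    }
    print(f"{len(dimensions)} dimensions -> {prod(dimensions.values()):,} possible combinations")

    # Adding just three more dimensions multiplies the search space again.
    dimensions.update({"crust thickness": 3, "cheese blend": 4, "sauce acidity": 3})
    print(f"{len(dimensions)} dimensions -> {prod(dimensions.values()):,} possible combinations")
    # With a fixed number of taste tests, each combination is observed rarely
    # (if at all), which is exactly the sparsity problem described here.
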
While it may seem that adding more dimensions (or features) should always improve the accuracy of models, this is not necessarily the case. In fact, when the number of dimensions becomes very high, it leads to what is known as the curse of dimensionality[5].
Why More Dimensions Can be Problematic
At first glance, collecting as many features as possible might seem beneficial; after all, more data should mean better insights, right? However, as the number of dimensions increases, the data becomes sparse. The volume of a high-dimensional space grows exponentially with each added dimension, so a fixed number of data points covers it ever more thinly, and the points end up far apart and nearly equidistant from one another, making it harder to identify meaningful patterns. As a result, algorithms that work well in low-dimensional spaces struggle to distinguish between noise and actual trends in high-dimensional datasets.
This sparsity means that models require exponentially more data to achieve the same level of confidence and accuracy. Without sufficient data, the model may overfit, capturing noise rather than the underlying patterns, leading to poor generalization to new data. Additionally, high-dimensional datasets require significantly more computational power, making the process of training models slower and more expensive.
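This loss of contrast between "near" and "far" is easy to verify. The sketch below assumes nothing more than random uniform points and measures how the ratio between the nearest and farthest neighbor distances behaves as the number of dimensions grows.

    # Sketch: distance concentration in high dimensions, using random uniform points.
    # As dimensionality grows, a point's nearest and farthest neighbours become
    # almost equally distant, so "closeness" loses its meaning.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1_000):
        X = rng.uniform(size=(500, d))
        dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
        ratio = dists.min() / dists.max()
        print(f"{d:>5} dims: nearest/farthest distance ratio = {ratio:.2f}")
    # Typical output climbs from near 0 in 2 dimensions toward 0.9 or more in
    # 1,000 dimensions, which is what makes pattern finding so much harder.
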
The Netflix Prize: When Less Can be More
One famous example illustrating the challenges of high-dimensional data is the Netflix Prize competition. In 2006, Netflix offered a $1 million prize to anyone who could improve the prediction accuracy of its movie recommendation algorithm by at least 10%. Competitors had access to a massive dataset of roughly 100 million ratings given by nearly half a million users across some 17,000 movies[6][7]. While the competition saw many innovative approaches, one key takeaway was that adding more features (dimensions) did not always result in better predictions[8][9].
Some teams tried to include every possible feature they could find—genres, actors, directors, release years, and even the day of the week when a rating was given. However, as they increased the number of features, their models became overly complex and prone to overfitting. In the end, teams that focused on selecting a smaller subset of the most relevant features, rather than using all the available data, often achieved better performance. The lesson from the Netflix Prize was clear: more data isn't always better, and sometimes reducing dimensional complexity leads to faster, more efficient, and even more accurate models[10].
Reducing Dimensional Complexity: A Practical Approach
To overcome the curse of dimensionality, data scientists often use techniques to reduce the
number of dimensions while retaining as much useful information as possible. One common
method is Principal Component Analysis (PCA), which transforms the original variables into a
smaller set of uncorrelated components, capturing the most variance in the data[11]. By reducing the dimensionality, PCA can make models faster and more
efficient, without sacrificing much predictive power.
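As a minimal sketch of how this looks in practice, the example below applies scikit-learn's PCA to synthetic data; the dataset and the 95% explained-variance threshold are illustrative assumptions.

    # Sketch: compressing many correlated features into a few principal components.
    # Synthetic data and the 95% variance threshold are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # 60 raw features that are actually driven by about 10 underlying factors.
    X, _ = make_classification(n_samples=2_000, n_features=60,
                               n_informative=10, n_redundant=50, random_state=0)

    # Keep as many components as needed to explain 95% of the variance.
    pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
    X_reduced = pca.fit_transform(X)

    print(f"original features: {X.shape[1]}, components kept: {X_reduced.shape[1]}")
    # Downstream models are then trained on X_reduced instead of X, which is
    # typically faster while retaining most of the structure in the data.
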
Another example comes from marketing analytics, where businesses collect data on customer demographics, purchase history, browsing behavior, social media interactions, and more[12][13][14]. With dozens or even hundreds of variables, the analysis can become overwhelming. By using dimensionality reduction techniques, businesses can condense those variables into a few impactful summaries, such as customer segments based on spending behavior or product preferences. This enables more efficient targeting of marketing campaigns, saving time and resources.
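A hypothetical version of that workflow might compress a wide customer table into a handful of composite factors before forming segments. The feature count, component count, and number of segments below are assumptions for illustration only, and the random data stands in for a real customer table.

    # Sketch: compress a wide (synthetic) customer table, then segment it.
    # Feature count, component count, and segment count are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    customers = rng.normal(size=(5_000, 80))   # stand-in for demographics, purchases, clicks, ...

    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=10),                          # 10 composite behavioral factors
        KMeans(n_clusters=4, n_init=10, random_state=0),  # 4 segments to target
    )
    segments = pipeline.fit_predict(customers)
    print(np.bincount(segments))                       # customers per segment
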
Balancing Data and Model Complexity
In some cases, focusing on fewer but more meaningful features can not only reduce
computational costs but also improve model interpretability[15]. For instance,
in fraud detection, adding too many irrelevant features can dilute the signal needed to detect
fraudulent behavior[16]. By carefully selecting only
the most relevant variables, businesses can build faster models that are easier to interpret, which
is crucial when dealing with regulatory scrutiny.
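One simple way to perform that selection, sketched here on synthetic fraud-like data, is to score every candidate feature against the label and keep only the strongest ones. The feature counts and the mutual-information criterion are illustrative choices, not a method prescribed by the cited work.

    # Sketch: keep only the features that carry signal about the fraud label.
    # Synthetic data; feature counts and the scoring criterion are assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import cross_val_score

    # 200 candidate features, but only a dozen actually relate to "fraud";
    # weights=[0.97] makes the positive class rare, as fraud usually is.
    X, y = make_classification(n_samples=3_000, n_features=200, n_informative=8,
                               n_redundant=4, weights=[0.97], random_state=0)

    # Score each feature against the label and keep the 15 strongest.
    selector = SelectKBest(score_func=mutual_info_classif, k=15)
    X_small = selector.fit_transform(X, y)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print(f"ROC AUC, all 200 features: {cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean():.3f}")
    print(f"ROC AUC, top 15 features : {cross_val_score(clf, X_small, y, cv=3, scoring='roc_auc').mean():.3f}")
    # The smaller model trains faster and is easier to explain to a regulator;
    # on data like this its accuracy is often as good or better. (In a real
    # system, do the selection inside each CV fold to avoid information leakage.)
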
When Less is More
The curse of dimensionality reminds us that more data isn't always better, especially if it
means introducing irrelevant or redundant features that slow down model training, increase costs,
and reduce predictive accuracy. Sometimes, reducing dimensional complexity is not only more
efficient but also more effective. By focusing on the most valuable features, businesses can
achieve faster, cheaper, and more accurate data-driven insights. In an era of big data, the ability
to filter out noise and focus on what truly matters is more important than ever.