Wednesday, January 22, 2025

Unlocking the Full Potential of Data Scientists – O’Reilly

Modern organizations regard data as a strategic asset that drives efficiency, enhances decision making, and creates new value for customers. Across the organization—product management, marketing, operations, finance, and more—teams are overflowing with ideas on how data can elevate the business. To bring these ideas to life, companies are eagerly hiring data scientists for their technical skills (Python, statistics, machine learning, SQL, etc.).

Despite this enthusiasm, many companies are significantly underutilizing their data scientists. Organizations remain narrowly focused on employing data scientists to execute preexisting ideas, overlooking the broader value they bring. Beyond their skills, data scientists possess a unique perspective that allows them to come up with innovative business ideas of their own—ideas that are novel, strategic, or differentiating and are unlikely to come from anyone but a data scientist.



Misplaced Focus on Skills and Execution

Sadly, many companies behave in ways that suggest they are uninterested in the ideas of data scientists. Instead, they treat data scientists as a resource to be used for their skills alone. Functional teams provide requirements documents with fully specified plans: “Here’s how you are to build this new system for us. Thank you for your partnership.” No context is provided, and no input is sought, other than an estimate for delivery. Data scientists are further inundated with ad hoc requests for tactical analyses or operational dashboards.[1] The backlog of requests grows so large that the work queue is managed through Jira-style ticketing systems, which strip the requests of any business context (e.g., “get me the top products purchased by VIP customers”). One request begets another,[2] creating a Sisyphean endeavor that leaves no time for data scientists to think for themselves. And then there’s the myriad of opaque requests for data pulls: “Please get me this data so I can analyze it.” This is marginalizing, like asking Steph Curry to pass the ball so you can take the shot. It’s not a partnership; it’s a subordination that reduces data science to a mere support function, executing ideas from other teams. While executing tasks may produce some value, it won’t tap into the full potential of what data scientists truly have to offer.

It’s the Ideas

The untapped potential of data scientists lies not in their ability to execute requirements or requests but in their ideas for transforming a business. By “ideas” I mean new capabilities or strategies that can move the business in better or new directions, leading to increased[3] revenue, profit, or customer retention while simultaneously providing a sustainable competitive advantage (i.e., capabilities or strategies that are difficult for competitors to replicate). These ideas often take the form of machine learning algorithms that can automate decisions within a production system.[4] For example, a data scientist might develop an algorithm to better manage inventory by optimally balancing overage and underage costs. Or they might create a model that detects hidden customer preferences, enabling more effective personalization. If these sound like business ideas, that’s because they are, but they’re not likely to come from business teams. Ideas like these typically emerge from data scientists, whose unique cognitive repertoires and observations in the data make them well-suited to uncovering such opportunities.
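The inventory example can be made concrete with a minimal sketch of the classic newsvendor calculation. It assumes demand is normally distributed, and the function name, demand parameters, and unit costs below are hypothetical values chosen for illustration, not figures from any real system. The order quantity sits at the critical fractile: the underage cost divided by the sum of the underage and overage costs.

```python
from statistics import NormalDist


def newsvendor_quantity(mean_demand, sd_demand, underage_cost, overage_cost):
    """Order quantity that balances underage and overage costs.

    Assumption: demand ~ Normal(mean_demand, sd_demand). Real demand is
    rarely this tidy; this is an illustration, not a production model.
    """
    # Critical fractile: the in-stock probability that the cost ratio justifies.
    critical_fractile = underage_cost / (underage_cost + overage_cost)
    return NormalDist(mean_demand, sd_demand).inv_cdf(critical_fractile)


# Hypothetical inputs: a stockout costs 3 per unit, leftover stock costs 1 per unit.
order_qty = newsvendor_quantity(100, 20, underage_cost=3.0, overage_cost=1.0)

# When both costs are equal, the best order is simply the mean demand.
balanced_qty = newsvendor_quantity(100, 20, underage_cost=1.0, overage_cost=1.0)

print(round(order_qty, 1), round(balanced_qty, 1))
```

Because a stockout is assumed to cost more than leftover stock here, the suggested order lands above mean demand; reversing the costs would push it below.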

Ideas That Leverage Unique Cognitive Repertoires

A cognitive repertoire is the range of tools, strategies, and approaches an individual can draw upon for thinking, problem-solving, or processing information (Page 2017). These repertoires are shaped by our backgrounds—education, experience, training, and so on. Members of a given functional team often have similar repertoires due to their shared backgrounds. For example, marketers are taught frameworks like SWOT analysis and ROAS, while finance professionals learn models such as ROIC and Black-Scholes.

Data scientists have a distinctive cognitive repertoire. While their academic backgrounds may vary (ranging from statistics to computer science to computational neuroscience), they typically share a quantitative tool kit. This includes frameworks for widely applicable problems, often with accessible names like the “newsvendor model,” the “traveling salesman problem,” the “birthday problem,” and many others. Their tool kit also includes knowledge of machine learning algorithms[5] like neural networks, clustering, and principal components, which are used to find empirical solutions to complex problems. Additionally, it includes heuristics such as big O notation, the central limit theorem, and significance thresholds. All of these constructs can be expressed in a common mathematical language, making them easily transferable across different domains, including business, perhaps especially business.

The repertoires of data scientists are particularly relevant to business innovation since, in many industries,[6] the conditions for learning from data are nearly ideal in that they have high-frequency events, a clear objective function,[7] and timely and unambiguous feedback. Retailers have millions of transactions that produce revenue. A streaming service sees millions of viewing events that signal customer interest. And so on: millions or billions of events with clear signals that are revealed quickly. These are the units of induction that form the basis for learning, especially when aided by machines. The data science repertoire, with its unique frameworks, machine learning algorithms, and heuristics, is remarkably geared for extracting knowledge from large volumes of event data.

Ideas are born when cognitive repertoires connect with business context. A data scientist, while attending a business meeting, will regularly experience pangs of inspiration. Her eyebrows rise from behind her laptop as an operations manager describes an inventory perishability problem, lobbing the phrase “We need to buy enough, but not too much.” “Newsvendor model,” the data scientist whispers to herself. A product manager asks, “How is this process going to scale as the number of products increases?” The data scientist involuntarily scribbles “O(N²)” on her notepad, which is big O notation to indicate that the process will scale superlinearly. And when a marketer brings up the topic of customer segmentation, bemoaning, “There are so many customer attributes. How do we know which ones are most important?,” the data scientist sends a text to cancel her evening plans. Instead, tonight she will eagerly try running principal component analysis on the customer data.[8]

No one was asking for ideas. This was merely a tactical meeting with the goal of reviewing the state of the business. Yet the data scientist is practically goaded into ideating. “Oh, oh. I got this one,” she says to herself. Ideation can even be hard to suppress. Yet many companies unintentionally seem to suppress that creativity. In reality our data scientist probably wouldn’t have been invited to that meeting. Data scientists are not typically invited to operating meetings. Nor are they typically invited to ideation meetings, which are often limited to the business teams. Instead, the meeting group will assign the data scientist Jira tickets of tasks to execute. Without the context, the tasks will fail to inspire ideas. The cognitive repertoire of the data scientist goes unleveraged—a missed opportunity to be sure.

Ideas Born from Observation in the Data

Beyond their cognitive repertoires, data scientists bring another key advantage that makes their ideas uniquely valuable. Because they are so deeply immersed in the data, data scientists discover unforeseen patterns and insights that inspire novel business ideas. They are novel in the sense that no one would have thought of them—not product managers, executives, marketers—not even a data scientist for that matter. There are many ideas that cannot be conceived of but rather are revealed by observation in the data.

Company data repositories (data warehouses, data lakes, and the like) contain a primordial soup of insights lying fallow in the information. As they do their work, data scientists often stumble upon intriguing patterns—an odd-shaped distribution, an unintuitive relationship, and so forth. The surprise finding piques their curiosity, and they explore further.

Imagine a data scientist doing her work, executing on an ad hoc request. She is asked to compile a list of the top products purchased by a particular customer segment. To her surprise, the products bought by the various segments are hardly different at all. Most products are bought at about the same rate by all segments. Weird. The segments are based on profile descriptions that customers opted into, and for years the company had assumed them to be meaningful groupings useful for managing products. “There must be a better way to segment customers,” she thinks. She explores further, launching an informal, impromptu analysis. No one is asking her to do this, but she can’t help herself. Rather than relying on the labels customers use to describe themselves, she focuses on their actual behavior: what products they click on, view, like, or dislike. Through a combination of quantitative techniques—matrix factorization and principal component analysis—she comes up with a way to place customers into a multidimensional space. Clusters of customers adjacent to one another in this space form meaningful groupings that better reflect customer preferences. The approach also provides a way to place products into the same space, allowing for distance calculations between products and customers. This can be used to recommend products, plan inventory, target marketing campaigns, and many other business applications. All of this is inspired from the surprising observation that the tried-and-true customer segments did little to explain customer behavior. Solutions like this have to be driven by observation since, absent the data saying otherwise, no one would have thought to inquire about a better way to group customers.
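A minimal sketch of the shared customer-product space described above, using a truncated SVD as one simple form of matrix factorization. The affinity matrix, the function names, and the choice of two dimensions are all hypothetical; the story does not say exactly how the data scientist built her space, so treat this as an illustration of the idea rather than her method.

```python
import numpy as np

# Hypothetical affinity scores built from behavior (clicks, views, likes,
# dislikes): rows are customers, columns are products.
affinity = np.array([
    [ 1.5,  0.5, -0.5, -1.5],
    [ 0.5,  1.5, -1.5, -0.5],
    [-0.5, -1.5,  1.5,  0.5],
    [-1.5, -0.5,  0.5,  1.5],
])


def embed(matrix, k=2):
    """Factorize matrix ~= customers @ products.T with a rank-k truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    scale = np.sqrt(s[:k])           # split each singular value evenly
    customers = u[:, :k] * scale     # one k-dimensional point per customer
    products = vt[:k, :].T * scale   # one k-dimensional point per product
    return customers, products


def nearest_products(customer_idx, customers, products):
    """Product indices ordered from closest to farthest from one customer."""
    distances = np.linalg.norm(products - customers[customer_idx], axis=1)
    return np.argsort(distances).tolist()


customers, products = embed(affinity)
ranking = nearest_products(0, customers, products)
print(ranking)
```

For this toy matrix, the ranking for customer 0 follows the order of the scores in the first row, so the closest product in the space is the one that customer engaged with most. The same distances between customer points could feed a clustering step to form behavior-based segments.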

As a side note, the principal component algorithm that the data scientist used belongs to a class of algorithms called “unsupervised learning,” which further exemplifies the concept of observation-driven insights. Unlike “supervised learning,” in which the user instructs the algorithm what to look for, an unsupervised learning algorithm lets the data describe how it is structured. It is evidence based; it quantifies and ranks each dimension, providing an objective measure of relative importance. The data does the talking. Too often we try to direct the data to yield to our human-conceived categorization schemes, which are familiar and convenient to us, evoking visceral and stereotypical archetypes. It’s satisfying and intuitive but often flimsy and fails to hold up in practice.
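To illustrate what “quantifies and ranks each dimension” can mean in practice, here is a small sketch that computes the share of variance explained by each principal component. The attribute values and the function name are made up for the example; only the NumPy calls are real.

```python
import numpy as np

# Hypothetical customer attributes (rows = customers, columns = attributes).
attributes = np.array([
    [ 2.0,  2.0],
    [-2.0, -2.0],
    [ 0.5, -0.5],
    [-0.5,  0.5],
])


def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    # Singular values come back in descending order, so components are ranked.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


ratios = explained_variance_ratio(attributes)
print(ratios)
```

In this toy data the first component carries 16/17 of the variance, so nearly all of the structure lies along a single direction that nobody had to name in advance.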

Examples like this are not rare. When immersed in the data, it’s hard for data scientists not to come upon unexpected findings. And when they do, it’s even harder for them to resist further exploration; curiosity is a powerful motivator. Of course, the data scientist in our example exercised her cognitive repertoire to do the work, but the entire analysis was inspired by observation of the data. For the company, such distractions are a blessing, not a curse. I’ve seen this sort of undirected research lead to better inventory management practices, better pricing structures, new merchandising strategies, improved user experience designs, and many other capabilities, none of which were asked for but instead were discovered by observation in the data.

Isn’t discovering new insights the data scientist’s job? Yes—that’s exactly the point of this article. The problem arises when data scientists are valued only for their technical skills. Viewing them solely as a support team limits them to answering specific questions, preventing deeper exploration of insights in the data. The pressure to respond to immediate requests often causes them to overlook anomalies, unintuitive results, and other potential discoveries. If a data scientist were to suggest some exploratory research based on observations, the response is almost always, “No, just focus on the Jira queue.” Even if they spend their own time—nights and weekends—researching a data pattern that leads to a promising business idea, it may still face resistance simply because it wasn’t planned or on the roadmap. Roadmaps tend to be rigid, dismissing new opportunities, even valuable ones. In some organizations, data scientists may pay a price for exploring new ideas. Data scientists are often judged by how well they serve functional teams, responding to their requests and fulfilling short-term needs. There is little incentive to explore new ideas when doing so detracts from a performance review. In reality, data scientists frequently find new insights in spite of their jobs, not because of them.

Ideas That Are Different

These two things—their cognitive repertoires and observations from the data—make the ideas that come from data scientists uniquely valuable. This is not to suggest that their ideas are necessarily better than those from the business teams. Rather, their ideas are different from those of the business teams. And being different has its own set of benefits.

Having a seemingly good business idea doesn’t guarantee that the idea will have a positive impact. Evidence suggests that most ideas will fail. When properly measured for causality,[9] the vast majority of business ideas either fail to show any impact at all or actually hurt metrics. (See some statistics here.) Given the poor success rates, innovative companies construct portfolios of ideas in the hopes that at least a few successes will allow them to reach their goals. Still savvier companies use experimentation[10] (A/B testing) to try their ideas on small samples of customers, allowing them to assess the impact before deciding to roll them out more broadly.

This portfolio approach, combined with experimentation, benefits from both the quantity and diversity of ideas.[11] It’s similar to diversifying a portfolio of stocks. Increasing the number of ideas in the portfolio increases exposure to a positive outcome: an idea that makes a material positive impact on the company. Of course, as you add ideas, you also increase the risk of bad outcomes: ideas that do nothing or even have a negative impact. However, many ideas are reversible, the “two-way door” that Amazon’s Jeff Bezos speaks of (Haden 2018). Ideas that don’t produce the expected results can be pruned after being tested on a small sample of customers, greatly mitigating the impact, while successful ideas can be rolled out to all relevant customers, greatly amplifying the impact.

So, adding ideas to the portfolio increases exposure to upside without a lot of downside: the more, the better.[12] However, there is an assumption that the ideas are independent (uncorrelated). If all the ideas are similar, then they may all succeed or fail together. This is where diversity comes in. Ideas from different groups will leverage divergent cognitive repertoires and different sets of information. This makes them different and less likely to be correlated with each other, producing more varied outcomes. For stocks, the return on a diverse portfolio will be the average of the returns for the individual stocks. However, for ideas, since experimentation lets you mitigate the bad ones and amplify the good ones, the return of the portfolio can be closer to the return of the best idea (Page 2017).

In addition to building a portfolio of diverse ideas, a single idea can be significantly strengthened through collaboration between data scientists and business teams.[13] When they work together, their combined repertoires fill in each other’s blind spots (Page 2017).[14] By merging the unique expertise and insights from multiple teams, ideas become more robust, much like how diverse groups tend to excel in trivia competitions. However, organizations must ensure that true collaboration happens at the ideation stage rather than dividing responsibilities such that business teams focus solely on generating ideas and data scientists are relegated to execution.

Cultivating Ideas

Data scientists are much more than a skilled resource for executing existing ideas; they are a wellspring of novel, innovative thinking. Their ideas are uniquely valuable because (1) their cognitive repertoires are highly relevant to businesses with the right conditions for learning, (2) their observations in the data can lead to novel insights, and (3) their ideas differ from those of business teams, adding diversity to the company’s portfolio of ideas.

However, organizational pressures often prevent data scientists from fully contributing their ideas. Overwhelmed with skill-based tasks and deprived of business context, they are incentivized to merely fulfill the requests of their partners. This pattern exhausts the team’s capacity for execution while leaving their cognitive repertoires and insights largely untapped.

Here are some suggestions that organizations can follow to better leverage data scientists and shift their roles from mere executors to active contributors of ideas:

  • Give them context, not tasks. Providing data scientists with tasks or fully specified requirements documents will get them to do work, but it won’t elicit their ideas. Instead, give them context. If an opportunity is already identified, describe it broadly through open dialogue, allowing them to frame the problem and propose solutions. Invite data scientists to operational meetings where they can absorb context, which may inspire new ideas for opportunities that haven’t yet been considered.
  • Create slack for exploration. Companies often completely overwhelm data scientists with tasks. It may seem paradoxical, but keeping resources 100% utilized is very inefficient.[15] Without time for exploration and unexpected learning, data science teams can’t reach their full potential. Protect some of their time for independent research and exploration, using tactics like Google’s 20% time or similar approaches.
  • Eliminate the task management queue. Task queues create a transactional, execution-focused relationship with the data science team. Priorities, if assigned top-down, should be given in the form of general, unframed opportunities that need real conversations to provide context, goals, scope, and organizational implications. Priorities might also emerge from within the data science team, requiring support from functional partners, with the data science team providing the necessary context. We don’t assign Jira tickets to product or marketing teams, and data science should be no different.
  • Hold data scientists accountable for real business impact. Measure data scientists by their impact on business outcomes, not just by how well they support other teams. This gives them the agency to prioritize high-impact ideas, regardless of the source. Additionally, tying performance to measurable business impact[16] clarifies the opportunity cost of low-value ad hoc requests.[17]
  • Hire for adaptability and broad skill sets. Look for data scientists who thrive in ambiguous, evolving environments where clear roles and responsibilities may not always be defined. Prioritize candidates with a strong desire for business impact,[18] who see their skills as tools to drive outcomes, and who excel at identifying new opportunities aligned with broad company goals. Hiring for diverse skill sets enables data scientists to build end-to-end systems, minimizing the need for handoffs and reducing coordination costs, which is especially critical during the early stages of innovation when iteration and learning are most important.[19]
  • Hire functional leaders with growth mindsets. In new environments, avoid leaders who rely too heavily on what worked in more mature settings. Instead, seek leaders who are passionate about learning and who value collaboration, leveraging diverse perspectives and information sources to fuel innovation.
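The claim in the “Create slack for exploration” suggestion, that running a team at 100% utilization is inefficient, can be illustrated with a standard queueing result. This sketch assumes a single-server M/M/1 queue (random arrivals, random task durations), which is a textbook simplification introduced here and not a model the article itself proposes; the function name is hypothetical.

```python
def relative_wait(utilization):
    """Average time a request waits in the queue, measured in units of the
    average time it takes to do one request, for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


for u in (0.5, 0.8, 0.9, 0.99):
    print(f"{u:.0%} utilized: wait is about {relative_wait(u):.0f}x the work time")
```

Under this assumption, waiting time grows without bound as utilization approaches 100%, so a fully loaded team responds slowly to everything, including the unexpected opportunities that slack is meant to protect.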

These suggestions require an organization with the right culture and values. The culture needs to embrace experimentation to measure the impact of ideas and to recognize that many will fail. It needs to value learning as an explicit goal and understand that, for some industries, the vast majority of knowledge has yet to be discovered. It must be comfortable relinquishing the clarity of command-and-control in exchange for innovation. While this is easier to achieve in a startup, these suggestions can guide mature organizations toward evolving with experience and confidence. Shifting an organization’s focus from execution to learning is a challenging task, but the rewards can be immense or even crucial for survival. For most modern firms, success will depend on their ability to harness human potential for learning and ideation—not just execution (Edmondson 2012). The untapped potential of data scientists lies not in their ability to execute existing ideas but in the new and innovative ideas no one has yet imagined.


Footnotes

  1. To be sure, dashboards have value in providing visibility into business operations. However, dashboards are limited in their ability to provide actionable insights. Aggregated data is typically so full of confounders and systemic bias that it is rarely appropriate for decision making. The resources required to build and maintain dashboards need to be balanced against other initiatives the data science team could be doing that might produce more impact.
  2. It’s a well-known phenomenon that data-related inquiries tend to evoke more questions than they answer.
  3. I used “increased” in place of “incremental” since the latter is associated with “small” or “marginal.” The impact from data science initiatives can be substantial. I use the term here to indicate the impact as an improvement—though without a fundamental change to the existing business model.
  4. As opposed to data used for human consumption, such as short summaries or dashboards, which do have value in that they inform our human workers but are typically limited in direct actionability.
  5. I resist referring to knowledge of the various algorithms as skills since I feel it’s more important to emphasize their conceptual appropriateness for a given situation versus the pragmatics of training or implementing any particular approach.
  6. Industries such as ecommerce, social networks, and streaming content have favorable conditions for learning in comparison to fields like medicine, where the frequency of events is much lower and the time to feedback is much longer. Additionally, in many aspects of medicine, the feedback can be very ambiguous.
  7. Typically revenue, profit, or user retention. However, it can be challenging for a company to identify a single objective function.
  8. Voluntary tinkering is common among data scientists and is driven by curiosity, the desire for impact, the desire for experience, etc.
  9. Admittedly, the data available on the success rates of business ideas is likely biased in that most of it comes from tech companies experimenting with online services. However, at least anecdotally, the low success rates seem to be consistent across other types of business functions, industries, and domains.
  10. Not all ideas are conducive to experimentation due to unattainable sample size, inability to isolate experimentation arms, ethical concerns, or other factors.
  11. I purposely exclude the notion of “quality of idea” since, in my experience, I’ve seen little evidence that an organization can discern the “better” ideas within the pool of candidates.
  12. Often, the real cost of developing and trying an idea is the human resources—engineers, data scientists, PMs, designers, etc. These resources are fixed in the short term and act as a constraint to the number of ideas that can be tried in a given time period.
  13. See Duke University professor Martin Ruef, who studied the coffee house model of innovation (the coffee house is an analogy for bringing diverse people together to chat). Diverse networks are 3x more innovative than linear networks (Ruef 2002).
  14. The data scientists will appreciate the analogy to ensemble models, where errors from individual models can offset each other.
  15. See The Goal, by Eliyahu M. Goldratt, which articulates this point in the context of supply chains and manufacturing lines. Maintaining resources at a level above the current needs enables the firm to take advantage of unexpected surges in demand, which more than pays for itself. The practice works for human resources as well.
  16. Causal measurement via randomized controlled trials is ideal, to which algorithmic capabilities are very amenable.
  17. Admittedly, the value of an ad hoc request is not always clear. But there should be a high bar to consume data science resources. A Jira ticket is far too easy to submit. If a topic is important enough, it will merit a meeting to convey context and opportunity.
  18. If you are reading this and find yourself skeptical that your data scientist who spends his time dutifully responding to Jira tickets is capable of coming up with a good business idea, you are likely not wrong. Those comfortable taking tickets are probably not innovators or have been so inculcated to a support role that they have lost the will to innovate.
  19. As the system matures, more specialized resources can be added to make the system more robust. This can create a scramble. However, by finding success first, we are more judicious with our precious development resources.

References

  1. Page, Scott E. 2017. The Diversity Bonus. Princeton University Press.
  2. Edmondson, Amy C. 2012. Teaming: How Organizations Learn, Innovate, and Compete in the Knowledge Economy. Jossey-Bass.
  3. Haden, Jeff. 2018. “Amazon Founder Jeff Bezos: This Is How Successful People Make Such Smart Decisions.” Inc., December 3. https://www.inc.com/jeff-haden/amazon-founder-jeff-bezos-this-is-how-successful-people-make-such-smart-decisions.html.
  4. Ruef, Martin. 2002. “Strong Ties, Weak Ties and Islands: Structural and Cultural Predictors of Organizational Innovation.” Industrial and Corporate Change 11 (3): 427–449. https://doi.org/10.1093/icc/11.3.427.

