Building a team of “Citizen Data Scientists” – a practical experience view
Unless you are allergic to every new hype word coming out of Gartner and other high-profile research firms, you have probably heard the term “Citizen Data Scientist” a few times over the past 12 months. It is attributed to Gartner’s director and BI industry expert Alexander Linden, who suggested that companies interested in data science should be:
cultivating “citizen data scientists”—people on the business side that may have some data skills, possibly from a math or even social science degree—and putting them to work exploring and analyzing data.
His point was that advanced data analytic tools are becoming ubiquitous and easier to use. However, people with highly specialized skills and a deep background in machine learning, statistics and other quantitative sciences, the so-called “Data Scientists”, are still hard to find and expensive. Therefore Linden suggested that we will soon see business people without a traditional “white coat data scientist” profile taking advantage of advanced analytics in data-focused organizations.
This is an interesting concept indeed, and it has been received with mixed reviews. While a recent Forbes article called it the “Democratization of Big Data”, on more technical forums it has been referred to as a “mirage”. Not everyone agrees that it is possible, wise or generally a good idea to take people who are not “qualified” by training to crunch numbers and hand them tools with intimidating names such as Support Vector Machines, Decision Trees, Neural Networks and Principal Components Analysis. Most importantly, is it a good idea to entrust such “citizens” with decisions that until now were strictly the responsibility of Ph.D.-bearing Data Masters who are as hard to find and as expensive as truffles?
Before I lay out my starting position on the topic of citizen data scientists, full disclosure: I am one of those fellows who used to be called a “Knowledge Discovery Engineer” in the 90s, a “Data Miner” a few years later and a “Data Scientist” over the past 3 or 4 years. But personally I do not feel intimidated, nor offended, by the idea that you may not need a Ph.D. to be a decent Data Scientist, or at least a person capable of making sound use of data. Actually, I have personally helped organizations do exactly that, several times, over the past 10 years. Based on that experience, I’m offering this practitioner’s view into the making of the “Citizen Data Scientist”.
Some organizations have simply no options
The main reason I have had several opportunities to help build teams of citizen data scientists is probably the vertical where we started 13 years ago: government, and specifically tax and revenue.
Anyone would agree that a tax agency has quite a bit of data to crunch and an important mandate to ensure compliance, which advanced analytics can definitely help with. Small incremental improvements in detecting fraud and noncompliance and in managing tax collection can easily turn into millions, if not billions, of dollars in additional revenue to support our communities. However, tax agencies, like most government institutions, are not typically on the short list of employers for top university graduates with advanced degrees in statistics, machine learning or computer science. While I can assure you that the work would be challenging, on the pay scale it’s hard for many government agencies to compete with Silicon Valley or Wall Street.
Many private companies may be in the same situation. Although they may be able to pay the high salaries that the best data scientists demand, they may not be as attractive as technology companies for which data science is a core competency. Also, private companies may have a hard time selecting the right candidates simply because they do not know what skills to look for and how to evaluate them.
Thus, for many organizations the idea of enabling smart, data-driven employees who know the business very well and are willing to learn new skills is definitely a compelling path, and possibly the only path. The good news is that, in our experience, it can work – with some caution and considerable patience and investment.
Learning by doing is the key
Over a decade ago now, one of our first clients, a national tax agency, decided to assemble their first “Data Mining” team to deploy predictive analytics for tax collection and filing enforcement (the process to identify and “qualify” people or businesses that failed to file). The ranks of candidates included former collectors, auditors, business analysts and some IT folks with prior Business Intelligence (BI) and database skills. Education ranged from degrees in education or social sciences to MBAs to computer science. None of them had done significant coursework in statistics or math. Thus, our task was to turn the “improbable army” of candidate data miners (the term was still in vogue at the time) into an effective team capable of designing, evaluating and deploying predictive models.
Armed with a stack of licenses for a visual-programming-oriented statistics tool and plenty of data in our newly created data mart, we ventured with our team of enthusiastic yet somewhat frightened data miner recruits into a series of actual modeling projects. We started with real goals and we worked hard side-by-side, letting them “drive” and serving as dedicated co-pilots. We explained what needed to be explained at the right time. Initially, it was mainly about familiarizing the team with the tools, and even teaching the basics of the database queries behind the major data integration steps we were implementing. Without some basic understanding of SQL, indexes and primary keys, it is easy for a newbie to get stuck right at the start, overwhelmed by data volumes and query response lags. But the citizen data scientist has to learn the ropes and be capable of starting from raw data, working his or her way up to data understanding, pattern understanding and predictive modeling.
Nearly all of them were worried at the beginning about their inability to fully understand the workings of the many algorithms that their software tool included. Should they use a neural network or a decision tree? And what about clustering? We smiled and told them not to worry about that yet. We continued to focus on the business problem, on the conceptual approach, and on the understanding of the data. They were the “owners”, but it was a truly collaborative development effort.
Our joint efforts led to several initial models, and we designed pilot projects to put them into practice. The pilots were successful, the management and the team members gained confidence in the results and in their own capabilities, the team grew and they started to develop some real skills. We worked with this team for a few years, initially in a collaborative development mode and very hands-on, later in a mentoring and model validation role. They called us back in particular when they needed to tackle a new problem, to validate the approach. They were capable of following the entire process as far as data understanding, manipulation and model development. We were pleased.
Over the course of the following decade we engaged with many other government agencies and private sector clients. In some cases our role was primarily that of designer and developer of various predictive modeling artifacts, which we then transferred to the client. In doing that we were often asked to provide knowledge transfer to one or more people charged with maintenance and possibly enhancements of the models. In many of those situations, I must say, the experiment of creating “citizen data scientists” mostly failed. Not “doing” makes a difference, and so does being only a “part-time” data scientist.
A little more than a year ago we began a mentoring and team enablement journey with another large tax agency, which has created their first Analytics “Center of Excellence”. Again, we met a team of former auditors, collectors and other types of tax specialists, even a former lawyer, who, tools in hand, were given the challenging task of bringing more data science to the entire agency. We are following our trusted process, teaching by doing – sitting down for countless hours with our client staff and guiding them through the practice of developing models. We explain what is necessary, in the context of our goals, without theorizing on the power of the tools or lecturing on statistics. That said, we are not afraid of “going deep” when necessary, explaining the driving principles of regression, the limitations of decision trees or the virtues of model segmentation. I can report that it is going well, a lot of work is getting done, and they are learning well.
Currently we are also working with two private sector clients, both very significant players in their respective verticals but relatively new to data science. The challenges are not that different, although the “recruits” have somewhat different backgrounds. The process of collaborative development seems to effectively graduate many outstanding citizen data scientists and enable these organizations to make predictive analytics part of their current capabilities.
Two out of three will fail
The previous section is definitely a tale of success, but I purposely left one detail out: typically the members of the starting team in these engagements did not completely overlap with the members of the final team. The truth is that not everyone is fit to turn into a citizen data scientist. Actually, most business analysts and IT people are not likely to succeed in this role, but some will, and they can get really good at it.
It is hard to say what the exact ingredients are that make people fail or succeed, but some are obvious:
- Analytical mindset: the ability to interpret the patterns in the data with business acumen. To question the data if the results defy intuition and business knowledge, but also to be open-minded enough to grasp new and surprising insight when the data supports it.
- Computer skills: even very technical people can get frustrated and inefficient using just Excel. Others just “fly” with it. General computer skills and openness to learn more IT-like stuff are essential.
- Ability to work in a team: a team of early stage practitioners must be able to rely on each other because nobody “knows it all”, to accept criticism and to help others. This does not come naturally to everyone, but we believe it is very important.
- Willingness to learn: the citizen data scientist may not have the educational background in statistics, math and computer science, but he or she must be willing to learn the basics and study what is needed. After all, we are never to compromise on the soundness of the results, and when in doubt the diligent citizen data scientist must have the humility to ask for help from someone who is more experienced.
- A quantitative educational background: this is not contradictory. While the citizen data scientist does not have a Master’s or Ph.D. in Machine Learning, it is desirable that he or she has at least a college degree in business, math, social sciences, engineering, computer science or another quantitative discipline.
Even these “prerequisites” do not guarantee success, and conversely we have seen some very unqualified but highly motivated people succeed (although that is rare). Thus, the trick is to expose as many interested candidates as possible to data mining projects and be patient. You will find your hidden gems.
One more point on this: not all members of an effective data science team must achieve advanced “technical” skills. Some team members will end up becoming the hands-on “doers” and others will be the “thinkers”, better suited to interpreting results and guiding the approach. The important matter is that everyone reaches a solid understanding of the analytical process and what is involved.
Moving to the private sector
In recent years we have had the opportunity to repeat our experience of mentoring home-grown “citizen data scientist” teams in the private sector. We found a few differences and many commonalities, but our approach, centered around collaborative goal-driven projects, remained unchanged.
In the private sector the actors, the candidate citizen data scientists, tended to have a more “technical” background compared to their government sector peers. In many cases they were previously in a Business Intelligence analyst or database developer role, or a similar type of IT function. They understood the business side very well, having worked closely with business stakeholders in past projects. Their “statistical background” was not deeper than what we found in the public sector, but in general their data skills were more advanced and learning the tools of the trade was less of a challenge.
However, private companies seem surprisingly more cautious in committing their best resources to the uncharted territory of advanced analytics. They tend to begin with smaller “proof of concept” projects to convince management that this is the right thing to do, and the steps they take are more timid, at least in our experience. Public sector projects, on the other hand, tend to start with a much more substantial scope, sometimes beyond what can realistically be accomplished by a newly formed team.
Centralized, Distributed or Embedded Team?
This is a very important decision for leaders planning to elevate a cadre of business people or IT analysts to the role of citizen data scientists. Should they consider creating a centralized “center of excellence” (CoE) in advanced analytics serving the various parts of the business? Or, should people be provided the necessary training and skills and then sent back to their functional area to evangelize the use of analytics and lead specific projects? Or, what about a “virtual team”, where everyone continues to belong to their home department, but also spends part of their time with their peers from other functional areas, possibly working on cross-functional projects?
The centralized approach is the most common in government, where the CoE actually tends to reside on the business side of the house, not IT. However, we are also working with a large agency which instead is pursuing the “embedded” approach, with a very limited part of their citizen data science team residing in a central technical functional area (Data Warehousing). Personally, I have found the centralized CoE to work best over the long term. The team is focused on their mission, has time to learn the tools and methodology and to work exclusively on data science projects. Furthermore, the team members can support each other and/or receive mentoring from an external consultant or more experienced team members. The con of this approach is that sometimes the CoE members have a hard time creating demand for projects from the various functional areas, where people are busy running day-to-day operations and often understand little about the potential for data-driven decision making. Also, sometimes the team needs time (and success stories) to gain legitimacy and credibility with the rest of the organization.
With the distributed approach, citizen data scientists are trained on tools, mentored on hands-on projects, and sent back to their daily job armed with new tools knowledge and a good understanding of how analytical methods can be applied in practice. The problem, in my opinion, is indeed their “day job”. It is hard enough to learn the skills of a “data scientist apprentice”, to figure out what “chi-squared” is all about and why models tend to overfit. Doing it part-time only makes the transition that much harder. In some situations this may work, for example if the citizen data scientist is immediately involved in an actual project within his or her organization. But without a specific mandate it is probable that the newly learned skills will soon be forgotten.
The “virtual team” approach is an interesting one, initially pursued by one of our private sector customers. A team of five or six analysts from marketing, finance, operations and IT was assembled, and each person was allowed to use 20 to 40% of his or her time on a project together with team members from other departments. The virtual team had regular weekly meetings, and together worked at identifying potential data science projects. External consultants were brought in to support the initial projects. I find the virtual team approach an improvement over the “distributed team” approach, but not as effective as the CoE approach. Ultimately, the problem remains the part-time nature of the new role, which makes the transition and skills development slower. While team members typically love spending time working together on data science projects, the time available for these is limited and daily responsibilities take precedence.
Selecting the right tools
Something that I find conflicting in the recent trends in data science is that the most popular tools and analytic frameworks do not seem suited for the emerging figure of the Citizen Data Scientist. On one hand the claim is that tools are becoming more accessible to everyone, enabling the citizen data scientist to pursue sophisticated predictive analytics projects, but on the other the de facto toolbox for the modern data scientist seems to be R, Hadoop, MapReduce, Hive and Pig… all very programmatic tools. Thus, it is unclear how people with lightweight technical skills are supposed to get up to speed with data science capabilities when these tools require programming skills typical of a Computer Science graduate. I do not believe these are the type of tools that can work for the citizen data scientist.
I strongly believe that to enable business people to enter the field of advanced analytics the “programming barrier” has to be lifted, at least for them. This is probably why tools like Tableau have taken off in recent years. In all our work we preferred visual programming environments, like IBM SPSS Modeler, KNIME or RapidMiner. These types of tools do not require programming skills, but still require the users to understand the principles of modeling, of course. Spending less time on learning the quirks of a new language, or on programming altogether, definitely helps citizen data scientists focus on the data and the process of understanding it, transforming it and analyzing it. We have seen people with relatively limited computer skills master these tools quite well.
In a recent posting, Gregory Piatetsky has pointed out the danger of semi-automated analytical tools in the hands of inexperienced people. I totally subscribe to this viewpoint and always remind my clients to resist the temptation to use the latest “auto-data-preparation” and auto-modeling features of the tool. Instead, we focus on the basics: carefully designing predictors by injecting domain knowledge into their design, and savoring the patterns that the data reveals instead of relying just on the metrics. We interpret models, looking “inside the box”, instead of being dazzled by their apparent lift curve. We encourage the team to ask questions, dig deeper and question their own work, always.
Data scientists are not perfect either
I am not writing this article to diminish the status of data superhero and Master of the Universe that those of us who claim to be “true” Data Scientists (not mere citizens) have enjoyed over the past few years. Indeed, I believe that any organization that truly wants to become more data-driven (and who does not these days?) and elevate their best people to the “citizen data scientist” role should consider having a few experienced data scientists in the organization to mentor and guide the process.
That said, let’s be honest. Sometimes even people with a fantastic pedigree in machine learning, statistics and computer science get too concerned with the technicalities of the process and too distant from the meaning of the data and the business objective they are trying to reach. Experience matters, and there is no doubt that many data scientists have developed an innate taste for data and patterns, and they can quickly discern between real patterns and data processing aberrations. A neophyte may not see “through” the results of a black-box algorithm, but sometimes even a person with the right background can overlook important details.
I remember my first internship in a direct marketing company, building my very first predictive model to find prospective customers for MCI (some of you may still remember them). After fiddling for a couple of weeks with my first SAS and NeuralWare (still in existence, to my surprise) processes I came out with what looked like an awesome lift curve, way better than my reference model. I ran to my boss’ office super-excited about my achievement (and completely oblivious to the fact that my model included a surrogate of my target) and laid my beautiful lift chart on his table. The kind and experienced fellow looked at me, smiled, and kindly asked me to go check my model... He didn’t even have to look any further. He was right, of course. It took a few years to get that sense of smell for what is meaningful and what is not, and I truly believe that even “slightly technical” people can get there with time and diligence. I have seen it happen many times.
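For readers who want to see what a “surrogate of the target” looks like in practice, here is a minimal sketch in Python. It is not the process I used back then: the field names (past_purchases, welcome_kit_sent), the toy data and the 0.95 correlation threshold are all hypothetical, chosen only to illustrate the kind of sanity check that could have saved me the trip to my boss’ office.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_decile_lift(scores, target):
    """Response rate among the top 10% of scores divided by the overall rate."""
    ranked = sorted(zip(scores, target), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(t for _, t in ranked[:top_n]) / top_n
    overall_rate = sum(target) / len(target)
    return top_rate / overall_rate


def flag_suspicious_predictors(predictors, target, threshold=0.95):
    """Names of predictors almost perfectly correlated with the target.

    The threshold is an arbitrary illustration, not a recommended value.
    """
    return [
        name
        for name, values in predictors.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data: 20 prospects, 4 of whom became customers.
target = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# An honest predictor, known before the campaign.
honest = [5, 1, 2, 3, 1, 2, 4, 1, 0, 2, 3, 1, 6, 0, 2, 1, 2, 4, 1, 0]
# A leaked field, only populated after someone has already signed up.
leaky = list(target)

predictors = {"past_purchases": honest, "welcome_kit_sent": leaky}

if __name__ == "__main__":
    print("Suspicious predictors:", flag_suspicious_predictors(predictors, target))
    print("Lift using the leaked field:", top_decile_lift(leaky, target))
    print("Lift using the honest field:", top_decile_lift(honest, target))
```

On this toy data the leaked field is flagged and produces a far better lift than the honest one, which is exactly the “too good to be true” signal an experienced eye learns to distrust. A real check would also have to look at when each field gets populated relative to the outcome, which no correlation test can tell you.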
While tools have evolved, becoming more broadly available, and data has grown exponentially, “the [core] technology of data science” has been available for well over 20 years. Despite that, after several cycles of hype and bust, adoption is still lagging. The problem seems to remain the lack of qualified data scientists, which, to some extent, has also been due to the lack of specialized educational programs in academia. Thus, for those of us who have been waiting for the real “big wave”, paddling patiently on our algorithm-engraved data miner surf boards, it is not such bad news if, finally, advanced analytics becomes available to more people. It can only drive more demand for expertise.
In conclusion, whether you like or dislike the term “citizen data scientist”, there is evidence that advanced analytics can broaden its reach simply by enabling more people to use it. I believe that as long as data science is restricted to the still relatively small circle of qualified people who can really understand the process and the tools, we will continue to go through cycles of hype and bust for Data Mining, Analytics, Big Data, or whatever you prefer to call the basic idea of using data intelligently to drive decisions. We really need this technology to become available to more people in the business world. Perhaps citizen data scientists are just those people.