Abstract: We talk about machine learning every day. But do you know what machine learning practitioners actually do?
This article is the first in a series. Part 2 will cover AutoML and neural architecture search, and Part 3 will look specifically at Google's AutoML.
Media headlines frequently feature the scarcity of machine learning talent, alongside companies promising to automate machine learning and eliminate the need for ML expertise entirely. In his keynote at the TensorFlow DevSummit, Jeff Dean, who leads AI at Google, estimated that tens of millions of organizations have data they could use for machine learning but lack the necessary expertise and skills. Because my work at fast.ai focuses on getting more people to use machine learning and making it easier to use, I pay close attention to this talent shortage.
When we talk about automating parts of machine learning work and making the technology accessible to people from broader backgrounds, the first question to ask is: what do machine learning practitioners actually do? Any solution to the scarcity of machine learning expertise needs to answer this question, whether we are deciding what skills to teach, what tools to build, or what to automate.
Building data products is a complex job
While much academic machine learning research focuses almost exclusively on predictive modeling, that is just one piece of what machine learning involves in practice. Properly framing a business problem, collecting and cleaning data, building a model, deploying the results, and then monitoring for changes are intertwined in many ways; it is often hard to work on one part in isolation (or at least without understanding what the other parts require). As Jeremy Howard and his coauthors wrote in "Designing Great Data Products," great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing.
A team from Google led by D. Sculley wrote a classic paper, "Machine Learning: The High-Interest Credit Card of Technical Debt," about the code complexity and technical debt that often arise when using machine learning in practice. The authors identify many system-level interactions, risks, and anti-patterns, including:
1. Glue code: the large volume of supporting code written to get data into and out of general-purpose packages;
2. Pipeline jungles: the system that prepares data in an ML-friendly format can become a jungle of scraping, joining, and sampling steps, often with intermediate file outputs;
3. Re-use of input signals, which can create unintended tight coupling between otherwise disjoint systems;
4. The risk that changes in the external world alter model behavior or input signals in unintended ways that are difficult to monitor.
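As a loose illustration of the glue-code and pipeline-jungle patterns (all names and data here are hypothetical), here is a sketch in which most of the code scrapes, joins, and samples data, and only one function does anything resembling "machine learning":

```python
# A loose illustration (all names hypothetical) of glue code and pipeline
# jungles: most of the code moves and reshapes data; only one step is "ML".

def scrape(raw_rows):
    # Glue: drop malformed rows arriving from an external source.
    return [r for r in raw_rows if "user_id" in r and "value" in r]

def join(events, profiles):
    # Glue: merge two otherwise-disjoint data sources on user_id.
    by_id = {p["user_id"]: p for p in profiles}
    return [{**e, **by_id[e["user_id"]]} for e in events if e["user_id"] in by_id]

def sample(rows, every=2):
    # Glue: downsample to keep the training set tractable.
    return rows[::every]

def train(rows):
    # The only "ML" step: a trivial mean predictor stands in for a real model.
    values = [r["value"] for r in rows]
    return sum(values) / len(values)

events = [{"user_id": 1, "value": 2.0}, {"user_id": 2, "value": 4.0}, {"bad": True}]
profiles = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "FR"}]
model = train(sample(join(scrape(events), profiles), every=2))
```

Even in this toy pipeline, the `train` step is one function out of four; in real systems the imbalance is far more extreme.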
The authors write: "A remarkable portion of real-world 'machine learning' work is devoted to tackling issues of this form... It's worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause of overly separated 'research' and 'engineering' roles... It may be surprising to the academic community to know that only a tiny fraction of the code in many machine learning systems is actually doing 'machine learning'."
When machine learning projects fail
In my work in machine learning, I have seen projects fail in the workplace in the following ways:
1. The data science team builds something really cool that never gets used. There is no buy-in from the rest of the organization for what they are working on, and some data scientists have a poor sense of what can realistically be put into production.
2. There is a backlog: data scientists produce models much faster than there is engineering support to put them into production.
3. The data infrastructure engineers are siloed from the data scientists. The pipelines do not contain the data the data scientists are asking for, and the data scientists are not using the data sources the infrastructure engineers are collecting.
4. The company has already decided on feature/product X and needs data scientists to gather data that supports this decision. The data scientists feel the product manager is ignoring data that contradicts the decision; the product manager feels the data scientists are ignoring other business logic.
5. The data scientists are overqualified for their work: the data science team interviews candidates with impressive mathematical modeling and engineering skills, but once hired, they join a vertical product team that only needs simple business analytics.
I previously described these as organizational failures, but they can also be seen as failures to treat the data product as one complete, complex system: failures of communication and goal alignment between the different parts of the data product pipeline.
So what do machine learning practitioners actually do?
As mentioned above, building machine learning products is a multifaceted and complex task. Here are some things that machine learning practitioners may need to do in the process:
Understand the context:
1. Identify business areas that can benefit from machine learning;
2. Communicate with other stakeholders about what machine learning is and is not capable of (there are often many misconceptions);
3. Understand the business strategy, risks, and goals, and make sure everyone is on the same page;
4. Determine what data the organization has;
5. Appropriately frame and scope the task;
6. Understand operational constraints (for example, what data is actually available at inference time);
7. Proactively identify ethical risks, including how the work could be weaponized by harassers or used for propaganda/disinformation campaigns (and plan how to mitigate those risks);
8. Identify potential biases and potential negative feedback loops.
Data:
1. Make plans to collect more and different data;
2. Combine data from many different sources, often collected in different formats or with inconsistent conventions;
3. Handle missing or corrupted data;
4. Visualize the data;
5. Create appropriate training, validation, and test sets;
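As a minimal sketch of two of these data steps (all names and data are illustrative), the following imputes missing values with the column mean and makes a shuffled train/validation/test split. Note that for time-series data you would split by time rather than shuffle.

```python
import random

# Illustrative sketch: mean-imputation of missing values, plus a shuffled
# train/validation/test split. For time series, split by time instead.

def impute_missing(rows, key):
    """Replace missing entries with the mean of the observed values."""
    observed = [r[key] for r in rows if r.get(key) is not None]
    fill = sum(observed) / len(observed)
    return [dict(r, **{key: r[key] if r.get(key) is not None else fill})
            for r in rows]

def split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    rows = rows[:]                        # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[n_test + n_val:],        # training set
            rows[n_test:n_test + n_val],  # validation set
            rows[:n_test])                # test set

data = [{"x": 1.0}, {"x": None}, {"x": 3.0}, {"x": 5.0}, {"x": None}]
clean = impute_missing(data, "x")         # missing x filled with the mean, 3.0
train_set, val_set, test_set = split(clean)
```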
Modeling:
1. Choose which model to use;
2. Fit the model's resource requirements within your constraints (for example, does the finished model need to run on an edge device, or in a low-memory or high-latency environment?);
3. Choose hyperparameters (in the case of deep learning, this includes selecting an architecture, loss function, and optimizer);
4. Train the model (and debug why training is not working), which may involve:
4.1 Adjusting hyperparameters (e.g. the learning rate);
4.2 Outputting intermediate results to see how loss, training error, and validation error change over time;
4.3 Inspecting the data the model gets wrong to look for patterns;
4.4 Identifying underlying errors or issues with the data;
4.5 Realizing you need to change how you clean and preprocess the data;
4.6 Realizing you need more or different data augmentation;
4.7 Realizing you need more or different data;
4.8 Trying different models;
4.9 Determining whether your model is underfitting or overfitting;
Productionize:
1. Create an API or web app with your model as an endpoint in order to productionize it;
2. Export the model to the required format;
3. Plan how often the model will need to be retrained with updated data;
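As a minimal sketch of the export-and-serve steps (in practice the export format might be ONNX or TorchScript and the endpoint a real web framework; here a JSON file and a plain function stand in, and all names are illustrative):

```python
import json

# Illustrative sketch: "export" a model to JSON, reload it, and wrap it in
# an endpoint-style function that parses a request and returns a response.

def export_model(weights, path):
    with open(path, "w") as f:
        json.dump({"weights": weights, "version": 1}, f)

def load_model(path):
    with open(path) as f:
        return json.load(f)

def predict_endpoint(model, request_body):
    """What an API handler does: parse the request, run the model, return JSON."""
    x = json.loads(request_body)["x"]
    y = sum(w * v for w, v in zip(model["weights"], x))
    return json.dumps({"prediction": y, "model_version": model["version"]})

export_model([0.5, -1.0], "model.json")
model = load_model("model.json")
response = predict_endpoint(model, '{"x": [2.0, 1.0]}')
```

Including a version field in the exported artifact makes it possible to tell which model produced a given prediction once retraining begins.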
Monitor:
1. Track changes in model performance over time;
2. Monitor the input data to determine whether it shifts over time in ways that would invalidate the model;
3. Communicate your results to the rest of the organization;
4. Develop a plan for monitoring and responding to errors or unexpected consequences.
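The input-monitoring step can be sketched with a very simple drift check (real systems use richer statistics such as per-feature KS tests; the threshold and data here are illustrative): flag when the mean of live inputs moves too far from the training distribution.

```python
import statistics

# Illustrative sketch of input-drift monitoring: alert when the mean of
# live feature values is many training-set standard deviations away.

def drift_alert(train_values, live_values, threshold=3.0):
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    z = abs(statistics.mean(live_values) - mu) / sigma
    return z > threshold

train_values = [10.0, 11.0, 9.0, 10.5, 9.5]
ok = drift_alert(train_values, [10.2, 9.8, 10.1])        # looks like training data
shifted = drift_alert(train_values, [25.0, 26.0, 24.5])  # distribution has moved
```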
Of course, not every machine learning practitioner needs to do all of the above, but components of this process are part of many machine learning applications. Even if you only work on a subset of these steps, familiarity with the rest of the process will help ensure you are not overlooking considerations that could derail your project!
The two most difficult parts of machine learning
For me and for many others I know, these are the two most time-consuming and frustrating aspects of machine learning (particularly deep learning):
1. Dealing with data formats, inconsistencies, and errors is often a messy and tedious process.
2. Training deep learning models is a notoriously brittle process.
Is cleaning data really part of ML? Yes.
Dealing with data formats, inconsistencies, and errors is often a messy and tedious process. People sometimes describe machine learning as separate from data science, as though in machine learning you could start from a perfectly clean, formatted dataset. In my experience, however, the processes of cleaning a dataset and training a model are usually interwoven: I frequently find issues during model training that send me back to change the preprocessing of the input data.
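A small example of the kind of inconsistency that surfaces mid-project (formats and values here are illustrative): the same date field arrives in several conventions, and the list of known formats grows each time training or error analysis exposes a new variant.

```python
from datetime import datetime

# Illustrative sketch: normalize a field that arrives in mixed conventions.
# The FORMATS list typically grows as new variants are discovered mid-project.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]

def parse_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

dates = [parse_date(s) for s in ["2018-07-16", "16/07/2018", "Jul 16, 2018"]]
```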
Training deep learning models is fragile and difficult
The difficulty of training models scares off many beginners, who often end up frustrated. Even experts frequently complain about how frustrating and fickle the training process can be. An AI researcher at Stanford told me: "I taught a deep learning course and had all the students do their own projects. It was so hard! The students couldn't get their models to train, and we kept saying, 'Well, that's deep learning.'"

Ali Rahimi, an AI researcher with more than a decade of experience and winner of the NIPS 2017 Test of Time award, complained about the brittleness of model training in his NIPS award speech. He asked the audience of AI researchers how many of them had designed a deep network from scratch, built it up from the architecture out, and felt terrible about themselves when it didn't work. Many people raised their hands. For Rahimi, it happens about every three months.

The fact that even AI experts sometimes struggle to train new models means the process has not yet been automated in a way that could be packaged into a general-purpose product. Some of the biggest advances in deep learning will come from discovering more robust training methods. We have already seen progress with techniques such as dropout (randomly zeroing out network units during training with some probability), super-convergence, and transfer learning, all of which make training easier. Through the power of transfer learning, training can be a robust process for problems defined over a sufficiently narrow domain. However, we still have a way to go in making training more robust.
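As a minimal sketch of one of the techniques mentioned above, here is inverted dropout (an illustration, not a framework implementation): during training each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), so inference needs no rescaling.

```python
import random

# Illustrative sketch of inverted dropout: zero each unit with probability
# p during training and scale survivors by 1/(1-p); identity at inference.

def dropout(activations, p=0.5, training=True, rng=random):
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, p=0.5, rng=random.Random(0))  # some units zeroed
infer_out = dropout(acts, training=False)               # identity at inference
```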
For academic researchers
Even if you work on theoretical machine learning research, it is useful to understand the process machine learning practitioners go through on practical problems, as it may give you insight into which research directions are most relevant or impactful.
As Google engineer D. Sculley and his coauthors wrote, technical debt is an issue that engineers and researchers alike need to be aware of: "Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice... Paying down technical debt is not always as exciting as proving a new theorem, but it is a critical part of consistently strong innovation." Developing holistic, elegant solutions for complex machine learning systems is deeply rewarding work.
Now that we have outlined some of the tasks machine learning practitioners do as part of their work, we are ready to evaluate attempts to automate this work. As its name suggests, AutoML is a field focused on automating machine learning, and neural architecture search, a sub-field of AutoML, is currently receiving a great deal of attention.
This article was translated by the Alibaba Yunqi Community (translator: Tiger said eight). The original article is "What do machine learning practitioners actually do?" by Rachel Thomas. This is an abridged translation; for more details, please see the original text.
This article is original content of the Yunqi Community and may not be reproduced without permission.