- Sample-efficient active learning: we need to figure out the right levers to give domain experts so they can inject their expertise into the system and specialize each ML abstraction to their use case; this is crucial for getting good results in each problem domain. This is both a product challenge (how do we design interfaces that maximize the information we gain per unit of cognitive effort the user expends?) and a technical one (how do we minimize the number of learnable parameters we need, and how do we pick the fewest “examples” to query the user to label without sacrificing model quality?). (Related question: what are the best ways to get users to formalize their priors, and for us to encode those priors into our models, across a variety of tasks and model types?)
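As a toy illustration of the query-selection half of this, here's a minimal uncertainty-sampling loop. The logistic-regression model, the `label_fn` callback, and the pool handling are hypothetical stand-ins, not our actual components or interfaces:

```python
# Minimal sketch of an uncertainty-sampling query loop (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_uncertain(model, pool_X, k=10):
    """Pick the k unlabeled examples the model is least sure about (binary task assumed)."""
    probs = model.predict_proba(pool_X)
    margins = np.abs(probs[:, 1] - 0.5)   # distance from the decision boundary
    return np.argsort(margins)[:k]        # smallest margin = most uncertain

def active_learning_loop(labeled_X, labeled_y, pool_X, label_fn, rounds=5, k=10):
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(labeled_X, labeled_y)
        idx = query_most_uncertain(model, pool_X, k)
        new_y = label_fn(pool_X[idx])      # ask the domain expert for labels
        labeled_X = np.vstack([labeled_X, pool_X[idx]])
        labeled_y = np.concatenate([labeled_y, new_y])
        pool_X = np.delete(pool_X, idx, axis=0)
    return model
```

Querying the examples the current model is least sure about tends to extract more information per label than random sampling, which is the basic lever behind "fewest examples without sacrificing quality."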
- Explainability & feedback: arguably even more important than getting good results is building trust around them, which centers around the notion of “closing the feedback loop” — how can we gather input from the user (above), communicate to them what we’re doing in terms of the input they gave us, build confidence that what we’re doing is right, and build the right levers for them to give feedback based on these explanations to refine/modify what we’re doing under the hood. This is a challenge across all parts of the stack, ranging from ML to systems to design. (Related question: how do we detect when the user has likely given us incorrect input, when the data distribution has shifted invalidating their past feedback, etc.?)
- Generalizing to complex data types: as we take on deeper customer use cases, we need to do all of the above over an increasing variety of data types, while preserving tight interactive loops and without sacrificing quality (relative to what our customers need to drive their KPIs) or explainability.
- Distributed systems: our goal is to build human-in-the-loop AI tools that augment users’ domain expertise with ML. We think of these as “power tools”: great UX backed by really complex tech behind the scenes to make it all possible. Part of this vision is raw compute, especially to support larger datasets: we’ll need massive amounts of compute per user to keep the system fast (our rule: every operation must return at least initial results within 10 seconds; users should never wait). For example:
- When users import datasets, we’ll pre-compute relevant information about that dataset (dependent on the datatype: distance matrices for numerical data, embeddings from a range of pre-trained networks for text data, etc.). We’ll likely need to fan this operation out to a large cluster behind the scenes, and then manage that large distributed system.
- We’ll also need to dynamically prioritize this work based on what the user wants to do immediately after uploading a dataset: compute the pre-computed artifacts required to support that operation first, and defer artifacts the user isn’t using just yet. A rough sketch of this kind of prioritized fan-out follows below.
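Here's what prioritized fan-out could look like in miniature. In production this work would go through a distributed task queue across a cluster rather than an in-process thread pool, and the artifact and action names below are purely illustrative:

```python
# Sketch of prioritized fan-out for post-import pre-computation (illustrative names).
import heapq
from concurrent.futures import ThreadPoolExecutor

def plan_precompute(datatype: str, next_user_action: str):
    """Order artifacts so the ones the user needs right away come first."""
    artifacts = {
        "numerical": ["distance_matrix", "summary_stats"],
        "text": ["pretrained_embeddings", "summary_stats"],
    }[datatype]
    needed_now = {"find_similar": "distance_matrix", "search": "pretrained_embeddings"}
    queue = []
    for artifact in artifacts:
        priority = 0 if needed_now.get(next_user_action) == artifact else 1
        heapq.heappush(queue, (priority, artifact))
    return [heapq.heappop(queue)[1] for _ in range(len(queue))]

def fan_out(artifacts, compute_fn, workers=8):
    """Run the artifact jobs in parallel, highest priority first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compute_fn, artifacts))
```

For instance, `plan_precompute("text", "search")` schedules the embedding job ahead of summary statistics, so the user's first query isn't blocked on work they don't need yet.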
- Data structures: we have a bunch of interesting data-structure challenges on the backend in storing and accelerating operations over customer data. For example:
- We currently store all customer datasets in memory in a tabular format with strongly-typed columns, where columns can contain both basic types (int, float, string, date) and complex ones (time series, etc.). As we scale to larger datasets or more users, we obviously can’t keep storing these datasets entirely in memory. However, standard databases don’t really work here, because each dataset has its own schema and can store arbitrarily complex datatypes within each record. What’s the best way to represent and access these datasets?
- Several of our operations learn a custom distance metric over user data and then perform a series of queries (e.g., KNN) under that custom metric. Each of these queries is prohibitively slow to brute-force once data reaches any interesting scale. The typical solution is to build an index, but that requires a fixed distance metric known ahead of time. How do we build index-like acceleration structures that can handle changing distance metrics? (One well-known partial reduction, for metrics of the Mahalanobis family, is sketched below.)
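If the learned metric is (or can be approximated by) a Mahalanobis distance d(x, y) = ||Lx - Ly||_2, then a change of metric is just a change of the linear map L, and any off-the-shelf Euclidean index works on the transformed points; the index is rebuilt whenever L changes. The ball-tree index below is an illustrative stand-in for whatever nearest-neighbor structure we'd actually use, and this sketch does not yet address the harder question of avoiding a full rebuild:

```python
# Sketch: reduce KNN under a learned Mahalanobis metric to Euclidean indexing.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_metric_index(X: np.ndarray, L: np.ndarray):
    """Index the data under the metric induced by L (rebuilt whenever L changes)."""
    Xt = X @ L.T  # map points into the space where the learned metric is Euclidean
    return NearestNeighbors(algorithm="ball_tree").fit(Xt)

def knn_under_metric(index, L: np.ndarray, queries: np.ndarray, k: int = 10):
    """KNN for query points under the same learned metric."""
    distances, neighbors = index.kneighbors(queries @ L.T, n_neighbors=k)
    return distances, neighbors
```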
- Real-time ML training & inference: we’re going to be training ML models and running inference over them constantly: re-training in real time to incorporate user feedback (our needle-in-a-haystack module is an example of this), running inference over the entire dataset every time the user wants to see or update results, running inference on new datasets (both to pre-compute useful artifacts, as discussed above, and to power the “apps” part of our platform that lets users connect to a database and run the models they just trained on this new data), and so on. How do we do this efficiently, so that we show results to the user quickly while keeping our own cloud costs down? A minimal sketch of incremental re-training is below.
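As one example of folding user feedback into a model without retraining from scratch, here's a minimal warm-start sketch using scikit-learn's `partial_fit`. Our actual models and serving path are more involved; the classifier choice and batch sizes here are only illustrative:

```python
# Sketch: incremental updates from user feedback plus batched re-scoring.
import numpy as np
from sklearn.linear_model import SGDClassifier

def incorporate_feedback(model, feedback_X, feedback_y, all_classes=(0, 1)):
    """Fold one batch of user feedback into the model without a full retrain."""
    if not hasattr(model, "classes_"):
        # the full set of classes must be declared on the first incremental call
        model.partial_fit(feedback_X, feedback_y, classes=np.asarray(all_classes))
    else:
        model.partial_fit(feedback_X, feedback_y)
    return model

def refresh_results(model, dataset_X, batch_size=100_000):
    """Re-score the whole dataset in batches so initial results show up quickly."""
    for start in range(0, len(dataset_X), batch_size):
        yield model.predict_proba(dataset_X[start:start + batch_size])

# loss="log_loss" ("log" in older scikit-learn versions) supports both
# partial_fit and predict_proba.
model = SGDClassifier(loss="log_loss")
```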