Terminology
Category: An atomic label from a taxonomy that describes the function or meaning of a single code instruction.
Inconsistency: A disagreement between profiling functions regarding the category of a given instruction. These are tracked and analyzed to resolve conflicts or mark them as undetermined.
Notebook: A Jupyter Notebook for ML written in Python; it serves as the system’s primary input.
MLProfile: The structured output format representing a profiled notebook. It is targeted by profiling functions (e.g., LLMs, parsers). MLProfiles are encoded in JSON, conforming to the MLProfile metamodel.
MLProfile-MM (Metamodel):
A shared, structured data model that defines the schema for MLProfiles. It ensures interoperability and consistency across different profiling functions.
Pattern: A reusable template that defines a specific profile structure. Patterns can be used to filter and compare profiles across notebooks.
Profiling Bundle: The data structure that aggregates all information related to the profiling of a notebook: the raw profile, individual profiles, reference profile, and detected inconsistencies.
Profiling Function: A specific method or tool that assigns a category (from a taxonomy) to each code instruction in the notebook. These functions can rely on static parsing, LLMs, or other heuristics.
Profile Registry: A system-wide structure that maintains a collection of profiled notebooks and their associated profiling bundles, allowing querying, comparison, and analysis at scale.
Raw Profile: An Abstract Syntax Tree (AST) enhanced with metadata to preserve the original code and its location within the notebook.
Reference Profile: A unified profile representing a coherent notebook summary derived from the output of multiple profiling functions. It resolves inconsistencies when possible and may contain undetermined mappings where ambiguity remains.
Step: A special kind of category within a taxonomy used to describe the high-level objective of an instruction in a machine learning workflow. Examples include: Data Preparation
, Training
, Evaluation
, etc.
Taxonomy: A defined set of categories used to semantically categorize code instructions according to their purpose or behavior in an ML pipeline.
Formalized definitions of these terms are available in Formalization.