Python UDF support #45843

stevenzwu opened this issue May 17, 2024 · 0 comments

Feature request

Is your feature request related to a problem? Please describe.

Just as SQL is everywhere, Python is widely used in the big data and ML communities. In Apache Spark, Python UDFs are among the most popular features. Supporting them could open up StarRocks to many more use cases.

Describe the solution you'd like

Broadly there are two types of use cases.

  1. Well-tested and widely used UDFs within an org. These stable UDFs can be baked into the image and preloaded from a predefined folder during cluster initialization.
  2. Experimental UDFs - something exploratory or under development. It is undesirable to rebuild a Docker image every time users want to try a new UDF to explore the data; that carries heavy admin overhead and a long turnaround time. Users should be able to dynamically create/register Python UDFs via SQL DDL.

Python is a dynamic language, which makes it possible to dynamically load Python code.

  1. Inline code - Python code can be supplied directly in the CREATE FUNCTION SQL statement.
  2. Remote code - Python code can be fetched from remote storage (like S3 or an HTTP endpoint). The CREATE FUNCTION statement specifies the location of the remote storage. Both options are sketched below.
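
For illustration only, a CREATE FUNCTION DDL for the two options could look roughly like the sketch below. The keywords and properties (type, symbol, file, the $$ delimiters) are hypothetical placeholders, not actual StarRocks syntax.

    -- Option 1: inline code. The Python source is embedded in the DDL statement itself.
    CREATE FUNCTION py_add(INT, INT) RETURNS INT
    PROPERTIES (
        "type"   = "python",
        "symbol" = "py_add"   -- name of the Python function to invoke
    )
    AS $$
    def py_add(a, b):
        return a + b
    $$;

    -- Option 2: remote code. Only the location is recorded; the module is
    -- fetched from remote storage (S3, HTTP, ...) when the UDF is loaded.
    CREATE FUNCTION py_upper(VARCHAR) RETURNS VARCHAR
    PROPERTIES (
        "type"   = "python",
        "symbol" = "to_upper",
        "file"   = "s3://my-bucket/udfs/text_utils.py"
    );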

Dynamically loaded code doesn't need to be persisted and reloaded during cluster redeployment (like an upgrade).

Python code execution should be contained within the StarRocks cluster (e.g., as sidecar Python processes on the BE nodes).
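
For illustration only, a BE-side sidecar evaluator could look roughly like the minimal sketch below, assuming a line-delimited JSON request/response protocol over stdin/stdout. The protocol, names, and module layout are assumptions for this sketch, not an actual StarRocks design.

    # udf_sidecar.py - toy sketch of a sidecar UDF evaluator (assumed protocol).
    # The BE sends one JSON object per line: {"udf": "<name>", "args": [...]}
    # and reads back one JSON object per line: {"result": ...} or {"error": "..."}.
    import json
    import sys

    # Registry of loaded UDFs; in a real design this would be populated from
    # inline code or from modules fetched from remote storage.
    UDFS = {
        "py_add": lambda a, b: a + b,
    }

    def main():
        for line in sys.stdin:
            request = json.loads(line)
            try:
                func = UDFS[request["udf"]]
                reply = {"result": func(*request["args"])}
            except Exception as exc:  # report per-row errors instead of crashing the worker
                reply = {"error": str(exc)}
            sys.stdout.write(json.dumps(reply) + "\n")
            sys.stdout.flush()

    if __name__ == "__main__":
        main()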

Describe alternatives you've considered

Executing Python code in an external remote service (outside the StarRocks cluster) would require the additional operational overhead of maintaining another microservice implemented in Python. That Python microservice would also need to handle UDF lifecycle management (create, delete, show, etc.). It would be a detached experience compared to standard SQL DDLs.

Additional context

Below are more advanced features beyond the MVP described above.

Python packages

Some widely used packages can be pre-baked into the image. Users can also dynamically load Python packages as part of the CREATE FUNCTION statement for exploratory purposes.
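
As a purely hypothetical illustration, per-function package requirements could be declared in the DDL along these lines (the packages property and its format are assumptions, not proposed syntax):

    -- Hypothetical: declare extra pip packages the UDF depends on, so the BE-side
    -- Python environment can resolve them before loading the function.
    CREATE FUNCTION py_sentiment(VARCHAR) RETURNS DOUBLE
    PROPERTIES (
        "type"     = "python",
        "symbol"   = "sentiment",
        "file"     = "s3://my-bucket/udfs/sentiment.py",
        "packages" = "numpy==1.26.4, scikit-learn"
    );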

Python version

To start with, we can assume a single StarRocks cluster only supports one Python version. All Python UDF code and dependent packages need to be compatible with the supported version. We will need to figure out the Python version evolution story.
