
[SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation #46611

Closed
wants to merge 11 commits

Conversation

@pan3793 pan3793 (Member) commented May 16, 2024

What changes were proposed in this pull request?

This PR replaces AmIpFilter with a forked implementation and removes the hadoop-yarn-server-web-proxy dependency.

Why are the changes needed?

SPARK-47118 upgraded the Spark built-in Jetty from 10 to 11 and migrated from javax.servlet to jakarta.servlet, which breaks Spark on YARN:

Caused by: java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
    at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
    at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
    at java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
    at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
    ... 38 more
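The root cause is that javax.servlet.Filter and jakarta.servlet.Filter are unrelated types, so a filter compiled against the old API can never satisfy Jetty 11. A minimal Scala sketch of the check Jetty effectively performs (assuming both the Hadoop web-proxy jar and the jakarta servlet API are on the classpath):

    // Hadoop's AmIpFilter implements javax.servlet.Filter, but Jetty 11 only
    // accepts jakarta.servlet.Filter; the two interfaces share nothing but a
    // simple name, so this assignability check fails at startup.
    val legacyFilter = Class.forName(
      "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter")
    val compatible = classOf[jakarta.servlet.Filter].isAssignableFrom(legacyFilter)
    assert(!compatible) // hence the IllegalStateException above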

During the investigation, I found a comment in #31642 (comment):

Agree that in the long term we should either: 1) consider to re-implement the logic in Spark which allows us to get away from server-side dependency in Hadoop ...

This should be a simple and clean way to address the exact issue: we don't need to wait for the Hadoop jakarta.servlet migration, and it also strips a Hadoop dependency.
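For context, the behavior being forked is small: requests that do not come from a known RM proxy host get redirected to the proxy; everything else passes through. A rough sketch of that core logic (simplified; proxyAddresses and proxyUriBase follow the Hadoop original, and this is not the actual forked code):

    import jakarta.servlet.{Filter, FilterChain, ServletRequest, ServletResponse}
    import jakarta.servlet.http.{HttpServletRequest, HttpServletResponse}

    class SketchAmIpFilter(proxyAddresses: Set[String], proxyUriBase: String)
      extends Filter {

      override def doFilter(
          req: ServletRequest, resp: ServletResponse, chain: FilterChain): Unit = {
        val httpReq = req.asInstanceOf[HttpServletRequest]
        val httpResp = resp.asInstanceOf[HttpServletResponse]
        if (!proxyAddresses.contains(httpReq.getRemoteAddr)) {
          // Direct hits on the AM web UI bounce to the RM proxy.
          httpResp.sendRedirect(proxyUriBase + httpReq.getRequestURI)
        } else {
          // Requests arriving via the proxy continue down the filter chain.
          chain.doFilter(req, resp)
        }
      }
    }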

Does this PR introduce any user-facing change?

No. This restores the bootstrap of Spark applications in YARN mode, keeping the same behavior as Spark 3.5 and earlier versions.

How was this patch tested?

UTs are added (refer to org.apache.hadoop.yarn.server.webproxy.amfilter.TestAmFilter).
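As a flavor of what such tests assert, a hedged Mockito sketch against the toy SketchAmIpFilter above (not the actual ported suite):

    import jakarta.servlet.http.{HttpServletRequest, HttpServletResponse}
    import org.mockito.Mockito.{mock, verify, when}

    val req = mock(classOf[HttpServletRequest])
    val resp = mock(classOf[HttpServletResponse])
    when(req.getRemoteAddr).thenReturn("10.0.0.99") // not a proxy host
    when(req.getRequestURI).thenReturn("/jobs/")
    val filter = new SketchAmIpFilter(Set("10.0.0.1"), "http://rm:8088/proxy/app_1")
    filter.doFilter(req, resp, null) // chain is unused on the redirect path
    verify(resp).sendRedirect("http://rm:8088/proxy/app_1/jobs/")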

I tested it in a YARN cluster.

Spark successfully started.

root@hadoop-master1:/opt/spark-SPARK-48238# JAVA_HOME=/opt/openjdk-17 bin/spark-sql --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/openjdk-17 --conf spark.executorEnv.JAVA_HOME=/opt/openjdk-17
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2024-05-18 04:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-18 04:11:44 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive} is set, falling back to uploading libraries under SPARK_HOME.
Spark Web UI available at http://hadoop-master1.orb.local:4040
Spark master: yarn, Application Id: application_1716005503866_0001
spark-sql (default)> select version();
4.0.0 4ddc2303c7cbabee12a3de9f674aaacad3f5eb01
Time taken: 1.707 seconds, Fetched 1 row(s)
spark-sql (default)>

When accessing http://hadoop-master1.orb.local:4040, it redirects to http://hadoop-master1.orb.local:8088/proxy/redirect/application_1716005503866_0001/, and the UI looks correct.

[screenshot: Spark Web UI rendered correctly behind the YARN proxy]

Was this patch authored or co-authored using generative AI tooling?

No

@mridulm mridulm (Contributor) commented May 16, 2024

+CC @tgravescs

@HiuKwok HiuKwok (Contributor) commented May 17, 2024

The patch makes sense to me; supposedly the workaround can be removed once the YARN side upgrades its J2EE stack?

@pan3793 pan3793 (Member Author) commented May 17, 2024

... supposedly the workaround can be removed once the YARN side upgrades its J2EE stack?

I suppose not. According to the discussion in #31642, org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter should be part of hadoop-client-api but actually is not, and:

Agree that in the long term we should either: 1) consider to re-implement the logic in Spark which allows us to get away from server-side dependency in Hadoop ...

@pan3793 pan3793 marked this pull request as ready for review May 17, 2024 09:31
@pan3793 pan3793 (Member Author) commented May 17, 2024

import org.apache.spark.internal.Logging

// This class is inspired by org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
Contributor:

Can you clarify this: is it just a copy converted to Scala, or did you add/change functionality?

Contributor:

If the original code is Java, I'd prefer we also add a Java file in Spark, so that it's easier to keep the code in sync.

Member Author:

This is a functionally equivalent rewrite. It simplifies the code by converting it to Scala and merging code from several Hadoop Java classes.
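One of the merged pieces, sketched for illustration (the Hadoop originals are AmIpServletRequestWrapper and AmIpPrincipal; this simplified version is not the actual merged code):

    import java.security.Principal
    import jakarta.servlet.http.{HttpServletRequest, HttpServletRequestWrapper}

    // The principal carrying the user name forwarded by the RM proxy.
    class SketchAmIpPrincipal(name: String) extends Principal {
      override def getName: String = name
    }

    // Wraps proxied requests so the UI sees the end user, not the proxy host.
    class SketchAmIpRequestWrapper(req: HttpServletRequest, principal: Principal)
      extends HttpServletRequestWrapper(req) {
      override def getUserPrincipal: Principal = principal
      override def getRemoteUser: String = principal.getName
      override def isUserInRole(role: String): Boolean = false
    }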

@pan3793 pan3793 (Member Author) May 17, 2024:

I considered porting the code as-is (also keeping the package and class names), as we did in the thriftserver module for hive-service, but it would cause class conflicts because hadoop-client-minicluster (present in test scope) ships those classes too. Meanwhile, I found the existing YarnProxyRedirectFilter, which is much simpler and not coupled with Hadoop classes. So I renamed the class to YarnAMIpFilter, rewrote it in Scala, and put it alongside YarnProxyRedirectFilter. The UT YarnAMIpFilterSuite is ported too, to ensure the correctness of the rewrite.

@cloud-fan if we want to stay in sync with Hadoop upstream, I can convert it back to Java classes and reuse the Hadoop code as much as possible, for example, using org.apache.hadoop.yarn.webapp.hamlet2.Hamlet to construct the HTML page in sendRedirect instead of writing it by hand as we did in YarnProxyRedirectFilter.
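For illustration, a hand-written redirect page in the spirit of YarnProxyRedirectFilter might look like this (hypothetical helper, not the actual Spark code; Hamlet would generate equivalent markup programmatically):

    // Builds the small HTML body served alongside the redirect, pointing
    // users at the proxy URL.
    def redirectPage(target: String): String =
      s"""<html>
         |<head><title>Redirecting to the YARN proxy</title></head>
         |<body>
         |  <p>The Spark UI is served through the YARN proxy.
         |     Continue to <a href="$target">$target</a>.</p>
         |</body>
         |</html>""".stripMargin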

@cloud-fan (Contributor):

This looks like a good idea to get rid of the conflicting dependency. We should clarify #46611 (comment) though.

@pan3793 pan3793 changed the title [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation May 18, 2024
@pan3793 pan3793 (Member Author) commented May 18, 2024

@cloud-fan I updated the code to preserve the original code as much as possible, and also left comments to clarify the modifications.

@@ -696,7 +696,7 @@ private[spark] class ApplicationMaster(

   /** Add the Yarn IP filter that is required for properly securing the UI. */
   private def addAmIpFilter(driver: Option[RpcEndpointRef], proxyBase: String) = {
-    val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
+    val amFilter = "org.apache.spark.deploy.yarn.AmIpFilter"
Contributor:

nit: we can use classOf to avoid hardcoding the class name here.
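The suggested change, roughly (the forked class lives in the same module, so the class literal is visible here):

    val amFilter = classOf[org.apache.spark.deploy.yarn.AmIpFilter].getName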

Member Author:

updated

@@ -86,7 +86,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   }

   // Add Yarn proxy filter specific configurations to the recovered SparkConf
-  val filter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
+  val filter = "org.apache.spark.deploy.yarn.AmIpFilter"
Contributor:

ditto

Member Author:

The class defined in the yarn module is not visible in the streaming module.

Contributor:

Yes, we can't use classOf[AmIpFilter].getName here due to dependency issues.
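In other words (a sketch of the constraint, with the module layout as in the Spark build):

    // The streaming module has no compile-time dependency on the yarn module,
    // so classOf[org.apache.spark.deploy.yarn.AmIpFilter] would not compile
    // here; the fully qualified class name has to stay a string literal.
    val filter = "org.apache.spark.deploy.yarn.AmIpFilter"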

proxyUriBases.put("dummy", conf.getInitParameter(PROXY_URI_BASE));
} else {
proxyHosts = conf.getInitParameter(PROXY_HOSTS)
.split(PROXY_HOSTS_DELIMITER);
Contributor:

nit: Indentation, please fix it globally.

Member Author:

fixed

import org.apache.spark.internal.SparkLogger;
import org.apache.spark.internal.SparkLoggerFactory;

// This class is copied from Hadoop 3.4.0
Contributor:

So, do we need to re-copy these files every time we upgrade Hadoop if there are changes to them?

Member Author:

No. We don't need to always sync it with the Hadoop version.

@LuciferYang LuciferYang (Contributor) left a comment:

+1, LGTM

@LuciferYang (Contributor):

Merged into master for Spark 4.0. Thanks @pan3793 and @cloud-fan

@dongjoon-hyun (Member):

Thank you, @pan3793 and all!
