
[SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation #46611

Closed
wants to merge 11 commits

Conversation

@pan3793 pan3793 (Member) commented May 16, 2024

What changes were proposed in this pull request?

This PR replaces AmIpFilter with a forked implementation and removes the hadoop-yarn-server-web-proxy dependency.

Why are the changes needed?

SPARK-47118 upgraded the Spark built-in Jetty from 10 to 11 and migrated from javax.servlet to jakarta.servlet, which breaks Spark on YARN:

Caused by: java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
    at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
    at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
    at java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
    at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
    ... 38 more
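The root cause is that javax.servlet.Filter and jakarta.servlet.Filter are unrelated types, so a filter compiled against the old API can never satisfy Jetty 11. A minimal Scala sketch of the check Jetty effectively performs (assuming both the Hadoop web-proxy jar and the jakarta servlet API are on the classpath):

    // Hadoop's AmIpFilter implements javax.servlet.Filter, but Jetty 11 only
    // accepts jakarta.servlet.Filter; the two interfaces share nothing but a
    // simple name, so this assignability check fails at startup.
    val legacyFilter = Class.forName(
      "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter")
    val compatible = classOf[jakarta.servlet.Filter].isAssignableFrom(legacyFilter)
    assert(!compatible) // hence the IllegalStateException above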

During the investigation, I found a comment in #31642 (comment):

Agree that in the long term we should either: 1) consider to re-implement the logic in Spark which allows us to get away from server-side dependency in Hadoop ...

This should be a simple and clean way to address the exact issue: we don't need to wait for the Hadoop jakarta.servlet migration, and it also strips a Hadoop dependency.
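For context, the behavior being forked is small: requests that do not come from a known RM proxy host get redirected to the proxy; everything else passes through. A rough sketch of that core logic (simplified; proxyAddresses and proxyUriBase follow the Hadoop original, and this is not the actual forked code):

    import jakarta.servlet.{Filter, FilterChain, ServletRequest, ServletResponse}
    import jakarta.servlet.http.{HttpServletRequest, HttpServletResponse}

    class SketchAmIpFilter(proxyAddresses: Set[String], proxyUriBase: String)
      extends Filter {

      override def doFilter(
          req: ServletRequest, resp: ServletResponse, chain: FilterChain): Unit = {
        val httpReq = req.asInstanceOf[HttpServletRequest]
        val httpResp = resp.asInstanceOf[HttpServletResponse]
        if (!proxyAddresses.contains(httpReq.getRemoteAddr)) {
          // Direct hits on the AM web UI bounce to the RM proxy.
          httpResp.sendRedirect(proxyUriBase + httpReq.getRequestURI)
        } else {
          // Requests arriving via the proxy continue down the filter chain.
          chain.doFilter(req, resp)
        }
      }
    }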

Does this PR introduce any user-facing change?

No. This restores the bootstrap of Spark applications in YARN mode, keeping the same behavior as Spark 3.5 and earlier versions.

How was this patch tested?

UTs are added (refer to org.apache.hadoop.yarn.server.webproxy.amfilter.TestAmFilter).
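As a flavor of what such tests assert, a hedged Mockito sketch against the toy SketchAmIpFilter above (not the actual ported suite):

    import jakarta.servlet.http.{HttpServletRequest, HttpServletResponse}
    import org.mockito.Mockito.{mock, verify, when}

    val req = mock(classOf[HttpServletRequest])
    val resp = mock(classOf[HttpServletResponse])
    when(req.getRemoteAddr).thenReturn("10.0.0.99") // not a proxy host
    when(req.getRequestURI).thenReturn("/jobs/")
    val filter = new SketchAmIpFilter(Set("10.0.0.1"), "http://rm:8088/proxy/app_1")
    filter.doFilter(req, resp, null) // chain is unused on the redirect path
    verify(resp).sendRedirect("http://rm:8088/proxy/app_1/jobs/")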

I tested it in a YARN cluster.

Spark successfully started.

root@hadoop-master1:/opt/spark-SPARK-48238# JAVA_HOME=/opt/openjdk-17 bin/spark-sql --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/openjdk-17 --conf spark.executorEnv.JAVA_HOME=/opt/openjdk-17
WARNING: Using incubator modules: jdk.incubator.vector
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2024-05-18 04:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-18 04:11:44 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive} is set, falling back to uploading libraries under SPARK_HOME.
Spark Web UI available at http://hadoop-master1.orb.local:4040
Spark master: yarn, Application Id: application_1716005503866_0001
spark-sql (default)> select version();
4.0.0 4ddc2303c7cbabee12a3de9f674aaacad3f5eb01
Time taken: 1.707 seconds, Fetched 1 row(s)
spark-sql (default)>

When accessing http://hadoop-master1.orb.local:4040, it redirects to http://hadoop-master1.orb.local:8088/proxy/redirect/application_1716005503866_0001/, and the UI looks correct.

[screenshot: Spark Web UI rendered correctly behind the YARN proxy]

Was this patch authored or co-authored using generative AI tooling?

No

@mridulm mridulm (Contributor) commented May 16, 2024

+CC @tgravescs

@HiuKwok HiuKwok (Contributor) commented May 17, 2024

The patch makes sense to me; supposedly the workaround can be removed once the YARN side upgrades its J2EE stack?

@pan3793 pan3793 (Member Author) commented May 17, 2024

... supposedly the workaround can be removed once the YARN side upgrades its J2EE stack?

I suppose not. According to the discussion in #31642, org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter should be part of hadoop-client-api but actually is not, and:

Agree that in the long term we should either: 1) consider to re-implement the logic in Spark which allows us to get away from server-side dependency in Hadoop ...

@pan3793 pan3793 marked this pull request as ready for review May 17, 2024 09:31
@pan3793 pan3793 (Member Author) commented May 17, 2024

import org.apache.spark.internal.Logging

// This class is inspired by org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
Contributor:

Can you clarify this: is it just a copy converted to Scala, or did you add/change functionality?

Contributor:

If the original code is Java, I'd prefer we also add a Java file in Spark, so that it's easier to keep the code in sync.

Member Author:

This is a functionally equivalent rewrite. It simplifies the code by converting it to Scala and merging code from several Hadoop Java classes.
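One of the merged pieces, sketched for illustration (the Hadoop originals are AmIpServletRequestWrapper and AmIpPrincipal; this simplified version is not the actual merged code):

    import java.security.Principal
    import jakarta.servlet.http.{HttpServletRequest, HttpServletRequestWrapper}

    // The principal carrying the user name forwarded by the RM proxy.
    class SketchAmIpPrincipal(name: String) extends Principal {
      override def getName: String = name
    }

    // Wraps proxied requests so the UI sees the end user, not the proxy host.
    class SketchAmIpRequestWrapper(req: HttpServletRequest, principal: Principal)
      extends HttpServletRequestWrapper(req) {
      override def getUserPrincipal: Principal = principal
      override def getRemoteUser: String = principal.getName
      override def isUserInRole(role: String): Boolean = false
    }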

@pan3793 pan3793 (Member Author) May 17, 2024:

I considered porting the code as-is (also keeping the package and class names), as we did in the thriftserver module for hive-service, but it would cause class conflicts because hadoop-client-minicluster (present in test scope) ships those classes too. Meanwhile, I found the existing YarnProxyRedirectFilter, which is much simpler and not coupled with Hadoop classes. So I renamed the class to YarnAMIpFilter, rewrote it in Scala, and put it alongside YarnProxyRedirectFilter. The UT YarnAMIpFilterSuite is ported too, to ensure the correctness of the rewrite.

@cloud-fan if we want to stay in sync with Hadoop upstream, I can convert it back to Java classes and reuse the Hadoop code as much as possible, for example, using org.apache.hadoop.yarn.webapp.hamlet2.Hamlet to construct the HTML page in sendRedirect instead of writing it by hand as we did in YarnProxyRedirectFilter.
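For illustration, a hand-written redirect page in the spirit of YarnProxyRedirectFilter might look like this (hypothetical helper, not the actual Spark code; Hamlet would generate equivalent markup programmatically):

    // Builds the small HTML body served alongside the redirect, pointing
    // users at the proxy URL.
    def redirectPage(target: String): String =
      s"""<html>
         |<head><title>Redirecting to the YARN proxy</title></head>
         |<body>
         |  <p>The Spark UI is served through the YARN proxy.
         |     Continue to <a href="$target">$target</a>.</p>
         |</body>
         |</html>""".stripMargin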

@cloud-fan (Contributor):

This looks like a good idea to get rid of the conflicting dependency. We should clarify #46611 (comment) though.

@pan3793 pan3793 changed the title [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation May 18, 2024
@pan3793 pan3793 (Member Author) commented May 18, 2024

@cloud-fan I updated the code to preserve the original code as much as possible, and also left comments to clarify the modifications.

@@ -696,7 +696,7 @@ private[spark] class ApplicationMaster(

   /** Add the Yarn IP filter that is required for properly securing the UI. */
   private def addAmIpFilter(driver: Option[RpcEndpointRef], proxyBase: String) = {
-    val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
+    val amFilter = "org.apache.spark.deploy.yarn.AmIpFilter"
Contributor:

nit: we can use classOf to avoid hardcoding the class name here.
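The suggested change, roughly (the forked class lives in the same module, so the class literal is visible here):

    val amFilter = classOf[org.apache.spark.deploy.yarn.AmIpFilter].getName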

Member Author:

updated

@@ -86,7 +86,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   }

   // Add Yarn proxy filter specific configurations to the recovered SparkConf
-  val filter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
+  val filter = "org.apache.spark.deploy.yarn.AmIpFilter"
Contributor:

ditto

Member Author:

The class defined in the yarn module is not visible in the streaming module.

Contributor:

Yes, we can't use classOf[AmIpFilter].getName here due to dependency issues.
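In other words (a sketch of the constraint, with the module layout as in the Spark build):

    // The streaming module has no compile-time dependency on the yarn module,
    // so classOf[org.apache.spark.deploy.yarn.AmIpFilter] would not compile
    // here; the fully qualified class name has to stay a string literal.
    val filter = "org.apache.spark.deploy.yarn.AmIpFilter"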

proxyUriBases.put("dummy", conf.getInitParameter(PROXY_URI_BASE));
} else {
proxyHosts = conf.getInitParameter(PROXY_HOSTS)
.split(PROXY_HOSTS_DELIMITER);
Contributor:

nit: Indentation, please fix it globally.

Member Author:

fixed

import org.apache.spark.internal.SparkLogger;
import org.apache.spark.internal.SparkLoggerFactory;

// This class is copied from Hadoop 3.4.0
Contributor:

So, do we need to re-copy these files every time we upgrade Hadoop if there are changes to them?

Member Author:

No. We don't need to always sync it with the Hadoop version.

@LuciferYang LuciferYang (Contributor) left a comment:

+1, LGTM

@LuciferYang (Contributor):

Merged into master for Spark 4.0. Thanks @pan3793 and @cloud-fan

@dongjoon-hyun (Member):

Thank you, @pan3793 and all!
