pyspark.sql.functions.from_xml
- pyspark.sql.functions.from_xml(col, schema, options=None)
- Parses a column containing an XML string into a row with the specified schema. Returns null if the string cannot be parsed.
- New in version 4.0.0.
- Parameters
- col : Column or str
- a column or column name containing XML-formatted strings
- schema : StructType, Column or str
- a StructType, a Column, or a Python string literal containing a DDL-formatted string, used when parsing the XML column
- options : dict, optional
- options to control parsing. Accepts the same options as the XML data source. See Data Source Option for the version you use.
 
- Returns
- Column
- a new column of complex type parsed from the given XML string.
 
- Examples

Example 1: Parsing XML with a DDL-formatted string schema

>>> import pyspark.sql.functions as sf
>>> data = [(1, '''<p><a>1</a></p>''')]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> # Define the schema using a DDL-formatted string
>>> schema = "STRUCT<a: BIGINT>"
>>> # Parse the XML column using the DDL-formatted schema
>>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
[Row(xml=Row(a=1))]

Example 2: Parsing XML with ArrayType in schema

>>> import pyspark.sql.functions as sf
>>> data = [(1, '<p><a>1</a><a>2</a></p>')]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> # Define the schema with an Array type
>>> schema = "STRUCT<a: ARRAY<BIGINT>>"
>>> # Parse the XML column using the schema with an Array
>>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
[Row(xml=Row(a=[1, 2]))]

Example 3: Parsing XML using pyspark.sql.functions.schema_of_xml()

>>> import pyspark.sql.functions as sf
>>> # Sample data with an XML column
>>> data = [(1, '<p><a>1</a><a>2</a></p>')]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> # Generate the schema from an example XML value
>>> schema = sf.schema_of_xml(sf.lit(data[0][1]))
>>> # Parse the XML column using the generated schema
>>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
[Row(xml=Row(a=[1, 2]))]