How to convert categorical data to numerical data in PySpark

Published on Author Code Father
This can be done with StringIndexer in PySpark; the reverse mapping (numeric index back to the original string) can be done with IndexToString. For reference, see this example:

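from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

# SparkSession is the DataFrame entry point in Spark 2.0+ (replaces the older sqlContext)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
# By default the most frequent label gets index 0.0, the next most frequent 1.0, and so on
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()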

For more details, see the Spark documentation.
