
Spark Dataset Operations (3): Grouping, Aggregation, and Sorting

[Date: 2017-09-29]  Source: csdn  Author: 野男孩


The previous post promised that the next topic would be grouping and aggregation. There's a lot to cover and not much time, so let's skip the chatter and dive in.

We'll reuse the dataset built in the previous post:

scala> val df = spark.createDataset(Seq(
  ("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4,6))).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

Let's start with the simplest case, a grouped count:

/*
Equivalent SQL: select key1, count(*) from table group by key1
*/
scala> df.groupBy("key1").count.show
+----+-----+
|key1|count|
+----+-----+
| ccc|    1|
| aaa|    1|
| bbb|    2|
+----+-----+

Note that the count above is not the total number of records; it is the per-group count produced by the groupBy aggregation. If what you want is the number of distinct groups, use the code below. For this sample dataset the answer is 3:

/*
Equivalent SQL: select distinct key1 from table
*/
scala> df.select("key1").distinct.show
+----+
|key1|
+----+
| ccc|
| aaa|
| bbb|
+----+

/*
Equivalent SQL:
select count(distinct key1) from table
*/
scala> df.select("key1").distinct.count
res3: Long = 3
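The same answer can be had in a single aggregation with the built-in countDistinct function. A minimal sketch follows; the SparkSession setup is only there to make the snippet self-contained outside the spark-shell:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
import spark.implicits._

// Same sample data as above
val df = Seq(("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4,6)).toDF("key1","key2","key3")

// Equivalent SQL: select count(distinct key1) from table
// countDistinct folds the distinct + count into one aggregation,
// avoiding the intermediate distinct DataFrame:
df.agg(countDistinct("key1")).show
```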

In the output above, if you're as fussy as I am, you'll have noticed that key1 isn't sorted. Unacceptable. Fixed as follows:

/*
Equivalent SQL:
select key1, count(*) from table
group by key1
order by key1
*/

scala> df.groupBy("key1").count.sort("key1").show
+----+-----+
|key1|count|
+----+-----+
| aaa|    1|
| bbb|    2|
| ccc|    1|
+----+-----+

/*
Equivalent SQL:
select key1, count(*) from table
group by key1
order by key1 desc
*/

scala> df.groupBy("key1").count.sort($"key1".desc).show
+----+-----+
|key1|count|
+----+-----+
| ccc|    1|
| bbb|    2|
| aaa|    1|
+----+-----+

Note that the first query sorts ascending and the second descending, matching "select key1, count(*) from table group by key1 order by key1 desc". When specifying desc, key1 is written with the $ prefix, which is the shorthand for col(columnName) covered in an earlier post, so we won't explain it again here.
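As a quick aside, $"key1" and col("key1") are interchangeable ways to build a Column, so the descending sort can be spelled either way. A sketch (session setup included only to make it self-contained):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
import spark.implicits._

val df = Seq(("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4,6)).toDF("key1","key2","key3")

// Two equivalent spellings of the same descending sort:
df.groupBy("key1").count.sort(col("key1").desc).show  // explicit col()
df.groupBy("key1").count.sort($"key1".desc).show      // $ interpolator from spark.implicits._
```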

Let's refine this further. So far we've ordered by the key; what if we want to order by the group counts, descending? From the earlier show output, the count column is simply named count, so the following comes naturally:

/*
Equivalent SQL:
select key1, count(*) from table
group by key1
order by count(*) desc
*/
scala> df.groupBy("key1").count.sort($"count".desc).show
+----+-----+
|key1|count|
+----+-----+
| bbb|    2|
| ccc|    1|
| aaa|    1|
+----+-----+

Or rename the column with withColumnRenamed:

/*
Equivalent SQL:
select key1, count(*) as cnt from table
group by key1
order by cnt desc
*/
scala> df.groupBy("key1").count.withColumnRenamed("count", "cnt").sort($"cnt".desc).show
+----+---+
|key1|cnt|
+----+---+
| bbb|  2|
| aaa|  1|
| ccc|  1|
+----+---+

The more common approach is to give the aggregate an alias directly, as follows:

/*
Equivalent SQL:
select key1, count(*) as cnt from table
group by key1
*/
scala> df.groupBy("key1").agg(count("key1").as("cnt")).show
+----+---+
|key1|cnt|
+----+---+
| ccc|  1|
| aaa|  1|
| bbb|  2|
+----+---+

Notice that we've now introduced the aggregation function agg. It's typically used together with groupBy and is quite flexible. Its three overloads are listed below, followed by examples of each; note the difference between the Column-typed and String-typed parameters:

def agg(expr: Column, exprs: Column*): DataFrame
def agg(exprs: Map[String, String]): DataFrame
def agg(aggExpr: (String, String), 
        aggExprs: (String, String)*): DataFrame

/*
Equivalent SQL:
select key1, count(key1), max(key2), avg(key3)
from table
group by key1
*/
scala> df.groupBy("key1").agg(count("key1"), max("key2"), avg("key3")).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

scala> df.groupBy("key1").agg("key1"->"count", "key2"->"max", "key3"->"avg").show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

scala> df.groupBy("key1").agg(Map(("key1","count"), ("key2","max"), ("key3","avg"))).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

scala> df.groupBy("key1").agg(("key1","count"), ("key2","max"), ("key3","avg")).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

/*
Equivalent SQL:
select key1, count(key1) cnt,
       max(key2) max_key2, avg(key3) avg_key3
from table
group by key1
order by cnt, max_key2 desc
*/

scala> df.groupBy("key1").agg(count("key1").as("cnt"), max("key2").as("max_key2"), avg("key3").as("avg_key3")).sort($"cnt",$"max_key2".desc).show
+----+---+--------+--------+
|key1|cnt|max_key2|avg_key3|
+----+---+--------+--------+
| ccc|  1|       3|     5.0|
| aaa|  1|       1|     2.0|
| bbb|  2|       4|     5.0|
+----+---+--------+--------+

That's about it for now. One last note: sort() and orderBy() are identical; orderBy is simply an alias for sort.
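To see that equivalence concretely, here is a sketch (session setup included only for self-containment) running the count-descending sort through both names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
import spark.implicits._

val df = Seq(("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4,6)).toDF("key1","key2","key3")

// orderBy is an alias for sort; both lines produce identical results:
df.groupBy("key1").count.orderBy($"count".desc, $"key1").show
df.groupBy("key1").count.sort($"count".desc, $"key1").show
```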

Previous: Spark Dataset Operations (2): Filtering with filter and where

Next: Spark Dataset Operations (4): Other Single-Table Operations

 

 
