
Spark Dataset Operations (2): Filtering with filter and where

[Date: 2017-09-29] Source: csdn  Author: 野男孩


I posted the first installment without checking it afterwards, and the formatting turned out to be a complete mess. If a kind reader hadn't pointed it out in the comments, I would never have known. Embarrassing. This time I'm switching to a Markdown editor and carrying on.

The previous post covered select; this time let's look at the where part. For filtering you can use either the filter function or the where function. The two are used identically: the official documentation states that where is simply an alias for filter.

The data is the same dataset we built in the previous post:

scala> val df = spark.createDataset(Seq(
         ("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))).toDF("key1", "key2", "key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

The filter function

The Spark documentation lists the following overloads of filter:

def filter(func: (T) ⇒ Boolean): Dataset[T]
def filter(conditionExpr: String): Dataset[T]
def filter(condition: Column): Dataset[T]

So all of the following forms are valid:

scala> df.filter($"key1">"aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+


scala> df.filter($"key1"==="aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter("key1='aaa'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter("key2=1").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter($"key2"===3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
+----+----+----+

scala> df.filter($"key2"===$"key3"-1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
+----+----+----+
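The examples above only exercise the Column and String overloads. The first overload, which takes a Scala function, works too; since df here is a DataFrame, T is Row, so the predicate receives a Row. A quick sketch (using the standard getAs accessor to pull key2 out by name):

scala> df.filter(row => row.getAs[Int]("key2") > 2).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

This form is handy when the condition is easier to express in plain Scala than as a Column expression.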

Here, === is an equality operator defined on the Column class; the corresponding not-equal operator is =!=.
The $"colname" syntax is syntactic sugar that returns a Column object.
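As an aside, $"key1", col("key1"), and df("key1") all produce the same Column, so they are interchangeable in these expressions (col comes from org.apache.spark.sql.functions, which spark-shell imports automatically). For example:

scala> df.filter(df("key1") === "aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+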

The where function

scala> df.where("key1 = 'bbb'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| bbb|   4|   6|
+----+----+----+


scala> df.where($"key2"=!= 3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   4|   6|
+----+----+----+


scala> df.where($"key3">col("key2")).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+


scala> df.where($"key3">col("key2")+1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
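Conditions can also be combined: the Column operators && and || work in the Column form, and SQL-style and/or work in the string form. A small sketch against the same data; both lines below return the same single row:

scala> df.where($"key1" === "bbb" && $"key2" > 3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   4|   6|
+----+----+----+

scala> df.where("key1 = 'bbb' and key2 > 3").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   4|   6|
+----+----+----+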

Alright, time to get back to my reading, so that's it for now. Next time I'll cover grouping and aggregation.

Previous: Spark Dataset Operations (1): Column selection with select

Next: Spark Dataset Operations (3): Grouping, aggregation, and sorting

 
