博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
pig笔记
阅读量:6428 次
发布时间:2019-06-23

本文共 2828 字,大约阅读时间需要 9 分钟。

原谅我只是拿这个当笔记来写了,最近写的就这几个常用的

1.基本使用

REGISTER /home/vlab/ykt/StandartTrjn.jarDEFINE StandartTrjn com.zhangdan.pig.StandartTrjn();data_load = load 'yktdata/pid06/pid06_09.csv' using PigStorage(',') as (account:chararray,des:chararray,jntime:chararray);data_group = group data_load by account;--一次刷卡记录data_seq = foreach data_group{    sorted = order data_load by jntime;    generate flatten(StandartTrjn(sorted));}store data_seq into 'yktdata/pid06_result09/standart' using PigStorage(',');----------生成所有的关系对REGISTER /home/vlab/ykt/GetConnection.jarDEFINE GetConnection com.zhangdan.ykt.GetConnection();data_load = load 'yktdata/pid06_result09/standart/part-r-*' using PigStorage(',') as (account:chararray,des:chararray,year:chararray,month:chararray,day:chararray,jnstart:chararray,jnend:chararray,times:int);data_group = group data_load by (year,month,day,des);data_seq = foreach data_group{   valid = distinct data_load;   sorted = order valid by jnstart;   generate flatten(GetConnection(sorted));};--data_dis = distinct data_seq;store data_seq into 'yktdata/pid06_result09/connection' using PigStorage(','); --------统计相遇次数data_load = load 'yktdata/pid06_result09/connection/part-r-*' using PigStorage(',') as (account1:chararray,account2:chararray,jntime1:chararray,jntime2:chararray);data_group = group data_load by (account1,account2);data_count = foreach data_group{    generate flatten(group),COUNT(data_load) as cc;};data_order = order data_count by cc desc;store data_order into'yktdata/pid06_result09/connectioncount' using PigStorage(',');

2.这个是大师姐给我提供的,将两条相连记录合并

REGISTER /home/vlab/markovPairsjar/datafu-1.2.0.jar;DEFINE MarkovPairs datafu.pig.stats.MarkovPairs();   ---××××××××××××REGISTER /home/vlab/ykt/gettimebysecond.jarDEFINE getSecondtime com.zhangdan.pig.GetTimebySecond();data_load = load 'yktdata/pid06_10.csv' using PigStorage(',') as (account:chararray,des:chararray,jntime:chararray);data_group = group data_load by account;--连接连续的两次刷卡:卡号,地点,刷卡时间,下次刷卡时间data_seq = foreach data_group{    sorted = order data_load by jntime;    pair = MarkovPairs(sorted); ---×××××××××××××    generate flatten(pair) as (elem1:TUPLE(account:chararray,des:chararray,jntime:chararray),elem2:TUPLE(account:chararray,des:chararray,jntime:chararray));}--连接连续的两次刷卡--卡号,地点,刷卡时间,下次刷卡时间,时间差data_long = foreach data_seq{     generate elem1.account as account,elem1.des as des1,RTRIM(elem1.jntime) as jnstart,elem2.des as des2,RTRIM(elem2.jntime) as jnen,getSecondtime(elem1.jntime,elem2.jntime) as resu;};--data_result = filter data_long by resu<5*60;--data_result = foreach data_seq generate flatten(elem1),flatten(elem2);store data_long into 'yktdata/combine' using PigStorage(',');

pig对于刚刚处理大量数据的人来讲真的方便好多,对于不擅长写代码的孩纸更是容易不少,

讲真,掌握一门语言如java或者python,应该可以帮我们得到任意形式的数据,千万不要仅仅依赖pig

转载于:https://www.cnblogs.com/xunyingFree/p/5282137.html

你可能感兴趣的文章
oracle 10g 数据泵导入导出
查看>>
mysql自动备份
查看>>
简单纪要:java后台实现 往url上传文件
查看>>
linux上安装svn客户端,支持https协议
查看>>
lamp的shell脚本优化
查看>>
Apache编译安装
查看>>
如何在GitHub上大显身手?
查看>>
Web JS框架等讨论
查看>>
飞机游戏软件:C语言应用初步感受
查看>>
关于多实例redis主从+Keepalived故障切换的解决方法
查看>>
关键字与标识符
查看>>
申论作答攻略:两招完成准确立意
查看>>
如何使浏览器打开时,默认的文档模式就是标准模式
查看>>
js中对数组循环的方法简单总结
查看>>
nginx 1.8.1 添加ssl模块
查看>>
团队-团队编程项目作业名称-模块测试过程
查看>>
正则表达式和grep
查看>>
[golang] 数据结构-直接插入排序
查看>>
centos7.2 KVM基本安装+嵌套虚拟化参数设置
查看>>
面试官问:ZooKeeper 一致性协议 ZAB 原理
查看>>