hadoop - PIG Script How to -


i trying clean employee volunteer data. there no way track if employee registered volunteer can sign new volunteer , new volunteer_id. have data feeding can tie each volunteer_id emp_id. volunteer data needs cleaned can figure out how employee moved volunteer_level , when.

the business logic that, when there overlaping dates, give highest level employee timeframe of between start_date , end_date.

i posted input sample of data , output should be.

is possible pig script ? can please me

input:

emp_id  volunteer_id    v_level status  start_date  end_date 10001   100               1            1/1/2006   12/31/2007 10001   200               1            5/1/2006    10001   100               1            1/1/2008    10001   300               3      p       3/1/2008   3/1/2008 10001   300               3            3/2/2008   12/1/2008 10001   1001              2            5/1/2008   6/30/2008 10001   1001              3            7/1/2008    10001   300               2            12/2/2008   

output needed:( volunteer_id not needed in output adding below show id selected output , did not)

emp_id  volunteer_id    v_level status  start_date  end_date 10001   100              1             1/1/2006   12/31/2007 10001   300              3       p       3/1/2008   3/1/2008 10001   300              3             3/2/2008   12/1/2008 10001   1001             2             5/1/2008   6/30/2008 10001   1001             3             7/1/2008    

it seems want row in data earliest start date each v_level, status, emp_id, , volunteer_id

first add unix time column , find min column (this in latest version of pig may need update version).

data_with_unix = foreach data generate emp_id, volunteer_id, v_level, status, start_date, end_date, tounixtime((datetime)start_date) unix_time; grp = group data_with_unix (emp_id, volunteer_id, v_level, status); max_date = foreach grp generate group, min(data_with_unix.unix_time); 

then join start , end date dataset since there doesn't there way convert unix time date.