i trying clean employee volunteer data. there no way track if employee registered volunteer can sign new volunteer , new volunteer_id. have data feeding can tie each volunteer_id emp_id. volunteer data needs cleaned can figure out how employee moved volunteer_level , when.
the business logic that, when there overlaping dates, give highest level employee timeframe of between start_date , end_date.
i posted input sample of data , output should be.
is possible pig script ? can please me
input:
emp_id volunteer_id v_level status start_date end_date 10001 100 1 1/1/2006 12/31/2007 10001 200 1 5/1/2006 10001 100 1 1/1/2008 10001 300 3 p 3/1/2008 3/1/2008 10001 300 3 3/2/2008 12/1/2008 10001 1001 2 5/1/2008 6/30/2008 10001 1001 3 7/1/2008 10001 300 2 12/2/2008
output needed:( volunteer_id not needed in output adding below show id selected output , did not)
emp_id volunteer_id v_level status start_date end_date 10001 100 1 1/1/2006 12/31/2007 10001 300 3 p 3/1/2008 3/1/2008 10001 300 3 3/2/2008 12/1/2008 10001 1001 2 5/1/2008 6/30/2008 10001 1001 3 7/1/2008
it seems want row in data earliest start date each v_level
, status
, emp_id
, , volunteer_id
first add unix time column , find min column (this in latest version of pig may need update version).
data_with_unix = foreach data generate emp_id, volunteer_id, v_level, status, start_date, end_date, tounixtime((datetime)start_date) unix_time; grp = group data_with_unix (emp_id, volunteer_id, v_level, status); max_date = foreach grp generate group, min(data_with_unix.unix_time);
then join start , end date dataset since there doesn't there way convert unix time date.