Let's say I have transaction data and visit data:
| visit | userid | visit source | timestamp |
|-------|--------|--------------|-----------|
|       |        | google ads   | 1         |
|       |        | facebook ads | 2         |

| transaction | userid | total price | timestamp |
|-------------|--------|-------------|-----------|
|             |        | 100         | 248384    |
|             | b      | 200         | 43298739  |
I want to join the transaction data with the visit data for sales attribution, and I want it in real time, whenever a transaction occurs (streaming).

Is it scalable to join one streaming record against a big historical dataset using the join function in Spark? The historical data here is the visit data, since a visit can happen at any time (e.g. a visit one year before the transaction occurs).
I did a join of historical data and streaming data in a project. The problem is that you have to cache the historical data in an RDD, and when the streaming data comes in, you can then run the join operations. It is a long process.
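A minimal sketch of that pattern, assuming the visits live in a CSV-like file on HDFS and the transactions arrive over a socket (the paths, ports, and field layout below are illustrative assumptions, not the original project's code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sales-attribution")
    val ssc  = new StreamingContext(conf, Seconds(10))
    val sc   = ssc.sparkContext

    // Historical visits keyed by userid: (userid, visitSource). Cached so every
    // micro-batch can join against it without re-reading the source.
    val visitRDD = sc.textFile("hdfs:///data/visits")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1)))
      .cache()

    // Streaming transactions keyed by userid: (userid, totalPrice).
    val transactionStream = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // Join each micro-batch of transactions against the cached historical visits.
    val attributed = transactionStream.transform(txRDD => txRDD.join(visitRDD))
    attributed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Every batch pays the cost of shuffling the cached visit RDD for the join, which is why this gets slow as the historical data grows.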
If you are updating the historical data, you have to keep two copies and use an accumulator so that you work on only one copy at a time, and updates won't affect the other copy.

For example,

transactionRDD is the stream RDD running at an interval. visitRDD is the historical data and is updated once a day, so you have to maintain two databases for visitRDD. While one database is being updated, transactionRDD keeps working with the cached copy of visitRDD, and once visitRDD has been refreshed, you switch over to that copy. It is complicated.
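A sketch of that two-copy switch, using an AtomicReference on the driver instead of an accumulator to publish the current copy (the names reloadVisits and visitPath are illustrative assumptions):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import java.util.concurrent.atomic.AtomicReference

object SwappableVisits {
  // Reference to whichever copy of visitRDD is currently "live".
  private val current = new AtomicReference[RDD[(String, String)]]()

  // Called once a day: build the fresh copy, then atomically switch readers to it.
  def reloadVisits(sc: SparkContext, visitPath: String): Unit = {
    val fresh = sc.textFile(visitPath)
      .map(_.split(","))
      .map(fields => (fields(0), fields(1)))
      .cache()
    fresh.count()                       // force materialization before publishing
    val old = current.getAndSet(fresh)  // transactionRDD batches now see the new copy
    if (old != null) old.unpersist()    // release the stale copy
  }

  // The streaming side: each micro-batch joins against whichever copy is current.
  def attach(transactions: DStream[(String, Double)]): Unit = {
    transactions.transform(txRDD => txRDD.join(current.get())).print()
  }
}
```

The transform closure runs on the driver for every batch, so each batch picks up whatever copy was published at that moment; batches already in flight keep the copy they started with.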