Join between streaming data and historical data in Spark


Let's say I have transaction data and visit data:

visit

| userid | visit source | timestamp |
|        | google ads   | 1         |
|        | facebook ads | 2         |

transaction

| userid | total price | timestamp |
|        | 100         | 248384    |
| b      | 200         | 43298739  |

I want to join the transaction data with the visit data to do sales attribution. I want to do this in real time, whenever a transaction occurs (streaming).

Is it scalable to join one piece of streaming data against a very big historical data set using the join function in Spark? The historical data is the visit data, since a visit can happen at any time (e.g. a visit can be 1 year before the transaction occurs).

I did a join of historical data and streaming data in my project. The problem here is that you have to cache the historical data in an RDD, and when the streaming data comes in, you can do the join operations against it. It is a long process, though.
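The caching idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: the historical visits are indexed once by userid (standing in for the cached RDD), and each incoming batch of transactions is joined against that index. The function names, the user ids `u1`, `u2`, `u3` and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict


def build_visit_index(visits):
    """Index historical visits by userid once, so each batch join is a lookup."""
    index = defaultdict(list)
    for userid, source, visit_ts in visits:
        index[userid].append((source, visit_ts))
    return dict(index)


def join_batch(transactions, visit_index):
    """Inner-join one batch of transactions against the cached visit index."""
    joined = []
    for userid, total_price, tx_ts in transactions:
        # Users with no recorded visit produce no output rows (inner join).
        for source, visit_ts in visit_index.get(userid, []):
            joined.append((userid, total_price, tx_ts, source, visit_ts))
    return joined


# Hypothetical sample data: (userid, visit source, timestamp)
historical_visits = [
    ("u1", "google ads", 1),
    ("u1", "facebook ads", 2),
    ("u2", "google ads", 3),
]
visit_index = build_visit_index(historical_visits)

# One hypothetical batch of streamed transactions: (userid, total price, timestamp)
batch = [("u1", 100, 248384), ("u3", 50, 248390)]
result = join_batch(batch, visit_index)
```

The index is built once and reused for every batch, which is the role the cached historical RDD plays in the approach described here. The cost that remains is keeping that cached copy in memory and up to date.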

If you are updating the historical data, you have to keep two copies and use an accumulator so that you work with only one copy at a time; that way the update won't affect the second copy.

For example:

transactionrdd is a stream RDD that you are running at some interval. visitrdd is historical, and you update it once a day. So you have to maintain two databases for visitrdd. While you are updating one database, transactionrdd can work with the cached copy of visitrdd, and once visitrdd has been updated, you switch to that copy. This is quite complicated.
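The two-copy scheme could look roughly like the sketch below. It is plain Python showing only the swap logic, not Spark code; the class name `VisitStore`, its methods and the sample data are hypothetical. Readers always see a complete copy, and the daily refresh builds the new copy off to the side before switching a single reference.

```python
import threading


class VisitStore:
    """Holds the active copy of the visit data and swaps it atomically."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._active = dict(initial)

    def snapshot(self):
        """Return the copy that stream batches should join against right now."""
        with self._lock:
            return self._active

    def refresh(self, load_fn):
        """Build a fresh copy outside the lock, then switch to it in one step."""
        new_copy = dict(load_fn())  # the slow part; readers keep using the old copy
        with self._lock:
            self._active = new_copy


# Hypothetical usage: a batch takes a snapshot, a daily job refreshes the store.
store = VisitStore({"u1": ["google ads"]})
before = store.snapshot()
store.refresh(lambda: {"u1": ["google ads", "facebook ads"], "u2": ["google ads"]})
after = store.snapshot()
```

A batch that took its snapshot before the refresh keeps working on the old copy, and later batches pick up the new one. That is the behaviour wanted from switching between the two copies of visitrdd.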