I'm working on the Yelp Dataset Challenge. The data is made up of large JSON files (up to 1 GB, 1M+ lines). I'd like to do data analytics on it, comparing data between files, e.g. linking a review in the review file to a business in the business file.
I have complete freedom over the platform and programming language. What is an efficient way to go about this, so that I can do easy, fast lookups going forward?
The JSON format is straightforward; an example is below. Fields such as "user_id" are unique and can be cross-referenced with entries in the other files.
{"votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "xqd0dzhaiyrqvh3wrg7hzg", "review_id": "15sdjuk7dmyquaj6rjgowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers in general practitioner. he's nice , easy talk without being patronizing; he's on time in seeing patients; he's affiliated top-notch hospital (nyu) parents have explained me important in case happens , need surgery; , can referrals see specialists without having see him first. really, more need? i'm sitting here trying think of complaints have him, i'm drawing blank.", "type": "review", "business_id": "vcnawilm4dr7d2nwwj7nca"}
Start by importing the data into a database.
You have the option of flattening things into multiple tables (if there are "nested" objects in the JSON), or keeping parts as JSON, if you use a database that can parse/index it (like PostgreSQL).
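As a concrete illustration of the flattening approach, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are my own choice, not part of the dataset): the nested "votes" object from the example record is pulled up into ordinary columns, and the column you will join on gets an index so lookups stay fast.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("""
    CREATE TABLE reviews (
        review_id    TEXT PRIMARY KEY,
        user_id      TEXT,
        business_id  TEXT,
        stars        INTEGER,
        date         TEXT,
        text         TEXT,
        votes_funny  INTEGER,
        votes_useful INTEGER,
        votes_cool   INTEGER
    )
""")

# One line from the review file (text abbreviated here).
line = ('{"votes": {"funny": 0, "useful": 2, "cool": 1}, '
        '"user_id": "xqd0dzhaiyrqvh3wrg7hzg", "review_id": "15sdjuk7dmyquaj6rjgowg", '
        '"stars": 5, "date": "2007-05-17", "text": "dr. goldberg ...", '
        '"type": "review", "business_id": "vcnawilm4dr7d2nwwj7nca"}')
rec = json.loads(line)
conn.execute(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (rec["review_id"], rec["user_id"], rec["business_id"], rec["stars"],
     rec["date"], rec["text"],
     rec["votes"]["funny"], rec["votes"]["useful"], rec["votes"]["cool"]),
)
# Index the join key so cross-file lookups don't require a full scan.
conn.execute("CREATE INDEX idx_reviews_business ON reviews(business_id)")
conn.commit()
```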
The choice of database is entirely up to you. You could use a classic SQL database (PostgreSQL, MySQL, SQL Server, SQLite...), or a document-oriented/NoSQL database such as MongoDB (which favours JSON-like data). It's a matter of what you're doing with the data (and what you're comfortable with).
You can then do whatever you want with the data...
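For instance, the cross-referencing asked about in the question (linking reviews to businesses) becomes an ordinary join once both files are loaded. A self-contained sketch with SQLite and made-up sample rows (the table layout and values here are illustrative, not from the dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE businesses (business_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (review_id TEXT PRIMARY KEY,
                          business_id TEXT, stars INTEGER);
    INSERT INTO businesses VALUES ('b1', 'Some Bistro');
    INSERT INTO reviews VALUES ('r1', 'b1', 5), ('r2', 'b1', 3);
""")

# Average star rating per business, linking the two "files" on business_id.
rows = conn.execute("""
    SELECT b.name, AVG(r.stars)
    FROM reviews r
    JOIN businesses b ON r.business_id = b.business_id
    GROUP BY b.business_id
""").fetchall()
print(rows)  # [('Some Bistro', 4.0)]
```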
Note that if a single file is > 1 GB, you may have to use custom tools for the import, as loading it all at once into memory (through the JSON-decoding functions of your favorite language) would be a bit much. Be careful, though: you still want to correctly parse the data, so avoid simplistic splits or regexes. You may want to look at the solutions listed in this thread: Is there a streaming API for JSON?
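One thing that helps here: the Yelp dataset files are newline-delimited, i.e. one complete JSON object per line. If that holds for your input, you don't need a streaming JSON parser at all; you can read line by line and decode one record at a time, keeping memory usage flat regardless of file size:

```python
import json

def stream_records(path):
    """Yield one decoded record per line of a newline-delimited JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Usage (filename assumed, adjust to your copy of the dataset):
# for rec in stream_records("yelp_academic_dataset_review.json"):
#     do_something_with(rec)
```

If the file were instead one giant JSON array, this trick would not apply and you would need a true streaming parser (e.g. ijson for Python), as discussed in the linked thread.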