I'm working on the Yelp Dataset Challenge. The data is made up of large JSON files (up to 1 GB, 1M+ lines). I'd like to do data analytics on it, comparing data between files, e.g. linking a review in the review file to a business in the business file.
I have complete freedom over the platform and programming language. What is an efficient way to go about this, so that I can do easy, fast lookups going forward?
The JSON format is straightforward; an example is below. Fields such as "user_id" are unique and can be cross-referenced with entries in the other files.
{"votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "xqd0dzhaiyrqvh3wrg7hzg", "review_id": "15sdjuk7dmyquaj6rjgowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers in general practitioner. he's nice , easy talk without being patronizing; he's on time in seeing patients; he's affiliated top-notch hospital (nyu) parents have explained me important in case happens , need surgery; , can referrals see specialists without having see him first. really, more need? i'm sitting here trying think of complaints have him, i'm drawing blank.", "type": "review", "business_id": "vcnawilm4dr7d2nwwj7nca"}
Start by importing the data into a database.
You have the option of flattening things into multiple tables (if there are "nested" objects in the JSON), or keeping parts as JSON, if you use a database that can parse/index it (like PostgreSQL).
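As a concrete illustration of the flattening approach, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are my own choice, not part of the dataset): the nested "votes" object from the example record is pulled up into ordinary columns, and the column you will join on gets an index so lookups stay fast.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("""
    CREATE TABLE reviews (
        review_id    TEXT PRIMARY KEY,
        user_id      TEXT,
        business_id  TEXT,
        stars        INTEGER,
        date         TEXT,
        text         TEXT,
        votes_funny  INTEGER,
        votes_useful INTEGER,
        votes_cool   INTEGER
    )
""")

# One line from the review file (text abbreviated here).
line = ('{"votes": {"funny": 0, "useful": 2, "cool": 1}, '
        '"user_id": "xqd0dzhaiyrqvh3wrg7hzg", "review_id": "15sdjuk7dmyquaj6rjgowg", '
        '"stars": 5, "date": "2007-05-17", "text": "dr. goldberg ...", '
        '"type": "review", "business_id": "vcnawilm4dr7d2nwwj7nca"}')
rec = json.loads(line)
conn.execute(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (rec["review_id"], rec["user_id"], rec["business_id"], rec["stars"],
     rec["date"], rec["text"],
     rec["votes"]["funny"], rec["votes"]["useful"], rec["votes"]["cool"]),
)
# Index the join key so cross-file lookups don't require a full scan.
conn.execute("CREATE INDEX idx_reviews_business ON reviews(business_id)")
conn.commit()
```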
The choice of database is entirely up to you. You could use a classic SQL database (PostgreSQL, MySQL, SQL Server, SQLite...), or a document-oriented/NoSQL database such as MongoDB (which favours JSON-like data). It's a matter of what you're doing with the data (and what you're comfortable with).
You can then do whatever you want with the data...
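For instance, the cross-referencing asked about in the question (linking reviews to businesses) becomes an ordinary join once both files are loaded. A self-contained sketch with SQLite and made-up sample rows (the table layout and values here are illustrative, not from the dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE businesses (business_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (review_id TEXT PRIMARY KEY,
                          business_id TEXT, stars INTEGER);
    INSERT INTO businesses VALUES ('b1', 'Some Bistro');
    INSERT INTO reviews VALUES ('r1', 'b1', 5), ('r2', 'b1', 3);
""")

# Average star rating per business, linking the two "files" on business_id.
rows = conn.execute("""
    SELECT b.name, AVG(r.stars)
    FROM reviews r
    JOIN businesses b ON r.business_id = b.business_id
    GROUP BY b.business_id
""").fetchall()
print(rows)  # [('Some Bistro', 4.0)]
```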
Note that if a single file is > 1 GB, you may have to use custom tools for the import, as loading it all at once into memory (through the JSON-decoding functions of your favorite language) would be a bit much. Be careful, though: you still want to correctly parse the data, so avoid simplistic splits or regexes. You may want to look at the solutions listed in this thread: Is there a streaming API for JSON?
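One thing that helps here: the Yelp dataset files are newline-delimited, i.e. one complete JSON object per line. If that holds for your input, you don't need a streaming JSON parser at all; you can read line by line and decode one record at a time, keeping memory usage flat regardless of file size:

```python
import json

def stream_records(path):
    """Yield one decoded record per line of a newline-delimited JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Usage (filename assumed, adjust to your copy of the dataset):
# for rec in stream_records("yelp_academic_dataset_review.json"):
#     do_something_with(rec)
```

If the file were instead one giant JSON array, this trick would not apply and you would need a true streaming parser (e.g. ijson for Python), as discussed in the linked thread.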