Cassandra batches with IF NOT EXISTS condition


When I send a batch of inserts to one table, with each row carrying an IF NOT EXISTS condition on its unique key, there is a problem if even one of the rows already exists.

I need the condition applied per row, not to the whole batch. Let's say I have a table "users" with one column "user_name" that contains the row "jhon", and I'm trying to import new users:

BEGIN BATCH
  INSERT INTO "users" ("user_name") VALUES ('jhon') IF NOT EXISTS;
  INSERT INTO "users" ("user_name") VALUES ('mandy') IF NOT EXISTS;
APPLY BATCH;

It does not insert "mandy" because "jhon" already exists. Can I isolate them?

I have a lot of rows to insert (100-200k), so I need to use a batch.

Thanks!

First, let me describe the documented, intended behavior:

In Cassandra 2.0.6 and later, you can batch conditional updates introduced as lightweight transactions in Cassandra 2.0. Only updates made to the same partition can be included in the batch because the underlying Paxos implementation works at the granularity of the partition. You can group updates that have conditions with those that do not, but when a single statement in a batch uses a condition, the entire batch is committed using a single Paxos proposal, as if all of the conditions contained in the batch apply.

That confirms it: your updates are to different partitions, and one Paxos proposal is going to be used, which means the entire batch will succeed or none of it will.

That said, with Cassandra, batches aren't meant to speed up bulk loads; they're meant to create pseudo-atomic logical operations. From http://docs.datastax.com/en/cql/3.1/cql/cql_using/usebatch.html :

Batches are often mistakenly used in an attempt to optimize performance. Unlogged batches require the coordinator to manage inserts, which can place a heavy load on the coordinator node. If other nodes own partition keys, the coordinator node needs to deal with a network hop, resulting in inefficient delivery. Use unlogged batches when making updates to the same partition key.

The coordinator node might also need to work hard to process a logged batch while maintaining consistency between tables. For example, upon receiving a batch, the coordinator node sends batch logs to two other nodes. In the event of a coordinator failure, the other nodes retry the batch. The entire cluster is affected. Use a logged batch to synchronize tables, as shown in this example:

In your schema, each insert goes to a different partition, so a batch is going to add a lot of load on the coordinator.

You can run your 200k inserts from a client using async executes, and they'll run quite fast, probably as fast as (or faster than) you'd see with a batch.
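
As a rough sketch of the async approach, assuming the DataStax Python driver (cassandra-driver): the function below issues one conditional INSERT per user through session.execute_async and caps the number of requests in flight. The function name insert_if_not_exists, the max_in_flight value, the contact point and the keyspace name are illustrative assumptions, not anything from your setup. Because each row is its own lightweight transaction, "jhon" being present no longer blocks "mandy".

```python
from collections import deque


def insert_if_not_exists(session, statement, user_names, max_in_flight=128):
    """Run one conditional INSERT per user name, at most max_in_flight at a time.

    session   -- assumed to behave like a cassandra-driver Session
    statement -- a prepared 'INSERT ... IF NOT EXISTS' taking one bind value
    Returns (inserted, skipped): names that were written, and names that
    already existed.
    """
    inserted, skipped = [], []
    pending = deque()  # (user_name, future) pairs, oldest first

    def drain_one():
        name, future = pending.popleft()
        result = future.result()  # blocks until this request finishes; raises on error
        # was_applied reflects the [applied] column of a conditional statement
        (inserted if result.was_applied else skipped).append(name)

    for name in user_names:
        if len(pending) >= max_in_flight:
            drain_one()  # wait for the oldest request before sending another
        pending.append((name, session.execute_async(statement, (name,))))

    while pending:
        drain_one()
    return inserted, skipped


# Hypothetical usage (contact point and keyspace are placeholders):
#
#   from cassandra.cluster import Cluster
#   session = Cluster(["127.0.0.1"]).connect("my_keyspace")
#   stmt = session.prepare(
#       "INSERT INTO users (user_name) VALUES (?) IF NOT EXISTS")
#   inserted, skipped = insert_if_not_exists(session, stmt, ["jhon", "mandy"])
```

The cap on in-flight requests is a judgment call: it keeps the client from flooding the cluster, and the right value depends on your cluster, so treat 128 as a starting point to tune rather than a recommendation.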