When I try this, I get a wrong result in 'output', even though I am only copying the values of the 'cum' array into 'output'.
But if I rename the 'cum' array mentioned earlier in the code, I get the correct values in that array. So I am unable to reuse the result values.
The device has 8 cores and no shared memory.
Any comments/suggestions are appreciated.
    kernel void histogram(global unsigned int *input, global unsigned int *output,
                          global unsigned int *frequency, global unsigned int *cum,
                          unsigned int n)
    {
        int pid = get_global_id(0);

        // cumulative sum
        for (int i = 0; i < 16; i++) {
            cum[(i*16) + (2*pid) + 1] = frequency[(i*16) + (2*pid)] + frequency[(i*16) + (2*pid) + 1];
        }
        barrier(CLK_GLOBAL_MEM_FENCE);

        for (int i = 0; i < 32; i++) {
            output[(i*8) + pid] = cum[(i*8) + pid];
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
Make sure you understand parallel prefix sums. In particular, I don't see the down-sweep step for either the total sum or the partial sums:
http://http.developer.nvidia.com/gpugems3/gpugems3_ch39.html
I'd also look in TI's KeyStone II SDK, which you're using, to see whether this is an OpenCL device memory read/write issue, and whether it has scan or parallel prefix sum implementations or built-in functions.