floating point - What's an efficient way to round a signed single precision float to the nearest integer?


float input = whatever;
long output = (long)(0.5f + input);

This is inefficient for my application on the MSP430, which uses the compiler-supplied floating-point addition support library.

I'm musing that there may be a clever 'trick' for this particular kind of 'nearest integer' rounding, avoiding plain floating-point addition by perhaps 'bit twiddling' the floating-point representation directly, but I have yet to find suchlike. Can anyone suggest such a trick for rounding IEEE 754 32-bit floats?

Conversion by bit operations is straightforward, and is demonstrated by the C code below. Based on the comment about data types on the MSP430, the code assumes that int comprises 16 bits, and long 32 bits.

We need a means of transferring the bit pattern of a float to an unsigned long as efficiently as possible. This implementation uses a union for this; your platform may have more efficient machine-specific ways, e.g. an intrinsic. In the worst case, use memcpy() to copy the bytes.
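As a minimal sketch of the memcpy() fallback mentioned above: float_bits is a hypothetical helper name, not part of the answer's code, and uint32_t stands in for unsigned long so the example also behaves on hosts where long is wider than 32 bits. It assumes float is a 32-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name is an assumption): return the raw bit
   pattern of a float. Assumes float is a 32-bit IEEE 754 type. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* byte-wise copy of the representation */
    return u;
}
```

For example, float_bits(1.0f) yields 0x3f800000: sign 0, biased exponent 127, mantissa field 0.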

There are a few cases to distinguish. We can examine the exponent field of the float input to tease them apart. If the argument is too large in magnitude (>= 2**31), or is a NaN, the conversion fails. One convention is to return the smallest negative integer operand in this case. If the magnitude of the input is less than 0.5, the result is zero. After eliminating these special cases we are left with those inputs that require a small bit of computation to convert.
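The case split can be sketched on its own using only the biased exponent field. The enum, its constants, and the function name classify are hypothetical names introduced for illustration; the thresholds 157 and 126 are the ones the answer's code uses. The sketch assumes a 32-bit IEEE 754 float and uses uint32_t with memcpy() rather than the union.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the three cases described in the text. */
enum range_class { RANGE_ZERO, RANGE_COMPUTE, RANGE_FAIL };

/* Classify a float by its biased exponent (bits 30..23). */
enum range_class classify(float a)
{
    uint32_t ia, expo;

    memcpy(&ia, &a, sizeof ia);
    expo = (ia >> 23) & 0xff;
    if (expo > 157) {
        return RANGE_FAIL;      /* |a| >= 2**31, infinity, or NaN */
    }
    if (expo < 126) {
        return RANGE_ZERO;      /* |a| < 0.5, including zeros and denormals */
    }
    return RANGE_COMPUTE;       /* 0.5 <= |a| < 2**31: needs shift and round */
}
```

The sign bit is masked off by the `& 0xff`, so positive and negative inputs of the same magnitude fall into the same class.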

For sufficiently large arguments, the float is always an integer, in which case we just need to shift the mantissa pattern to the correct bit position. If the input is too small to be guaranteed an integer, we convert to a 32.32 fixed-point format. The rounding is then based on the most significant fraction bit and, in the case of a tie, on the least significant integer bit as well, since ties must be rounded to even.

If the tie cases are supposed to round away from zero, the rounding logic in the code simplifies to

r = r + (t >= 0x80000000ul); 
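To make the two tie rules concrete, here is a small sketch that isolates just the rounding step on a magnitude split into integer and fraction parts. The function names round_ties_even and round_ties_away are hypothetical; the expressions mirror the ones in the answer, with uint32_t substituted for unsigned long so the sketch does not depend on the width of long.

```c
#include <stdint.h>

/* ipart: integer bits of the magnitude.
   frac:  fraction bits, left-aligned in 32 bits, so 0x80000000
          means the fraction is exactly one half. */

/* Round to nearest, ties to even. */
uint32_t round_ties_even(uint32_t ipart, uint32_t frac)
{
    return ipart + ((frac > 0x80000000u) |
                    ((frac == 0x80000000u) & (ipart & 1u)));
}

/* Round to nearest, ties away from zero (applied to the magnitude). */
uint32_t round_ties_away(uint32_t ipart, uint32_t frac)
{
    return ipart + (frac >= 0x80000000u);
}
```

For the magnitude 2.5 (ipart = 2, frac = 0x80000000), round_ties_even gives 2 while round_ties_away gives 3; for 3.5 both give 4.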

Below is float_to_long_round_nearest(), which implements the approach discussed above, along with a test framework that tests this implementation exhaustively.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Assumes 32-bit long and 32-bit IEEE 754 float, as on the MSP430 */
long float_to_long_round_nearest (float a)
{
    volatile union {
        float f;
        unsigned long i;
    } cvt;
    unsigned long r, ia, t, expo;

    cvt.f = a;
    ia = cvt.i;
    expo = (ia >> 23) & 0xff;
    if (expo > 157) {        /* magnitude too large (>= 2**31) or NaN */
        r = 0x80000000ul;
    } else if (expo < 126) { /* magnitude too small ( < 0.5) */
        r = 0x00000000ul;
    } else {
        int shift = expo - 150;
        t = (ia & 0x007ffffful) | 0x00800000ul;
        if (expo >= 150) {   /* argument is an integer, shift left */
            r = t << shift;
        } else {
            r = t >> (-shift);
            t = t << (32 + shift);
            /* round to nearest or even */
            r = r + ((t > 0x80000000ul) | ((t == 0x80000000ul) & (r & 1)));
        }
        if ((long)ia < 0) {  /* negate result if argument negative */
            r = -(long)r;
        }
    }
    return (long)r;
}

long reference (float a)
{
    return (long)rintf (a);
}

int main (void)
{
    volatile union {
        float f;
        unsigned long i;
    } arg;
    long res, ref;

    /* step through all 2**32 bit patterns until arg.i wraps to zero */
    arg.i = 0x00000000ul;
    do {
        res = float_to_long_round_nearest (arg.f);
        ref = reference (arg.f);
        if (res != ref) {
            printf ("arg=%08lx % 15.8e  res=%08lx  ref=%08lx\n",
                    arg.i, arg.f, (unsigned long)res, (unsigned long)ref);
            return EXIT_FAILURE;
        }
        arg.i++;
    } while (arg.i);
    return EXIT_SUCCESS;
}