Cannot generalize my Genetic Algorithm to new data


I've written a GA that models a handful of stocks (4) over a period of time (5 years). It's impressive how well the GA can find an optimal solution on the training data, though I'm aware this is largely due to its tendency to over-fit during the training phase.

However, I still thought I would take a few precautions and attempt some kind of prediction on a set of unseen test stocks over the same period.

One precaution I took was this: when multiple stocks can be bought on the same day, the GA buys only one from the list, chosen randomly. I thought this randomness might help avoid over-fitting?
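As a minimal sketch of that precaution (the function name and ticker symbols are hypothetical, not from the original code), the random tie-break among same-day buy candidates might look like:

```python
import random

def pick_buy(candidates, rng=None):
    """When several stocks qualify for purchase on the same day,
    buy exactly one, chosen uniformly at random.
    `candidates` is a list of ticker symbols."""
    rng = rng or random
    if not candidates:
        return None
    return rng.choice(candidates)

# e.g. pick_buy(["AAPL", "MSFT", "XOM"]) returns one of the three
```

Passing a seeded `random.Random` as `rng` makes runs reproducible, which helps when comparing training and test curves.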

Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA, since it hasn't had a chance to over-fit yet?

As a side note, I'm aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters that produces optimal output on two different datasets. If we take this further, does the no-free-lunch theorem prohibit generalization altogether?

The graph below illustrates this: the blue line is the GA output, the red line is the training data (slightly different because of the aforementioned randomness), and the yellow line is the stubborn test data, which shows no generalization. In fact, it's a flattering graph to produce.

The y-axis is profit; the x-axis is trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).

Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and to increase the number of training examples. The graph below uses 12 training stocks rather than 4, and shows the first 200 generations (instead of 1,000). Again, it's a flattering chart to produce, this time with medium selection pressure. It looks a little bit better, but not fantastic either. The red line is the test data.


The problem with over-fitting is that, within a single data-set, it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:

  • A GA will learn whatever you attach fitness to. If you tell it to get good at predicting one series of stocks, it will do exactly that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating individuals on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of approach.
  • You are correct that over-fitting should be less of a problem early on. Generally, the longer you run the GA, the more over-fitting becomes possible. Typically, people tend to assume that general rules are learned first, before the rote memorization of over-fitting takes place. However, I don't think I've ever seen this studied rigorously - I can imagine a scenario where over-fitting is easier than finding general rules and so happens first. I have no idea how common that is, though. Note that stopping early will also reduce the ability of the GA to find better general solutions.
  • Using a larger data-set (four stocks isn't many) will make the GA less susceptible to over-fitting.
  • The randomness is an interesting idea. It will hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which effect would win out.
  • That's an interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - fitting your data better will, by necessity, make your results fit other data worse. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem tends to produce data that clusters relatively closely together, relative to the entire space of possible data. So, within the set of inputs you care about, it is possible to do better. There is some upper limit on how well you can do, and it is possible that you have hit that upper limit with your data-set, but generalization is possible to an extent, so I wouldn't give up yet.
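The first guideline above, swapping which stocks the GA is scored on, could be sketched as a per-generation rotation of fitness cases (simpler than the spatial SCALP scheme; the `evaluate` stub and all names here are hypothetical, standing in for a real backtest):

```python
import random

def evaluate(individual, stock):
    # Hypothetical per-stock fitness: in a real system this would be
    # the profit of this trading strategy backtested on one series.
    return random.random()

def rotating_fitness(individual, all_stocks, generation, k=3):
    """Score an individual on a different subset of stocks each
    generation. Seeding with the generation number means every
    individual in a generation sees the same subset, so no single
    series can be memorized across the whole run."""
    rng = random.Random(generation)
    subset = rng.sample(all_stocks, k)
    return sum(evaluate(individual, s) for s in subset) / k
```

Because the subset changes between generations, strategies that only memorize one series get punished the next time the cases rotate.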

Bottom line: I think varying the test cases shows some promise (although I'm biased, because that's one of my primary areas of research), but it is a challenging solution, implementation-wise. A simpler fix you can try is stopping the evolution sooner, or increasing your data-set.
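The "stop the evolution sooner" fix can be made less arbitrary by watching fitness on held-out stocks and halting once it stalls. A minimal sketch, assuming hypothetical `step` (advance one generation) and `val_fitness` (score on unseen stocks) helpers:

```python
def evolve_with_early_stopping(population, step, val_fitness, patience=20):
    """Run the GA, but stop once the best validation fitness has not
    improved for `patience` consecutive generations. Returns the best
    individual seen on the validation set and its score."""
    best_val, best_ind, stale = float("-inf"), None, 0
    while stale < patience:
        population = step(population)
        champion = max(population, key=val_fitness)
        score = val_fitness(champion)
        if score > best_val:
            best_val, best_ind, stale = score, champion, 0
        else:
            stale += 1
    return best_ind, best_val
```

Keeping the individual that scored best on the validation stocks, rather than the last generation's champion, is the GA analogue of early stopping in other learners: training fitness keeps climbing, but you take the model from before over-fitting set in.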