So I have one large file that contains a bunch of weather data. I have to allocate each line from the large file into its corresponding state file. There will be a total of 50 new state files, each with its own data.
The large file contains ~1 million lines of records like this:
    COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553
The name of the station can vary, though, and can have a different number of words.
This is the regular expression I am using:
    Pattern p = Pattern.compile(".* ([A-Z][A-Z]) US.*");
    Matcher m = p.matcher(line);
When I run the program, there are still lines in which the pattern is not found.
This is my program:
    package climate;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Program to read in a large file containing many stations and states,
     * and output in order the stations to the corresponding state file.
     *
     * Note: This can take a long time depending on the processor. It appends data
     * to the files, so you must remove the state files in the current directory
     * before running for accuracy.
     *
     * @author Marcus
     *
     */
    public class ClimateCleanStates {

        public static void main(String[] args) throws IOException {

            Scanner in = new Scanner(System.in);
            System.out
                    .println("Note: This program can take a long time depending on the processor.");
            System.out
                    .println("It is not necessary to run it if the state files are in the directory.");
            System.out
                    .println("But if you would like to see how it works, you may continue.");
            System.out.println("Please remove the state files before running.");

            System.out.println("\nIs the states directory empty?");
            String answer = in.nextLine();

            if (answer.equals("n")) {
                // Close the scanner before exiting; code after System.exit never runs
                in.close();
                System.exit(0);
            }

            System.out.println("Would you like to run the program?");
            String answer2 = in.nextLine();
            if (answer2.equals("n")) {
                in.close();
                System.exit(0);
            }

            String[] statesSpaced = new String[51];

            File stateFile, dir, inFile;

            // Create files for each of the states
            dir = new File("states");
            dir.mkdir();

            inFile = new File("climatedata.csv");
            FileReader fr = new FileReader(inFile);
            BufferedReader br = new BufferedReader(fr);

            String line;

            System.out.println();

            // Read in climatedata.csv
            // Need to implement ClimateRecord class
            final long start = System.currentTimeMillis();
            while ((line = br.readLine()) != null) {
                // Remove instances of -9999
                if (!line.contains("-9999")) {

                    Pattern p = Pattern.compile("^.* ([A-Z][A-Z]) US.*$");
                    Matcher m = p.matcher(line);
                    String stateFileName = null;

                    if (m.find()) {
                        //System.out.println(m.group(1));
                        stateFileName = m.group(1);
                    } else {
                        System.out.println("Could not find abbreviation");
                    }

                    /*
                    stateFileName = "states/" + stateFileName + ".csv";
                    stateFile = new File(stateFileName);

                    FileWriter stateWriter = new FileWriter(stateFile, true);
                    stateWriter.write(line + "\n");

                    // Progress reporting
                    System.out.printf("Writing [%s] to file [%s]\n", line, stateFile);

                    stateWriter.flush();
                    stateWriter.close();
                    */
                }
            }

            System.out.println("Elapsed " + (System.currentTimeMillis() - start) + " ms");

            br.close();
            fr.close();
            in.close();
        }
    }
I think you need lookaround assertions, which assert what should precede or follow the expression you're matching without including it in the result.
    (?<= )[A-Z][A-Z](?= US)

(?<= )
    There must be a space before the match.
[A-Z][A-Z]
    Two capital letters.
(?= US)
    There must be a space and the letters US after the match.
It might pay to be more robust with the lookaround: (?= US',) rather than (?= US), for instance.
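As a rough sketch of how that could look in Java: the class name StateExtractor and the method extractState are made up for illustration, and the closing quote in the lookahead assumes every record ends the station name with US' as the sample line does. The pattern is compiled once rather than on every line, and m.group() returns only the two letters because the lookarounds are not part of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program.
public class StateExtractor {

    // Lookbehind: a space must precede the two capitals.
    // Lookahead: " US'" must follow (the quote is an assumption about the data).
    private static final Pattern STATE =
            Pattern.compile("(?<= )[A-Z][A-Z](?= US')");

    // Returns the two-letter state code, or null if the line does not match.
    static String extractState(String line) {
        Matcher m = STATE.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line =
                "COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553";
        System.out.println(extractState(line)); // prints LA

        // Made-up record with no state/country suffix, to show the failure path.
        String bad = "COOP:000000,'SOME STATION',200001,1,2";
        String state = extractState(bad);
        if (state == null) {
            // Printing the line itself shows what the unmatched records look like.
            System.out.println("Could not find abbreviation in: " + bad);
        }
    }
}
```

Printing the offending line in the failure branch, instead of only a fixed message, should make it easier to see why some of the records do not match.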