Tuesday, April 5, 2011

Number of lines in a file in Java

I use huge data files, and sometimes I only need to know the number of lines in them. Usually I open them and read them line by line until I reach the end of the file.

I was wondering if there is a smarter way to do that.

From stackoverflow
  • On Unix-based systems, use the wc command on the command-line.

    IainMH : wc -l for the line count..
    Paul : @IainmH, your second suggestion just counts the number of entries in the current directory. Not what was intended? (or asked for by the OP)
    PhiLho : @IainMH: that's what wc does anyway (reading the file, counting line-ending).
    IainMH : @PhiLho You'd have to use the -l switch to count the lines. (Don't you? - it's been a while)
    IainMH : @Paul - you are of course 100% right. My only defence is that I posted that before my coffee. I'm as sharp as a button now. :D
    Jason S : You can get wc.exe for Win32 systems: see http://unxutils.sourceforge.net/
  • The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data that gives you the average length of one line, then take the file size and divide it by that average length, but that won't be accurate.

    Esko : Interesting downvote. No matter what command line tool you're using, they all DO THE SAME THING anyway, only internally. There's no magic way to figure out the number of lines; they have to be counted by hand. Sure, it can be saved as metadata, but that's a whole other story...
    Richie_W : +1 to make you feel better.
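
The estimate described in this answer could look roughly like the sketch below: sample the start of the file, work out the average number of bytes per line in that sample, and divide the file size by it. The class name, method name and sample size are made up for illustration, and the result is only an approximation that assumes the first bytes are representative of the whole file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original answer.
final class LineCountEstimator {

    // Estimates the number of lines from the first sampleSize bytes.
    // Assumes sampleSize is positive and that lines end with '\n'.
    static long estimate(String filename, int sampleSize) throws IOException {
        File file = new File(filename);
        long fileSize = file.length();
        if (fileSize == 0) {
            return 0;
        }
        byte[] sample = new byte[(int) Math.min(sampleSize, fileSize)];
        int filled = 0;
        try (InputStream is = new FileInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full
            int n;
            while (filled < sample.length
                    && (n = is.read(sample, filled, sample.length - filled)) != -1) {
                filled += n;
            }
        }
        int newlines = 0;
        for (int i = 0; i < filled; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            return 1; // no line break in the sample: guess a single line
        }
        // average bytes per line in the sample, scaled up to the whole file
        double avgLineLength = (double) filled / newlines;
        return Math.round(fileSize / avgLineLength);
    }
}
```

For files whose line lengths vary a lot, a larger sample (or several samples from different offsets) would likely be needed, and the answer's caveat still applies: this is never exact.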
  • This is the fastest version I have found so far, about 6 times faster than readLines(). On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

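    // needs: java.io.BufferedInputStream, FileInputStream, InputStream, IOException
    public int count(String filename) throws IOException {
        // try-with-resources (Java 7+) closes the stream even if read() throws
        try (InputStream is = new BufferedInputStream(new FileInputStream(filename))) {
            byte[] c = new byte[1024];
            int count = 0;
            int readChars;
            while ((readChars = is.read(c)) != -1) {
                // count the '\n' bytes in the chunk just read
                for (int i = 0; i < readChars; ++i) {
                    if (c[i] == '\n') {
                        ++count;
                    }
                }
            }
            return count;
        }
    }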
    
    martinus : You were right, david. I thought the JVM would be good enough for this... I have updated the code; this one is faster.
    wds : BufferedInputStream should be doing the buffering for you, so I don't see how using an intermediate byte[] array will make it any faster. You're unlikely to do much better than using readLine() repeatedly anyway (since that is what the API will be optimized for).
    martinus : I've benchmarked it with and without the BufferedInputStream, and it is faster when using it.
    Mark : It's neat, thank you so much.
    bendin : You're going to close that InputStream when you're done with it, aren't you?
    Peter Lawrey : If buffering helped, it would be because BufferedInputStream buffers 8K by default. Increase your byte[] to this size or larger and you can drop the BufferedInputStream, e.g. try 1024*1024 bytes.
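
Peter Lawrey's suggestion could be sketched like this: read straight from the FileInputStream into a 1024*1024 byte array and skip the BufferedInputStream. The class name is made up for illustration, and whether this is actually faster than the version above has not been measured here; it would need to be benchmarked on the files in question.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical variant of count() above, not benchmarked.
final class LargeBufferLineCounter {

    // Counts '\n' bytes using one large read buffer and no BufferedInputStream.
    static int count(String filename) throws IOException {
        try (InputStream is = new FileInputStream(filename)) {
            byte[] buffer = new byte[1024 * 1024]; // size suggested in the comment
            int count = 0;
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                for (int i = 0; i < bytesRead; i++) {
                    if (buffer[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }
}
```

Like the original, this counts newline bytes, so a last line without a trailing newline is not counted.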
  • If you don't have any index structures, you can't avoid reading the complete file. But you can optimize it by not reading it line by line and instead using a regex to match all line terminators.

    willcodejavaforfood : Sounds like a neat idea. Anyone tried it and has a regexp for it?
    PhiLho : I doubt it is such a good idea: it will need to read the whole file at once (martinus avoids this) and regexes are overkill (and slower) for such usage (simple search of fixed char(s)).
    David Schmitt : @will: what about /\n/ ? @PhiLho: good point.
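
For anyone who wants to try the regex idea anyway, a minimal sketch might look like the following. It has exactly the drawback PhiLho mentions: the whole file is loaded into memory first, so it is not suitable for really huge files. The class name is made up, and decoding as ISO-8859-1 is an assumption chosen so that every byte maps to one character; it gives correct counts for ASCII-compatible encodings such as UTF-8, but not for UTF-16.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the regex approach, not from the original thread.
final class RegexLineCounter {

    // Matches Windows (\r\n), old Mac (\r) and Unix (\n) line terminators.
    private static final Pattern LINE_TERMINATOR = Pattern.compile("\r\n|\r|\n");

    // Counts line terminators; reads the entire file into memory first.
    static int count(String filename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        String content = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher matcher = LINE_TERMINATOR.matcher(content);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

This counts terminators rather than lines, so a final line without a terminator is not included.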
  • The answer with the method count() above gave me line miscounts if a file didn't have a newline at the end of the file - it failed to count the last line in the file.

    This method works better for me:

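    // needs: java.io.FileReader, java.io.IOException, java.io.LineNumberReader
    public int countLines(String filename) throws IOException {
        // try-with-resources (Java 7+) closes the reader even if readLine() throws
        try (LineNumberReader reader = new LineNumberReader(new FileReader(filename))) {
            // read to the end; the reader keeps track of the line number itself
            while (reader.readLine() != null) {
                // nothing to do per line
            }
            return reader.getLineNumber();
        }
    }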
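
The miscount described in this answer comes from counting '\n' bytes only: a file whose last line has no terminator contains one more line than newlines. If the speed of the byte-based loop matters, one possible fix is to remember whether the last byte read was a newline, as in the sketch below. The class name is made up, and the code assumes lines end in '\n' (so "\r\n" works, but bare "\r" terminators are not recognised).

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical byte-based counter that also counts an unterminated last line.
final class TrailingLineCounter {

    static int countLines(String filename) throws IOException {
        try (InputStream is = new FileInputStream(filename)) {
            byte[] buffer = new byte[8192];
            int count = 0;
            int bytesRead;
            boolean endsWithNewline = true; // an empty file has zero lines
            while ((bytesRead = is.read(buffer)) != -1) {
                for (int i = 0; i < bytesRead; i++) {
                    if (buffer[i] == '\n') {
                        count++;
                    }
                }
                if (bytesRead > 0) {
                    // remember how the most recent chunk ended
                    endsWithNewline = buffer[bytesRead - 1] == '\n';
                }
            }
            // add one for a final line that has no terminating newline
            return endsWithNewline ? count : count + 1;
        }
    }
}
```

With this adjustment, a file containing "a\nb" and a file containing "a\nb\n" are both reported as two lines.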
    
