apache.commons.csv: a CSV parsing toolkit built on NIO FileSystem

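Compared with a traditional FileReader, it offers two ways to read a CSV file: record by record, or loading everything into memory. The record-by-record approach works much like RandomAccessFile.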

Usage examples

Reading record by record:

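public static void readCsvForLines() throws IOException {
    CSVFormat defaultFormat = CSVFormat.DEFAULT;
    // Every backslash in the Windows path is escaped ("\\test.csv"; a bare "\t" would be a tab)
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv");
    int resultCount = 0;
    // try-with-resources closes the parser and its underlying reader
    try (CSVParser parser = CSVParser.parse(file, Charset.defaultCharset(), defaultFormat)) {
        // Records are parsed lazily, one per iteration
        for (CSVRecord csvRecord : parser) {
            String first = csvRecord.get(0); // first column of the current record
            resultCount += 1;
        }
    }
    System.out.println("touch result: " + resultCount);
}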

Loading into memory:

public static void readCsv() throws IOException {
    CSVFormat defaultFormat = CSVFormat.DEFAULT;
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv");
    // try-with-resources closes the parser and its underlying reader
    try (CSVParser parser = CSVParser.parse(file, Charset.defaultCharset(), defaultFormat)) {
        // getRecords() reads every remaining record into a List
        List<CSVRecord> records = parser.getRecords();
        System.out.println("touch result: " + records.size());
    }
}

Source code walkthrough

CSVParser

Constructor

Following the static method used in the examples eventually leads to this constructor:

public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber)
    throws IOException {
    Objects.requireNonNull(reader, "reader");
    Objects.requireNonNull(format, "format");
    this.format = format.copy();
    this.lexer = new Lexer(format, new ExtendedBufferedReader(reader));
    this.csvRecordIterator = new CSVRecordIterator();
    this.headers = createHeaders();
    this.characterOffset = characterOffset;
    this.recordNumber = recordNumber - 1;
}

The constructor uses the CSVFormat to build a Lexer instance, which performs lexical analysis of the CSV input.

It also creates the CSVRecordIterator instance.

getRecords

This method loads all records into memory:

public List<CSVRecord> getRecords() {
    return stream().collect(Collectors.toList());
}

The core logic lives in CSVParser#stream:

public Stream<CSVRecord> stream() {
    return StreamSupport.stream(Spliterators.spliteratorUnknownSize(iterator(), Spliterator.ORDERED), false);
}

Next, CSVParser#iterator:

public Iterator<CSVRecord> iterator() {
    return csvRecordIterator;
}

So the stream is simply the iterator wrapped with StreamSupport#stream. The iterator itself is created earlier, in the constructor:

this.csvRecordIterator = new CSVRecordIterator();
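One consequence of handing out the same iterator instance every time is that the resulting stream is effectively one-shot. The sketch below reproduces the same wrapping pattern with JDK classes only; Source is a hypothetical stand-in for CSVParser and is not part of Commons CSV.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class SharedIteratorStreamSketch {

    // Hypothetical stand-in for CSVParser: one iterator is created up front and always returned.
    static final class Source implements Iterable<String> {
        private final Iterator<String> shared;

        Source(List<String> rows) {
            this.shared = rows.iterator();
        }

        @Override
        public Iterator<String> iterator() {
            return shared; // same instance on every call, like csvRecordIterator
        }

        Stream<String> stream() {
            // Wrap the iterator in a sequential, ordered stream of unknown size
            return StreamSupport.stream(
                    Spliterators.spliteratorUnknownSize(iterator(), Spliterator.ORDERED), false);
        }

        List<String> getRecords() {
            return stream().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        Source source = new Source(Arrays.asList("a,1", "b,2", "c,3"));
        System.out.println(source.getRecords()); // first pass drains the shared iterator
        System.out.println(source.getRecords()); // second pass finds nothing left
    }
}
```

In this sketch the second getRecords() call returns an empty list, because both streams pull from the same, already exhausted iterator.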

CSVRecordIterator

An inner class defined in CSVParser that implements the Iterator interface:

final class CSVRecordIterator implements Iterator<CSVRecord> {
    private CSVRecord current;

It holds a single CSVRecord field that stores the content of one CSV record.

hasNext/next

The iterator implements hasNext and next. In this scenario, where does hasNext actually get called?

The answer is still in CSVParser#getRecords:

public List<CSVRecord> getRecords() {
    return stream().collect(Collectors.toList());
}

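At this point we are inside the JDK's own Stream implementation. ReferencePipeline#collect checks whether the stream is parallel; taking the sequential case as the example: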

else {
    container = evaluate(ReduceOps.makeRef(collector));
}

Drilling further down, starting from AbstractPipeline#evaluate:

// ---------------------AbstractPipeline#evaluate----------------------------------
return isParallel()
       ? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
       : terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));

// ------------------------------java.util.stream.ReduceOps.ReduceOp#evaluateSequential---------------
public <P_IN> R evaluateSequential(PipelineHelper<T> helper,
                                   Spliterator<P_IN> spliterator) {
    return helper.wrapAndCopyInto(makeSink(), spliterator).get();
}

// -------------------------java.util.stream.AbstractPipeline#wrapAndCopyInto----------------------------
final <P_IN, S extends Sink<E_OUT>> S wrapAndCopyInto(S sink, Spliterator<P_IN> spliterator) {
    copyInto(wrapSink(Objects.requireNonNull(sink)), spliterator);
    return sink;
}

// -------------------------java.util.stream.AbstractPipeline#copyInto------------------------------------
final <P_IN> void copyInto(Sink<P_IN> wrappedSink, Spliterator<P_IN> spliterator) {
    Objects.requireNonNull(wrappedSink);
    if (!StreamOpFlag.SHORT_CIRCUIT.isKnown(getStreamAndOpFlags())) {
        wrappedSink.begin(spliterator.getExactSizeIfKnown());
        spliterator.forEachRemaining(wrappedSink);
        wrappedSink.end();
    }
    ……
}

// ------------------java.util.Spliterators.IteratorSpliterator#forEachRemaining-----------------------
public void forEachRemaining(Consumer<? super T> action) {
    ……
    i.forEachRemaining(action);
}

// ------------------java.util.Iterator#forEachRemaining-------------------------------------------
default void forEachRemaining(Consumer<? super E> action) {
    Objects.requireNonNull(action);
    while (hasNext())
        action.accept(next());
}
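The call chain traced above can be checked empirically with an instrumented iterator. The sketch below uses only JDK classes; CountingIterator and collect are hypothetical helpers written for this illustration, and collect mirrors the shape of getRecords().

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

class IteratorCallTrace {

    /** Wraps an iterator and counts how often hasNext() and next() are invoked. */
    static final class CountingIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        int hasNextCalls;
        int nextCalls;

        CountingIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            hasNextCalls++;
            return delegate.hasNext();
        }

        @Override
        public T next() {
            nextCalls++;
            return delegate.next();
        }
    }

    /** Collects through a sequential stream, the same shape as getRecords(). */
    static <T> List<T> collect(Iterator<T> it) {
        return StreamSupport
                .stream(Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        CountingIterator<String> it =
                new CountingIterator<>(Arrays.asList("r1", "r2", "r3").iterator());
        List<String> rows = collect(it);
        System.out.println(rows + " hasNext=" + it.hasNextCalls + " next=" + it.nextCalls);
    }
}
```

With three elements, the default Iterator#forEachRemaining loop calls hasNext() once per element plus a final time that returns false, and next() once per element.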

Here we finally reach the call sites of Iterator#hasNext and Iterator#next:

public boolean hasNext() {
    if (CSVParser.this.isClosed()) {
        return false;
    }
    if (current == null) {
        current = getNextRecord();
    }
    return current != null;
}

hasNext uses current as a cursor: if it is empty, it calls getNextRecord to fill it, then reports whether a record was obtained.

public CSVRecord next() {
    if (CSVParser.this.isClosed()) {
        throw new NoSuchElementException("CSVParser has been closed");
    }
    CSVRecord next = current;
    current = null;
    if (next == null) {
        // hasNext() wasn't called before
        next = getNextRecord();
        if (next == null) {
            throw new NoSuchElementException("No more CSV records available");
        }
    }
    return next;
}

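next follows the same logic and is also built around getNextRecord. current always holds either the next record or null. next first takes whatever current holds and clears it; only if that value is null (hasNext was not called beforehand) does it call getNextRecord itself.

The benefit is that when the caller follows the usual hasNext -> next sequence, getNextRecord is called only once per record.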
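The look-ahead pattern described above can be isolated in a small standalone sketch. LookaheadIterator, its backing Queue, and the fetches counter are hypothetical names for this illustration; fetch() plays the role of getNextRecord and returns null when the source is exhausted.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;

class LookaheadIteratorSketch {

    static final class LookaheadIterator implements Iterator<String> {
        private final Queue<String> source; // hypothetical backing data
        private String current;             // fetched by hasNext() but not yet handed out
        int fetches;                        // how many times the source was asked

        LookaheadIterator(Queue<String> source) {
            this.source = source;
        }

        // Stand-in for getNextRecord(): returns null when nothing is left.
        private String fetch() {
            fetches++;
            return source.poll();
        }

        @Override
        public boolean hasNext() {
            if (current == null) {
                current = fetch(); // cache the record so next() does not fetch again
            }
            return current != null;
        }

        @Override
        public String next() {
            String next = current;
            current = null;
            if (next == null) { // hasNext() was not called before
                next = fetch();
                if (next == null) {
                    throw new NoSuchElementException("No more records");
                }
            }
            return next;
        }
    }

    public static void main(String[] args) {
        LookaheadIterator it = new LookaheadIterator(new ArrayDeque<>(Arrays.asList("r1", "r2")));
        it.hasNext();
        it.hasNext(); // repeated hasNext() reuses the cached record
        System.out.println(it.next() + " after " + it.fetches + " fetch(es)");
    }
}
```

Calling hasNext() repeatedly before next() costs a single fetch, and calling next() without hasNext() still works because it falls back to fetching directly.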

getNextRecord

private CSVRecord getNextRecord() {
    return Uncheck.get(CSVParser.this::nextRecord);
}

nextRecord

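Commons CSV defines a concept called Token, which can be thought of as one greedy read: the lexer consumes text until it reaches a CSV delimiter. If no delimiter follows, it reads to the end of the line, which should be the char '\n' (int 10). This is how record-by-record reading is achieved.

Token defines several types: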

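enum Type {
    // Invalid; an empty (freshly reset) token has this type
    INVALID,
    // A regular token; the record has not been fully read yet
    TOKEN,
    // End of file reached
    EOF,
    // End of record (end of line) reached
    EORECORD,
    // A comment, judging by the name; not encountered so far
    COMMENT
}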

Now for the method itself:

CSVRecord nextRecord() throws IOException {
    ……
    final long startCharPosition = lexer.getCharacterPosition() + characterOffset;

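It first records the current character position as the start of the record.

Next comes a do-while loop that keeps reading tokens and checks the type of each one: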

do {
    reusableToken.reset();
    lexer.nextToken(reusableToken);
    ……
} while (reusableToken.type == TOKEN);

The logic for reading one token is in Lexer#nextToken. After it runs, the reusable token holds as much content as could be read in one go.

    switch (reusableToken.type) {
    case TOKEN:
        addRecordValue(false);
        break;
    case EORECORD:
        addRecordValue(true);
        break;
    case EOF:
        if (reusableToken.isReady) {
            addRecordValue(true);
        } else if (sb != null) {
            trailerComment = sb.toString();
        }
        break;
    case INVALID:
        throw new IOException("(line " + getCurrentLineNumber() + ") invalid parse sequence");
    case COMMENT: // Ignored currently
        if (sb == null) { // first comment for this record
            sb = new StringBuilder();
        } else {
            sb.append(Constants.LF);
        }
        sb.append(reusableToken.content);
        reusableToken.type = TOKEN; // Read another token
        break;
    default:
        throw new IllegalStateException("Unexpected Token type: " + reusableToken.type);
    }

This switch, the elided body of the loop above, decides which kind of token was read. TOKEN and EORECORD both carry data, so addRecordValue is called to store the value.

if (!recordList.isEmpty()) {
    recordNumber++;
    final String comment = sb == null ? null : sb.toString();
    result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY), comment,
        recordNumber, startCharPosition);
}

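After the loop, if any values were collected, they are wrapped into a CSVRecord and returned as result.

From this logic it follows that the actual reading of a line must be encapsulated in Lexer.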
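To make the division of labour between the token loop and the lexer concrete, here is a heavily simplified sketch using only JDK classes. It is not the Commons CSV implementation: MiniCsvSketch, its Token, and nextToken are hypothetical, the delimiter is fixed to a comma, and quoting, comments, CR handling, and a trailing empty value at end of input are all ignored.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class MiniCsvSketch {
    enum Type { TOKEN, EORECORD, EOF }

    static final class Token {
        Type type;
        final StringBuilder content = new StringBuilder();
        boolean isReady; // true when the token carries a value

        void reset() {
            type = null;
            content.setLength(0);
            isReady = false;
        }
    }

    /** Reads one token: characters up to the next ',', '\n', or end of input. */
    static void nextToken(Reader in, Token token) throws IOException {
        while (true) {
            int c = in.read();
            if (c == ',') { token.type = Type.TOKEN; token.isReady = true; return; }
            if (c == '\n') { token.type = Type.EORECORD; token.isReady = true; return; }
            if (c == -1) {
                token.type = Type.EOF;
                token.isReady = token.content.length() > 0; // simplification of the real rule
                return;
            }
            token.content.append((char) c);
        }
    }

    /** Collects tokens until the record ends; returns null when no values were read. */
    static List<String> nextRecord(Reader in, Token reusable) throws IOException {
        List<String> values = new ArrayList<>();
        do {
            reusable.reset();
            nextToken(in, reusable);
            switch (reusable.type) {
                case TOKEN:
                case EORECORD:
                    values.add(reusable.content.toString());
                    break;
                case EOF:
                    if (reusable.isReady) {
                        values.add(reusable.content.toString());
                    }
                    break;
            }
        } while (reusable.type == Type.TOKEN); // keep going while the record is unfinished
        return values.isEmpty() ? null : values;
    }

    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        Reader in = new StringReader(text);
        Token token = new Token();
        try {
            List<String> rec;
            while ((rec = nextRecord(in, token)) != null) {
                records.add(rec);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // same idea as wrapping the checked exception
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(parse("a,b\nc,d"));
    }
}
```

The loop condition is the key point: a TOKEN means more values belong to the current record, while EORECORD and EOF both end it.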

Lexer

Lexer carries the reading logic.

Member fields

private final ExtendedBufferedReader reader;
// Configuration option: whether to ignore empty lines
private final boolean ignoreEmptyLines;

Lexer wraps an ExtendedBufferedReader:

final class ExtendedBufferedReader extends BufferedReader {
    /** The last char returned */
    private int lastChar = UNDEFINED;
    /** The count of EOLs (CR/LF/CRLF) seen so far */
    private long eolCounter;
    /** The position, which is the number of characters read so far */
    private long position;
    private boolean closed;

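This reader extends BufferedReader with a few bookkeeping fields (last character, EOL count, position) to support reading character by character.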
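The bookkeeping idea can be sketched with a small BufferedReader subclass. This is a simplified illustration rather than the actual ExtendedBufferedReader: CountingReader and scan are hypothetical names, only the single-character read() is tracked, and the position is assumed not to advance at end of input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CountingReaderSketch {

    static final class CountingReader extends BufferedReader {
        static final int UNDEFINED = -2; // no character has been read yet

        private int lastChar = UNDEFINED;
        private long eolCounter;
        private long position;

        CountingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            // CR always counts; LF counts only when it does not complete a CRLF pair
            if (c == '\r' || (c == '\n' && lastChar != '\r')) {
                eolCounter++;
            }
            if (c != -1) {
                position++; // assumption: end of input does not advance the position
            }
            lastChar = c;
            return c;
        }

        int getLastChar() { return lastChar; }
        long getEolCount() { return eolCounter; }
        long getPosition() { return position; }
    }

    /** Reads the whole text one character at a time; returns {position, eolCount}. */
    static long[] scan(String text) {
        try (CountingReader reader = new CountingReader(new StringReader(text))) {
            while (reader.read() != -1) {
                // consume every character
            }
            return new long[] { reader.getPosition(), reader.getEolCount() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] stats = scan("a,b\r\nc,d\ne");
        System.out.println("chars=" + stats[0] + " eols=" + stats[1]);
    }
}
```

Remembering the last character is what allows a CRLF pair to be counted as a single line ending.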
