
apache.commons.csv

FatFish1
2025-01-07

A CSV parsing toolkit based on the NIO FileSystem.

Compared with the traditional FileReader, it offers two approaches to reading CSV: record-by-record iteration and loading everything into memory. The record-by-record approach is similar in spirit to RandomAccessFile.

Usage examples

Reading record by record:

public static void readCsvForLines() throws IOException {
    CSVFormat defaultFormat = CSVFormat.DEFAULT;
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv");
    CSVParser parse = CSVParser.parse(file, Charset.defaultCharset(), defaultFormat);
    int resultCount = 0;
    for (CSVRecord record : parse) {
        String s = record.get(0);
        resultCount += 1;
    }
    System.out.println("touch result: " + resultCount);
}

Loading into memory:

public static void readCsv() throws IOException {
    CSVFormat defaultFormat = CSVFormat.DEFAULT;
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv");
    CSVParser parse = CSVParser.parse(file, Charset.defaultCharset(), defaultFormat);
    List<CSVRecord> records = parse.getRecords();
    System.out.println("touch result: " + records.size());
}
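
One side note (my own addition, not from the original examples): CSVParser implements Closeable, so in real code it is better to release it, for instance with try-with-resources:

public static void readCsvAndClose() throws IOException {
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv"); // hypothetical path
    // try-with-resources closes the parser (and its underlying reader) automatically
    try (CSVParser parser = CSVParser.parse(file, Charset.defaultCharset(), CSVFormat.DEFAULT)) {
        for (CSVRecord record : parser) {
            System.out.println(record.get(0));
        }
    }
}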

Source code walkthrough

CSVParser

Constructor

Following the static factory methods used in the examples, we eventually arrive at this constructor:

public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber)
    throws IOException {
    Objects.requireNonNull(reader, "reader");
    Objects.requireNonNull(format, "format");
    this.format = format.copy();
    this.lexer = new Lexer(format, new ExtendedBufferedReader(reader));
    this.csvRecordIterator = new CSVRecordIterator();
    this.headers = createHeaders();
    this.characterOffset = characterOffset;
    this.recordNumber = recordNumber - 1;
}

Here a Lexer instance is constructed from the CSVFormat; it is the component that performs lexical analysis of the CSV input.

The other notable piece is the CSVRecordIterator instance.
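
As an aside, a parser can also be built straight from a Reader through the public two-argument CSVParser(Reader, CSVFormat) constructor, which should end up in the same place as the constructor above. A minimal sketch, with a hypothetical file path:

public static void readCsvFromReader() throws IOException {
    // Build the parser directly from a Reader (path below is hypothetical);
    // try-with-resources closes both the parser and the underlying reader
    try (Reader reader = new FileReader("C:\\Users\\me\\Desktop\\temp\\test.csv");
         CSVParser parser = new CSVParser(reader, CSVFormat.DEFAULT)) {
        parser.forEach(record -> System.out.println(record.get(0)));
    }
}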

getRecords

This is the method that loads all records into memory:

public List<CSVRecord> getRecords() {
    return stream().collect(Collectors.toList());
}

The core logic is in CSVParser#stream:

public Stream<CSVRecord> stream() {
    return StreamSupport.stream(Spliterators.spliteratorUnknownSize(iterator(), Spliterator.ORDERED), false);
}

Continue into CSVParser#iterator:

public Iterator<CSVRecord> iterator() {
    return csvRecordIterator;
}

So the stream is simply the iterator wrapped by StreamSupport#stream; the iterator itself is the one created in the constructor we saw earlier:

this.csvRecordIterator = new CSVRecordIterator();
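
This wrapping is the standard JDK way of turning any Iterator into a Stream. A minimal generic sketch (plain JDK code, not commons-csv) of the same pattern:

import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class IteratorStreamDemo {
    // Wrap any Iterator into a sequential, ordered Stream -- the same call CSVParser#stream makes
    static <T> Stream<T> toStream(Iterator<T> iterator) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        // Collecting the stream drains the iterator element by element, just like getRecords
        List<String> letters = toStream(List.of("a", "b", "c").iterator())
            .collect(Collectors.toList());
        System.out.println(letters); // prints [a, b, c]
    }
}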

CSVRecordIterator

An inner class defined inside CSVParser; it implements the Iterator interface:

final class CSVRecordIterator implements Iterator<CSVRecord> {
    private CSVRecord current;

It holds a single CSVRecord field, which stores the content of a CSV row.

hasNext/next

The iterator implements the hasNext and next methods. In this scenario, where does hasNext actually get called?

Back in CSVParser#getRecords:

public List<CSVRecord> getRecords() {
    return stream().collect(Collectors.toList());
}

At this point we are inside the JDK's own Stream implementation. ReferencePipeline#collect checks whether the stream is parallel; taking the sequential case as an example:

else {
    container = evaluate(ReduceOps.makeRef(collector));
}

Drill further into AbstractPipeline#evaluate:

// ---------------------AbstractPipeline#evaluate----------------------------------
return isParallel()
       ? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
       : terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
// ------------------------------java.util.stream.ReduceOps.ReduceOp#evaluateSequential---------------
public <P_IN> R evaluateSequential(PipelineHelper<T> helper,
                                   Spliterator<P_IN> spliterator) {
    return helper.wrapAndCopyInto(makeSink(), spliterator).get();
}
// -------------------------java.util.stream.AbstractPipeline#wrapAndCopyInto----------------------------
final <P_IN, S extends Sink<E_OUT>> S wrapAndCopyInto(S sink, Spliterator<P_IN> spliterator) {
    copyInto(wrapSink(Objects.requireNonNull(sink)), spliterator);
    return sink;
}
// -------------------------java.util.stream.AbstractPipeline#copyInto------------------------------------
final <P_IN> void copyInto(Sink<P_IN> wrappedSink, Spliterator<P_IN> spliterator) {
    Objects.requireNonNull(wrappedSink);
    if (!StreamOpFlag.SHORT_CIRCUIT.isKnown(getStreamAndOpFlags())) {
        wrappedSink.begin(spliterator.getExactSizeIfKnown());
        spliterator.forEachRemaining(wrappedSink);
        wrappedSink.end();
    }
    ……
}
// ------------------java.util.Spliterators.IteratorSpliterator#forEachRemaining-----------------------
public void forEachRemaining(Consumer<? super T> action) {
    ……
    i.forEachRemaining(action);
}
// ------------------java.util.Iterator#forEachRemaining-------------------------------------------
default void forEachRemaining(Consumer<? super E> action) {
    Objects.requireNonNull(action);
    while (hasNext())
        action.accept(next());
}

OK, here we finally reach the call sites of Iterator#hasNext and Iterator#next:

public boolean hasNext() {
    if (CSVParser.this.isClosed()) {
        return false;
    }
    if (current == null) {
        current = getNextRecord();
    }
    return current != null;
}

hasNext essentially uses current as a cursor: it calls getNextRecord to populate it and then checks whether the result is non-null.

public CSVRecord next() {
    if (CSVParser.this.isClosed()) {
        throw new NoSuchElementException("CSVParser has been closed");
    }
    CSVRecord next = current;
    current = null;
    if (next == null) {
        // hasNext() wasn't called before
        next = getNextRecord();
        if (next == null) {
            throw new NoSuchElementException("No more CSV records available");
        }
    }
    return next;
}

The logic of next is the same, with getNextRecord at its core. current always points to the next record or to null; next first takes the object current points to, and if that turns out to be null, it calls getNextRecord once more.

The point of this arrangement is that when the caller goes through the usual hasNext -> next chain, one redundant getNextRecord call is avoided.
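
This is the classic single-element look-ahead iterator pattern. A minimal generic sketch of the same idea (my own illustration over a hypothetical element source, not the library's code):

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Generic look-ahead iterator: the supplier returns null when exhausted,
// mirroring how CSVRecordIterator leans on getNextRecord()
class LookAheadIterator<T> implements Iterator<T> {
    private final Supplier<T> fetchNext; // hypothetical element source
    private T current;                   // the element fetched ahead, or null

    LookAheadIterator(Supplier<T> fetchNext) {
        this.fetchNext = fetchNext;
    }

    @Override
    public boolean hasNext() {
        if (current == null) {
            current = fetchNext.get(); // fetch ahead at most once
        }
        return current != null;
    }

    @Override
    public T next() {
        T next = current;
        current = null;
        if (next == null) {
            next = fetchNext.get();    // hasNext() wasn't called before
            if (next == null) {
                throw new NoSuchElementException();
            }
        }
        return next;
    }
}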

getNextRecord

private CSVRecord getNextRecord() {
    return Uncheck.get(CSVParser.this::nextRecord);
}
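
Uncheck.get simply runs nextRecord and rethrows its checked IOException as an unchecked exception, so the Iterator methods don't need a throws clause (in recent versions this helper comes from commons-io). A sketch of the same wrapping idea, as a hypothetical helper rather than the library's implementation:

import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper mirroring the idea: run an IO operation and convert its
// checked IOException into an unchecked one, so callers such as Iterator#next
// don't have to declare it
class UncheckSketch {
    interface IoSupplier<T> {
        T get() throws IOException;
    }

    static <T> T get(IoSupplier<T> supplier) {
        try {
            return supplier.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}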

nextRecord

CSVParser defines a concept called Token, which can be understood as one greedy fetch: grab a chunk of text until a CSV delimiter is reached. If no delimiter shows up, the fetch stops at the end of the line, i.e. the char '\n' (int 10), and that is how record-by-record reading is achieved.
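
The idea can be sketched as "read characters until a delimiter or a line ending appears". A simplified illustration (the real Lexer also handles quoting, CR/LF pairs, escapes and comments):

// Simplified illustration of one "token fetch": read characters into the buffer
// until a delimiter (e.g. ','), a line feed '\n' (int 10) or EOF (-1) is hit.
// Returns true if more tokens follow on the same record.
static boolean fetchToken(Reader in, StringBuilder buffer, char delimiter) throws IOException {
    int c;
    while ((c = in.read()) != -1) {
        if (c == delimiter) {
            return true;   // delimiter hit: the record continues
        }
        if (c == '\n') {
            return false;  // end of record reached
        }
        buffer.append((char) c);
    }
    return false;          // end of file
}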

Token defines several types:

enum Type {
    // Unusable; an empty token has this type
    INVALID,
    // A regular token, meaning the record has not been fully read yet
    TOKEN,
    // End-of-file marker
    EOF,
    // End-of-record (end-of-line) marker
    EORECORD,
    // Literally a comment; I have not run into this type yet
    COMMENT
}

Now look at the method itself:

CSVRecord nextRecord() throws IOException {
    ……
    final long startCharPosition = lexer.getCharacterPosition() + characterOffset;

Here the starting character position (the cursor) is captured first.

Then comes a do-while loop that fetches one token at a time and checks its type flag:

do {
    reusableToken.reset();
    lexer.nextToken(reusableToken);
    ……
} while (reusableToken.type == TOKEN);

The logic for fetching a single token lives in Lexer#nextToken (link to be added). After it runs, we have fetched as large a chunk of content as possible.

    switch (reusableToken.type) {
    case TOKEN:
        addRecordValue(false);
        break;
    case EORECORD:
        addRecordValue(true);
        break;
    case EOF:
        if (reusableToken.isReady) {
            addRecordValue(true);
        } else if (sb != null) {
            trailerComment = sb.toString();
        }
        break;
    case INVALID:
        throw new IOException("(line " + getCurrentLineNumber() + ") invalid parse sequence");
    case COMMENT: // Ignored currently
        if (sb == null) { // first comment for this record
            sb = new StringBuilder();
        } else {
            sb.append(Constants.LF);
        }
        sb.append(reusableToken.content);
        reusableToken.type = TOKEN; // Read another token
        break;
    default:
        throw new IllegalStateException("Unexpected Token type: " + reusableToken.type);
    }

Then a switch decides which kind of token it is. TOKEN and EORECORD both mean there is data, so addRecordValue is called to capture the value.

if (!recordList.isEmpty()) {
    recordNumber++;
    final String comment = sb == null ? null : sb.toString();
    result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY), comment,
        recordNumber, startCharPosition);
}

After the loop, the result is wrapped into a CSVRecord.

Putting this together, the logic for actually reading a line must be encapsulated in the Lexer.

Lexer

It carries the actual reading logic.

Member variables

private final ExtendedBufferedReader reader;
// Config option: whether empty lines are ignored
private final boolean ignoreEmptyLines;

The Lexer wraps an ExtendedBufferedReader:

final class ExtendedBufferedReader extends BufferedReader {
    /** The last char returned */
    private int lastChar = UNDEFINED;
    /** The count of EOLs (CR/LF/CRLF) seen so far */
    private long eolCounter;
    /** The position, which is the number of characters read so far */
    private long position;
    private boolean closed;

This reader extends BufferedReader with a few bookkeeping fields, giving it the ability to read character by character while tracking its position.
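
The pattern is easy to reproduce: override read() and keep the counters up to date. A simplified sketch of a position-tracking reader (my own illustration, not the actual ExtendedBufferedReader source):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Simplified sketch: a BufferedReader that remembers the last char returned,
// how many characters have been read, and how many line feeds have been seen
class PositionTrackingReader extends BufferedReader {
    private int lastChar = -2;   // sentinel meaning "nothing read yet"
    private long position;       // characters read so far
    private long eolCounter;     // line endings seen so far

    PositionTrackingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == '\n') {
            eolCounter++;
        }
        if (c != -1) {
            position++;
        }
        lastChar = c;
        return c;
    }

    int getLastChar() {
        return lastChar;
    }

    long getPosition() {
        return position;
    }
}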

nextRecord
