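A CSV parsing toolkit based on NIO FileSystem
Compared with a traditional FileReader, it offers two ways to read a CSV file: row by row, or loading everything into memory. The row-by-row approach is similar in spirit to how RandomAccessFile works.
Usage examples
Reading row by row: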
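public static void readCsvForLines() throws IOException {
    CSVFormat defaultFormat = CSVFormat.DEFAULT;
    // The backslash before "test.csv" must be escaped; a bare "\t" is a tab character.
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv");
    int resultCount = 0;
    // try-with-resources closes the parser and the underlying file
    try (CSVParser parser = CSVParser.parse(file, Charset.defaultCharset(), defaultFormat)) {
        for (CSVRecord record : parser) {
            String firstColumn = record.get(0); // touch the first column of each record
            resultCount += 1;
        }
    }
    System.out.println("touch result: " + resultCount);
}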
Loading into memory:
public static void readCsv() throws IOException {
    CSVFormat defaultFormat = CSVFormat.DEFAULT;
    // The backslash before "test.csv" must be escaped; a bare "\t" is a tab character.
    File file = new File("C:\\Users\\me\\Desktop\\temp\\test.csv");
    // try-with-resources closes the parser and the underlying file
    try (CSVParser parser = CSVParser.parse(file, Charset.defaultCharset(), defaultFormat)) {
        List<CSVRecord> records = parser.getRecords();
        System.out.println("touch result: " + records.size());
    }
}
Reading the source
CSVParser
Constructor
Following the static method used in the examples eventually leads to this constructor:
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber)
throws IOException {
Objects.requireNonNull(reader, "reader");
Objects.requireNonNull(format, "format");
this.format = format.copy();
this.lexer = new Lexer(format, new ExtendedBufferedReader(reader));
this.csvRecordIterator = new CSVRecordIterator();
this.headers = createHeaders();
this.characterOffset = characterOffset;
this.recordNumber = recordNumber - 1;
}
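Here a Lexer instance is constructed from the CSVFormat; it performs the lexical analysis of the CSV input.
The other thing to note is the CSVRecordIterator instance.
getRecords
The method that loads all records into memory: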
public List<CSVRecord> getRecords() {
return stream().collect(Collectors.toList());
}
The core logic is in CSVParser#stream:
public Stream<CSVRecord> stream() {
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(iterator(), Spliterator.ORDERED), false);
}
Next, look at CSVParser#iterator:
public Iterator<CSVRecord> iterator() {
return csvRecordIterator;
}
So the iterator is simply wrapped with StreamSupport#stream and returned. The iterator itself is created in the constructor shown earlier:
this.csvRecordIterator = new CSVRecordIterator();
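To make the wrapping step concrete, here is a minimal JDK-only sketch of the same pattern. The class name IteratorStreamSketch and the helper drain are made up for illustration; only the StreamSupport and Spliterators calls are the ones quoted above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical demo class, not part of the CSV library.
final class IteratorStreamSketch {

    // Wraps an Iterator in a sequential Stream and collects it into a List,
    // following the same stream()/getRecords() shape quoted above.
    static <T> List<T> drain(Iterator<T> iterator) {
        return StreamSupport
                .stream(Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Iterator<String> rows = Arrays.asList("a,b", "c,d", "e,f").iterator();
        System.out.println(drain(rows));     // all three rows, in order
        System.out.println(rows.hasNext());  // the iterator has been consumed
    }
}
```

Because the stream is backed by a single iterator instance, collecting it drains that iterator. In the version quoted here iterator() always returns the same csvRecordIterator, so it seems reasonable to expect that a second getRecords call on the same parser would find nothing left to read.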
CSVRecordIterator
An inner class defined in CSVParser that implements Iterator<CSVRecord>:
final class CSVRecordIterator implements Iterator<CSVRecord> {
private CSVRecord current;
It holds a CSVRecord field, current, which stores the content of one CSV row.
hasNext/next
The iterator implements hasNext and next. In this scenario, where is hasNext actually called?
The answer is again in CSVParser#getRecords:
public List<CSVRecord> getRecords() {
return stream().collect(Collectors.toList());
}
This takes us into the JDK's own Stream implementation. ReferencePipeline#collect checks whether the stream is parallel; taking the sequential case as an example:
else {
container = evaluate(ReduceOps.makeRef(collector));
}
Drilling further down into AbstractPipeline#evaluate and the methods it calls:
// ---------------------AbstractPipeline#evaluate----------------------------------
return isParallel()
? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
: terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
// ------------------------------java.util.stream.ReduceOps.ReduceOp#evaluateSequential---------------
public <P_IN> R evaluateSequential(PipelineHelper<T> helper,
Spliterator<P_IN> spliterator) {
return helper.wrapAndCopyInto(makeSink(), spliterator).get();
}
// -------------------------java.util.stream.AbstractPipeline#wrapAndCopyInto----------------------------
final <P_IN, S extends Sink<E_OUT>> S wrapAndCopyInto(S sink, Spliterator<P_IN> spliterator) {
copyInto(wrapSink(Objects.requireNonNull(sink)), spliterator);
return sink;
}
// -------------------------java.util.stream.AbstractPipeline#copyInto------------------------------------
final <P_IN> void copyInto(Sink<P_IN> wrappedSink, Spliterator<P_IN> spliterator) {
Objects.requireNonNull(wrappedSink);
if (!StreamOpFlag.SHORT_CIRCUIT.isKnown(getStreamAndOpFlags())) {
wrappedSink.begin(spliterator.getExactSizeIfKnown());
spliterator.forEachRemaining(wrappedSink);
wrappedSink.end();
}
……
}
// ------------------java.util.Spliterators.IteratorSpliterator#forEachRemaining-----------------------
public void forEachRemaining(Consumer<? super T> action) {
……
i.forEachRemaining(action);
}
// ------------------java.util.Iterator#forEachRemaining-------------------------------------------
default void forEachRemaining(Consumer<? super E> action) {
Objects.requireNonNull(action);
while (hasNext())
action.accept(next());
}
OK, here we finally see where Iterator#hasNext and Iterator#next are called. The hasNext implementation in CSVRecordIterator:
public boolean hasNext() {
if (CSVParser.this.isClosed()) {
return false;
}
if (current == null) {
current = getNextRecord();
}
return current != null;
}
hasNext uses current as a lookahead slot: if the slot is empty, it calls getNextRecord to fill it, then reports whether a record was obtained.
public CSVRecord next() {
if (CSVParser.this.isClosed()) {
throw new NoSuchElementException("CSVParser has been closed");
}
CSVRecord next = current;
current = null;
if (next == null) {
// hasNext() wasn't called before
next = getNextRecord();
if (next == null) {
throw new NoSuchElementException("No more CSV records available");
}
}
return next;
}
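next follows the same logic, with getNextRecord at its core. current always holds either the next record or null. next takes whatever current holds and clears the slot; only if that value was null (hasNext was not called first) does it call getNextRecord itself.
The benefit is that when the caller follows the usual hasNext -> next sequence, getNextRecord runs once per record instead of twice.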
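The lookahead pattern can be tried out in isolation. Below is a minimal sketch that uses an in-memory queue as a stand-in for the lexer; LookaheadIteratorSketch, fetchNext and fetchCount are hypothetical names introduced here, with fetchNext playing the role of getNextRecord.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;

// Hypothetical demo class mirroring the hasNext/next structure quoted above.
final class LookaheadIteratorSketch implements Iterator<String> {
    private final Queue<String> source; // stand-in for the lexer-backed record source
    private String current;             // record cached by hasNext(), or null
    int fetchCount;                     // how many times fetchNext() has run

    LookaheadIteratorSketch(String... rows) {
        this.source = new ArrayDeque<>(Arrays.asList(rows));
    }

    // Plays the role of getNextRecord(): returns null when the source is exhausted.
    private String fetchNext() {
        fetchCount++;
        return source.poll();
    }

    @Override
    public boolean hasNext() {
        if (current == null) {
            current = fetchNext();
        }
        return current != null;
    }

    @Override
    public String next() {
        String next = current;
        current = null;
        if (next == null) {
            // hasNext() was not called before
            next = fetchNext();
            if (next == null) {
                throw new NoSuchElementException("No more rows available");
            }
        }
        return next;
    }

    public static void main(String[] args) {
        LookaheadIteratorSketch it = new LookaheadIteratorSketch("r1", "r2");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        System.out.println("fetches: " + it.fetchCount);
    }
}
```

For two rows the loop performs three fetches: one per row, plus the final probe that discovers the end of the data. Without the cached slot, each hasNext/next pair would need its own fetch.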
getNextRecord
private CSVRecord getNextRecord() {
return Uncheck.get(CSVParser.this::nextRecord);
}
nextRecord
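CSVParser relies on a concept called Token, which can be thought of as one greedy read: consume text until a CSV delimiter is reached. If there is no delimiter, the read runs to the end of the line, which is the char '\n' (int 10). This is how row-by-row reading is achieved.
Token defines several types: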
enum Type {
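// Invalid; an empty token has this type
INVALID,
// A regular token; the row has not been fully read yet
TOKEN,
// End of file reached
EOF,
// End of record (end of row) reached
EORECORD,
// A comment, judging by the name; I have not run into one so far
COMMENT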
}
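To illustrate the greedy-read idea, here is a heavily simplified lexer sketch. It is an assumption-laden toy, not the library's Lexer: it only knows a comma delimiter and '\n' as the row terminator, ignores quoting, escapes, comments and '\r', and MiniLexerSketch, nextToken and tokenize are names invented for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy lexer: comma delimiter, '\n' terminator, no quoting or escapes.
final class MiniLexerSketch {
    enum Type { TOKEN, EORECORD, EOF }

    static final class Token {
        Type type;
        final StringBuilder content = new StringBuilder();
    }

    private final Reader reader;

    MiniLexerSketch(Reader reader) {
        this.reader = reader;
    }

    // Consumes characters until a delimiter, a line feed, or the end of input.
    Token nextToken() throws IOException {
        Token token = new Token();
        while (true) {
            int c = reader.read();
            if (c == ',') { token.type = Type.TOKEN; return token; }
            if (c == '\n') { token.type = Type.EORECORD; return token; }
            if (c == -1) { token.type = Type.EOF; return token; }
            token.content.append((char) c);
        }
    }

    // Demo helper: renders every token as "TYPE:content", up to and including EOF.
    static List<String> tokenize(String csv) {
        MiniLexerSketch lexer = new MiniLexerSketch(new StringReader(csv));
        List<String> out = new ArrayList<>();
        try {
            Token token;
            do {
                token = lexer.nextToken();
                out.add(token.type + ":" + token.content);
            } while (token.type != Type.EOF);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a,b\nc,d\n"));
    }
}
```

A value followed by a comma comes back as TOKEN, the last value of a row comes back as EORECORD, and whatever remains when the input runs out comes back as EOF.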
Now for the method itself:
CSVRecord nextRecord() throws IOException {
……
final long startCharPosition = lexer.getCharacterPosition() + characterOffset;
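This first captures the current character position.
Then comes a do-while loop that fetches one token per iteration and checks its type: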
do {
reusableToken.reset();
lexer.nextToken(reusableToken);
……
} while (reusableToken.type == TOKEN);
The logic for fetching one token lives in lexer#nextToken. After the call, we have greedily consumed one chunk of content.
switch (reusableToken.type) {
case TOKEN:
addRecordValue(false);
break;
case EORECORD:
addRecordValue(true);
break;
case EOF:
if (reusableToken.isReady) {
addRecordValue(true);
} else if (sb != null) {
trailerComment = sb.toString();
}
break;
case INVALID:
throw new IOException("(line " + getCurrentLineNumber() + ") invalid parse sequence");
case COMMENT: // Ignored currently
if (sb == null) { // first comment for this record
sb = new StringBuilder();
} else {
sb.append(Constants.LF);
}
sb.append(reusableToken.content);
reusableToken.type = TOKEN; // Read another token
break;
default:
throw new IllegalStateException("Unexpected Token type: " + reusableToken.type);
}
Then a switch checks which type the token is. TOKEN and EORECORD both carry data, so addRecordValue is called to store the value.
if (!recordList.isEmpty()) {
recordNumber++;
final String comment = sb == null ? null : sb.toString();
result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY), comment,
recordNumber, startCharPosition);
}
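After the loop, the collected values are wrapped into a CSVRecord and assigned to result.
From this logic we can tell that the actual reading of a row is encapsulated in Lexer.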
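The do-while plus switch structure can be sketched on its own, fed from a pre-built token list instead of a real lexer. Everything here is hypothetical naming (RecordAssemblySketch, nextRecord, readAll), and the EOF branch uses "content is non-empty" as a simplified stand-in for the isReady flag in the quoted code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of assembling one record per call from a token sequence.
final class RecordAssemblySketch {
    enum Type { TOKEN, EORECORD, EOF }

    static final class Token {
        final Type type;
        final String content;

        Token(Type type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    // Pulls tokens until the record ends; returns null when nothing was collected.
    static List<String> nextRecord(Iterator<Token> tokens) {
        if (!tokens.hasNext()) {
            return null;
        }
        List<String> values = new ArrayList<>();
        Token token;
        do {
            token = tokens.next();
            switch (token.type) {
                case TOKEN:
                case EORECORD:
                    values.add(token.content);
                    break;
                case EOF:
                    // Simplification: keep a trailing value only if it has content.
                    if (!token.content.isEmpty()) {
                        values.add(token.content);
                    }
                    break;
            }
        } while (token.type == Type.TOKEN); // keep going while the row is unfinished
        return values.isEmpty() ? null : values;
    }

    // Demo helper: calls nextRecord until it reports that no record is left.
    static List<List<String>> readAll(List<Token> tokenList) {
        Iterator<Token> tokens = tokenList.iterator();
        List<List<String>> records = new ArrayList<>();
        List<String> record;
        while ((record = nextRecord(tokens)) != null) {
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token(Type.TOKEN, "a"), new Token(Type.EORECORD, "b"),
                new Token(Type.TOKEN, "c"), new Token(Type.EOF, "d"));
        System.out.println(readAll(tokens));
    }
}
```

The loop condition is the part worth noticing: only a TOKEN keeps the loop going, so EORECORD and EOF both end the current record. The sketch assumes the token list ends with an EORECORD or EOF token.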
Lexer
Responsible for the reading logic.
Member variables
private final ExtendedBufferedReader reader;
// Configuration option: whether to ignore empty lines
private final boolean ignoreEmptyLines;
Lexer wraps an ExtendedBufferedReader:
final class ExtendedBufferedReader extends BufferedReader {
/** The last char returned */
private int lastChar = UNDEFINED;
/** The count of EOLs (CR/LF/CRLF) seen so far */
private long eolCounter;
/** The position, which is the number of characters read so far */
private long position;
private boolean closed;
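This reader extends BufferedReader with a few bookkeeping fields (last character, end-of-line count, position), which lets it read character by character while tracking where it is in the input.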
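As a rough illustration of how such bookkeeping could work, here is a minimal sketch that overrides only the single-character read(). It is not the library's implementation: CountingReaderSketch and scan are hypothetical names, the end-of-line rule is a simplified guess (CR, LF and CRLF each count once), and the other read methods are deliberately left untracked.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: a BufferedReader that tracks state in read() only.
final class CountingReaderSketch extends BufferedReader {
    private static final int UNDEFINED = -2;

    private int lastChar = UNDEFINED; // the last char returned by read()
    private long eolCounter;          // CR, LF and CRLF each count once
    private long position;            // number of characters read so far

    CountingReaderSketch(Reader reader) {
        super(reader);
    }

    @Override
    public int read() throws IOException {
        int current = super.read();
        // Count a CR immediately; count an LF only if it does not complete a CRLF pair.
        if (current == '\r' || (current == '\n' && lastChar != '\r')) {
            eolCounter++;
        }
        lastChar = current;
        if (current != -1) {
            position++;
        }
        return current;
    }

    // Demo helper: reads the whole text and returns {position, eolCounter}.
    static long[] scan(String text) {
        try (CountingReaderSketch reader = new CountingReaderSketch(new StringReader(text))) {
            while (reader.read() != -1) {
                // drain the input one character at a time
            }
            return new long[] { reader.position, reader.eolCounter };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] stats = scan("a,b\r\nc,d\n");
        System.out.println("chars=" + stats[0] + ", eols=" + stats[1]);
    }
}
```

Remembering lastChar is what allows a CRLF pair to be counted as a single line ending, which is presumably why the quoted class keeps that field.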