History

hmz007 36ed224bac Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a) Signed-off-by: hmz007 <hmz007@gmail.com> Change-Id: Ib87fdbe7a0be7dc961af3500fe9ea4589a127f9f		1 year ago
..
assets	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
scenarios	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
tools/identify_license	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
README.md	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
classifier.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
classifier_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
diff.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
diff_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
document.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
document_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
frequencies.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
frequencies_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
go.mod	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
go.sum	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
scoring.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
scoring_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
searchset.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
searchset_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
tokenizer.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
tokenizer_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
trace.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago
trace_test.go	Rockchip Anroid14_SDK 20240628-rkr5 (2556df1a)	1 year ago

README.md

License Classifier v2

This is a substantial revision of the license classifier with a focus on improved accuracy and performance.

Glossary

corpus dictionary - contains all the unique tokens stored in the corpus of documents to match. Any tokens in the target document that aren't in the corpus dictionary are mapped to an invalid value.
document - an internal-only data type that contains sequenced token information for a source or target content for matching.
source content - a body of text that can be matched by the scanner.
target content - the argument to Match that is scanned for matches with source content.
indexed document - an internal-only data type that maps a document to the corpus dictionary, resulting in a compressed representation suitable for fast text searching and mapping operations. an indexed document is necessarily tightly coupled to its corpus.
frequency table - a lookup table holding per-token counts of the number of times a token appears in content. used for fast filtering of target content against different source contents.
q-gram - a substring of content of length q tokens used to efficiently match ranges of text. For background on the q-gram algorithms used, please see Indexing Methods for Approximate String Matching
searchset - a data structure that uses q-grams to identify ranges of text in the target that correspond to a range of text in the source. The searchset algorithms compensate for the allowable error in matching text exactly, dealing with additional or missing tokens.

Migrating from v1

The API for the classifier versions is quite similar, but there are two key distinctions to be aware of while migrating usages.

The confidence value for the v2 classifier is applied uniformly to results; it will never return a match that is lower confidence than the threshold. In v1, MultipleMatch behaved this way, but NearestMatch would return a value regardless of the confidence match. Users often verified that the confidence was above the threshold, but this is no longer necessary.

The second change is that the classifier now returns all matches against the supplied corpus. The v1 classifier allowed filtering on header matches via a boolean field. This can be emulated by creating a license classifier with a reduced corpus if matching against headers is not desired. Alternatively, the user can use the MatchType field in the Match struct to filter out unwanted matches.