Inverted index

Emmanuel Bernard

What are we here for

Discuss inverted index technology and fundamentals

You can read the speaker notes by pressing s

Emmanuel Bernard

Consulting Software Engineer, Red Hat
Chief Architect
Open Source coder:

  • Hibernate {ORM,OGM,Search,Validator}

  • Debezium, Infinispan, …​

Agenda

Why
B-tree vs inverted index
Building & using an inverted index
Indexing
Querying
Scoring
Physical representation

Why do we need inverted indexes

Where is it used

Classical:

  • Google, DuckDuckGo

  • Mobile phone search

  • IDE auto-completion

Less classical:

  • geolocation

  • suggestion

  • faceting

  • more like this / classification

  • machine learning

Let’s try it on RDBMSes

SELECT * FROM Speakers s WHERE s.bio = 'Open Source'

How is the index built?

Zooming in on B-trees

Self-balancing tree data structure
Search, insert and delete in O(log n)
Good when access time >> process time

For text, replace the numbers in the example with letters.

[figure: B-tree, base state]

Splitting leaf nodes

[figures: B-tree, base state → state 1]

Simple addition

[figures: B-tree, state 1 → state 2]

Splitting nodes

[figures: B-tree, state 2 → state 3]

Filling up a leaf node

[figures: B-tree, state 3 → state 4]

Splitting nodes and filling up the root

[figures: B-tree, state 4 → state 5]

Filling up leaves

[figures: B-tree, state 5 → state 6]

Adding one level

[figures: B-tree, state 6 → state 7]

B+-tree

Only keys in the non-leaf nodes
Leaf nodes linked to one another for efficient ascending reads
Data can be just a pointer to the real data

XKCD: Tree


Not only is that terrible in general, but you just KNOW Billy’s going to open the root present first, and then everyone will have to wait while the heap is rebuilt.

Back to our (B-Tree) RDBMS vs inverted indices

With LIKE

SELECT * FROM Speakers s WHERE s.bio LIKE 'Open Source%'

With LIKE, more text can follow the prefix
Still uses the index

With LIKE in the middle of the column

SELECT * FROM Speakers s WHERE s.bio LIKE '%Open Source%'

Find a word anywhere in the text

Table or index scan :(

What about uppercase, typos etc

SELECT * FROM Speakers s WHERE
    s.bio LIKE '%open source%'
    OR s.bio LIKE '%Open Source%'
    OR s.bio LIKE '%opan surce%'

Can’t anticipate the casing
Can’t anticipate all typos

What about word ordering and priority

SELECT * FROM Speakers s WHERE
    s.bio LIKE '%source open%'
    OR s.bio LIKE '%source%'
    OR s.bio LIKE '%open%'
    ORDER BY best??

Words could be in any order
I want the most interesting result first

Caveat on RDBMSes

Some have powerful indexing techniques
Some even offer full-text features

Tend to have less flexibility than dedicated inverted index tools

Building & using an inverted index

Inverted index to the rescue

Let’s not index column values but words
Let’s not query values but words

At indexing time

doc1: I am your father Luke
doc2: Yes he is your father
doc3: I am gonna make him an offer he can not refuse.
doc4: I love the smell of napalm in the morning.
doc5: One morning I shot an elephant in my pajamas. How he got in my pajamas, I do not know.

word       documents
am         1, 3
an         3, 5
can        3
do         5
elephant   5
father     1, 2
gonna      3
got        5
he         2, 3, 5
him        3
how        5
i          1, 3, 4, 5
in         4, 5
is         2
know       5
love       4
luke       1
make       3
morning    4, 5
my         5
napalm     4
not        3, 5
of         4
offer      3
one        5
pajamas    5
refuse     3
shot       5
smell      4
the        4
yes        2
your       1, 2
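
Below is a minimal, hypothetical sketch of this structure in plain Java (the InvertedIndex class and its naive analysis are made up for illustration; a real engine does much more, as the rest of this deck shows):

import java.util.*;

public class InvertedIndex {
    // word -> sorted set of document ids (the posting list)
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    public void add(int docId, String text) {
        // naive analysis: lowercase, split on anything that is not a letter
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySortedSet());
    }
}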

At query time

query: father napalm
Apply the same word splitting logic
Matching documents: 1, 2 and 4

word      documents
father    1, 2
napalm    4
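
Continuing the hypothetical InvertedIndex sketch from the indexing slide, querying is just a lookup after applying the same analysis:

InvertedIndex index = new InvertedIndex();
index.add(1, "I am your father Luke");
index.add(2, "Yes he is your father");
index.add(4, "I love the smell of napalm in the morning.");

index.search("father");   // -> [1, 2]
index.search("napalm");   // -> [4]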

Indexing details

Transforming sentences into words

Analyzers

  1. pre-tokenization

  2. tokenization

  3. filter

Apply the same logic to both document and query content
Each token becomes an entry in the inverted index, pointing to documents
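
As a concrete illustration, here is a hedged example of running text through a Lucene analyzer (assuming lucene-core on the classpath; StandardAnalyzer tokenizes on word boundaries and lowercases):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("bio", "I am your father Luke")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString()); // i, am, your, father, luke
            }
            stream.end();
        }
    }
}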

Pre-tokenization

Remove unnecessary characters
e.g. remove HTML tags

<p>This is <strong>awesome</strong>.</p>
This is awesome.

Tokenization

Split sentences into words, called tokens
Split at spaces, dots and other punctuation marks (with exceptions)

aujourd’hui, A.B.C., and many other rules

One tokenizer per language, but many languages are similar

Continuous scripting

Didyouknowwritingtextsinwordsseparatedbyspaceisnotthatold
itstartedinthemiddleage
Itwasnotaproblemaspeoplewerereadingoutloudwrittentext
Infactsplittingwordswasaninventionnecessary
becausemonksshouldremainsilentandlatinwasnolongertheirnativetongue

Filtering: where the magic happens

Operate on the stream of tokens
Change, remove or even add tokens

lowercase, stopwords

Sentence: This is AWESOME Peter!
Tokens: |This|is|AWESOME|Peter|
stopwords: |AWESOME|Peter|
lowercase: |awesome|peter|
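
A plain-Java sketch of that filter chain (the stopword set here is made up for the example):

import java.util.List;
import java.util.Set;

public class FilterChainDemo {
    public static void main(String[] args) {
        List<String> tokens = List.of("This", "is", "AWESOME", "Peter");
        Set<String> stopwords = Set.of("this", "is", "a", "the");

        List<String> result = tokens.stream()
                .filter(t -> !stopwords.contains(t.toLowerCase())) // stopword filter
                .map(String::toLowerCase)                          // lowercase filter
                .toList();

        System.out.println(result); // [awesome, peter]
    }
}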

Solving various problems with filters

Synonyms

When the text mentions a "car" but the search is about an "automobile" or a "vehicle"
We need a synonym dictionary

Synonym solution

  1. Put all synonyms in the index for each word

  2. Use a reference synonym ("automobile" for "car", "compact", "auto", "S.U.V."…​)

  3. Index normally, use synonyms when building the query
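
A hypothetical sketch of option 3, expanding the query with synonyms at query time (the dictionary below is invented for the example):

import java.util.*;

public class SynonymExpansion {
    static final Map<String, List<String>> SYNONYMS =
            Map.of("car", List.of("automobile", "auto", "vehicle"));

    static List<String> expand(List<String> queryTokens) {
        List<String> expanded = new ArrayList<>();
        for (String token : queryTokens) {
            expanded.add(token);
            expanded.addAll(SYNONYMS.getOrDefault(token, List.of()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("red", "car")));
        // [red, car, automobile, auto, vehicle]
    }
}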

Words from the same family

education, educates, educated, …​
That would make for lots of synonyms…​
Let’s use a stemming algorithm

An algorithm that mimics the language’s logic (and its exceptions)

Porter stemming algorithm
Snowball grammar
French algorithm explained

Index/query the stem when the word is found

word           stem
main           main
mains          main
maintenaient   mainten
maintenait     mainten
maintenant     mainten
maintenir      mainten
maintenue      mainten
maintien       maintien
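
A hedged example using the Snowball French stemmer that ships with Lucene’s analysis module (package org.tartarus.snowball.ext):

import org.tartarus.snowball.ext.FrenchStemmer;

public class StemDemo {
    public static void main(String[] args) {
        FrenchStemmer stemmer = new FrenchStemmer();
        for (String word : new String[]{"mains", "maintenant", "maintenir", "maintien"}) {
            stemmer.setCurrent(word);   // feed the word in
            stemmer.stem();             // run the Snowball grammar
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}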

Finding words with typos

People make mistakes
In the text or in the query

They make thaipo and other mystakes

Phonetic algorithm

Same logic as stemming: convert a word into a phonetic approximation
Soundex, RefinedSoundex, Metaphone, DoubleMetaphone
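
A hedged example with Apache commons-codec’s phonetic encoders; the exact codes depend on the algorithm, the point being that both spellings collapse to the same key:

import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Soundex;

public class PhoneticDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        DoubleMetaphone metaphone = new DoubleMetaphone();
        // both spellings should produce the same phonetic key
        System.out.println(soundex.soundex("mistake") + " = " + soundex.soundex("mystake"));
        System.out.println(metaphone.doubleMetaphone("mistake") + " = " + metaphone.doubleMetaphone("mystake"));
    }
}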

n-gram

Split a word into a sliding window of n characters
Index each n-gram

// building 3-grams
mystake: mys yst sta tak ake
mistake: mis ist sta tak ake

Low n means more false positives
High n means less forgiving
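
A minimal sketch of n-gram generation in plain Java:

import java.util.ArrayList;
import java.util.List;

public class NGrams {
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n)); // sliding window of n chars
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("mystake", 3)); // [mys, yst, sta, tak, ake]
    }
}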

Fuzzy search in practice

Compute the distance between the word and all the words in the index
or
Compute a distance state machine for the word
Use it to check specific terms in the index

  • state n^e: n consumed chars, e errors

  • horizontal transition: unmodified char

  • * vertical: addition

  • * diagonal: substitution

  • ε diagonal: deletion

[figure: Levenshtein NFA for "food"]
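
For reference, a sketch of the brute-force variant: the plain dynamic-programming edit distance (the automaton in the figure encodes the same accept/reject decision so that only candidate terms in the index are visited):

public class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(
                        curr[j - 1] + 1,      // addition
                        prev[j] + 1),         // deletion
                        prev[j - 1] + cost);  // substitution (or match)
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("mistake", "mystake")); // 1
    }
}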

You can index the same data in different ways

Apply different indexing approaches to the same data

Querying time

It’s term queries all the way down!
All queries (synonyms, phonetic, n-gram, fuzzy) boil down to a (set of) term queries

Possible queries

Term, wildcard, prefix, fuzzy, phrase, range, boolean, all, spatial, more like this, spell checking

PhraseQuery vs shingles

Find exact sentences
or find words near one another (sloppiness)

"Laurel and Hardy"

PhraseQuery uses positional information

Shingles use n-grams too, but over tokens instead of characters

2-shingles: |Laurel and|and Hardy|

Scoring

[figure: xkcd comic on scoring]

Found results but in random order…​

We want the most relevant results first
Relevance is relative
Several approaches, none perfect

Main levers of a scoring formula

Term frequency

How often does the term appear in this document?
More is better

Inverse document frequency

How often does the term appear in all documents in the collection?
Common words are less important

Field-length norm

How long is the field?
Long documents would be favored otherwise

Coordination factor

If a document matches more of the query terms, it’s a better fit.

TF/IDF: the full formulas

\$"score"(q,d) = "queryNorm"(q) * "coord"(q,d) * sum_(t in q) ( tf(t in d) * idf(t)^2 * "t.boost" * "norm"(d) )\$
\$"queryNorm"(q) = 1/sqrt(sum_(t in q) (idf(t)^2))\$
\$"coord"(q,d) = ("nbrOfmatchingTerm"(q in d))/("nbrOfTerms"(q))\$
\$tf(t in d) = sqrt(nbrOfTermAppearance(t in d))\$
\$idf(t) = 1 + log ( "numDocs" / ("numDocs"(t in d) + 1))\$
\$"norm"(d) = 1/sqrt( "nbrOfTerms"(d) )\$

Lucene scoring based on

Boolean model
Vector space model
Term Frequency / Inverse Document Frequency

Other scoring

Boosting fields
Positional (phrase query) or similarity (fuzzy) information
Feedback function (external or internal)

Okapi BM25
Your custom scoring function (or a tweak of an existing one)

Inverted index physical representation

A Lucene example

What is Lucene

Search engine library
Used by many, including

  • Solr

  • Elasticsearch

  • Hibernate Search

B-tree’s problems

When you need write throughput
B-trees require lots of in-place updates

Sequential reads are much faster than random reads

  • on disk

  • kinda also in memory (L1-3 caches)

Append logs

Append operations in a file
Reading requires scanning the whole log

Log-Structured Merge

Per batch of writes, create a file storing the sorted key/value pairs
On read, check for the key in each file
Regularly merge files together (e.g. make bigger files)

[figure: Log-Structured Merge tree]

LSM characteristics

Immutable (lock-free) and file cache friendly
Fast on write, decent on read
Sequential read/write friendly
Read time degrades with the number of files ⇒ merge
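
A minimal, hypothetical in-memory sketch of the LSM idea (real implementations write segments to disk, of course):

import java.util.*;

public class TinyLsm {
    private final NavigableMap<String, String> memtable = new TreeMap<>();
    private final Deque<NavigableMap<String, String>> segments = new ArrayDeque<>(); // newest first
    private final int flushThreshold;

    public TinyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) flush();
    }

    private void flush() {
        segments.addFirst(new TreeMap<>(memtable)); // would be a sorted file on disk
        memtable.clear();
    }

    public String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (NavigableMap<String, String> segment : segments) { // newest first
            if (segment.containsKey(key)) return segment.get(key);
        }
        return null;
    }

    public void merge() { // compaction: collapse all segments into one
        NavigableMap<String, String> merged = new TreeMap<>();
        for (Iterator<NavigableMap<String, String>> it = segments.descendingIterator(); it.hasNext(); ) {
            merged.putAll(it.next()); // oldest first, so newer values win
        }
        segments.clear();
        segments.addFirst(merged);
    }
}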

Lots of ways to improve them

Page index in memory
Bloom filter
Level-based compaction

[figure: level-based compaction for an LSM tree]

Level-based compaction characteristics

Limits the number of files to read
Only one file per level needs to be consulted
Compaction moves data to higher levels
Within a level, files have non-overlapping key ranges

Lucene’s case

LSM
Everything is computed upfront
Each segment is a mini index
Data is denormalized differently depending on the access pattern

A segment (simplified)

  • term index (like a ToC for the dictionary)

  • term dictionary (points to posting list offset)

  • posting list (list of matching document ids per term)

  • stored field index (sparse doc id + offset)

  • stored field data (list of field values per document id)

  • deleted documents

Term index

The term index provides offsets into the term dictionary
Based on finite state transducers
Gives one ordinal per prefix

We know where to look in the term dictionary

[figure: FST for mop, moth, pop, star, stop and top]

mop=0
moth=1
pop=2
star=3
stop=4
top=5

Term dictionary

From a given offset (& prefix)
Sorted list of suffixes
For each, the frequency and an offset into the posting list

[prefix=top]
_null_, freq=27, offset=234
ography, freq=1, offset=298
ology, freq=6, offset=306
onyms, freq=1, offset=323

Posting list

List of document ids
Variable int encoding
Encoded as deltas of ids (small numbers, good for variable int encoding)

4,6,9,30,33,39,45 => 4,2,3,21,3,6,6

PForDelta encoding
Bigger in size but fewer CPU branch mispredictions
Also the pulsing optimization for terms with frequency=1
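
A hedged sketch of the delta + variable-int idea (7 data bits per byte, high bit meaning "more bytes follow", which is how Lucene’s VInt works):

import java.io.ByteArrayOutputStream;

public class PostingsEncoder {
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int delta = docId - previous; // deltas stay small
            previous = docId;
            while ((delta & ~0x7F) != 0) {        // more than 7 bits remaining
                out.write((delta & 0x7F) | 0x80); // low 7 bits + continuation bit
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // deltas 4,2,3,21,3,6,6 each fit in a single byte here
        System.out.println(encode(new int[]{4, 6, 9, 30, 33, 39, 45}).length); // 7
    }
}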

Stored fields

Stored field index kept in memory: doc id + offset for every 16 KB of data
Stored values written as 16 KB blocks and compressed

[figure: stored fields layout]

Deleted documents

You said segments are immutable
What about deleted documents?

Deleted document file

  1. 1 bit per doc

  2. sparse list of cleared docs
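
A hedged sketch of variant 1, one bit per document:

import java.util.BitSet;

public class DeletedDocs {
    public static void main(String[] args) {
        int maxDoc = 8;                  // documents 0..7 in this segment
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);             // all documents start live
        live.clear(3);                   // "delete" doc 3 without touching the segment
        System.out.println(live.get(3)); // false -> doc 3 is skipped at search time
    }
}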

Why oh why such a mess?

[figure: xkcd comic on Lisp]

2 disk seeks per field search (binary search)
1 disk seek per doc for stored fields

But things likely fit in file system cache

Warning: this is a simplified view :)

Subjects not covered

Uninverted index

Columnar storage
Called doc values
Used for aggregations, sorting, or faceting

Faceting

Geospatial queries

Term vector

And many more things

Thank you!

License

CC BY-SA

A couple of drawings are copyright of their respective author (linked in the references).

References