A Swiss Geek previously in Singapore, now in Portugal

Serverless Search with Lambda

You want to add search functionality to your application, but the storage you are currently using doesn’t offer an easy out-of-the-box way to do this.

You might be tempted to use solutions like Elasticsearch, Amazon OpenSearch Service or any other search infrastructure service, which is often the right choice.

But what if your needs don’t justify the heavy lifting and the cost of a full-blown search infrastructure? Why not use a serverless search solution?

I have had my eye on LunrJS for a while as a way to provide search capabilities on a static website. LunrJS is primarily made for browser integrations, but since it also works in Node, why not integrate it with a Lambda function and provide search as an API?

Pre-Build and share the indexes

Indexing the data is the most time-consuming operation. Instead of building the index on each search request, we pre-build it and store it in shared storage.

We will be using Amazon S3 for this. Amazon EFS would offer faster retrieval, but since the index is cached in memory, we only get penalized on the first load.

Memory Limits

The index is loaded into memory, but the result of a search query only returns the identifiers of the matching elements.

To return all the fields of the resulting documents, the documents need to be fetched from the source. They are also loaded into memory.

Lambda can have a maximum of 10 GB of memory. If the combined size of your index and documents exceeds that limit, this solution isn’t for you.

You could retrieve the documents from the storage each time using a stream, but this will come with a latency penalty.
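A minimal sketch of that streaming approach, assuming the documents are stored as newline-delimited JSON (one document per line) rather than as the single JSON array used later in this article, so that they can be scanned without being held in memory. The helper name `pickDocuments` and the NDJSON layout are assumptions made for illustration.

```typescript
import { createInterface } from 'readline'
import { Readable } from 'stream'

type Doc = Record<string, string>

// Scans a newline-delimited JSON stream and keeps only the documents whose
// reference field matches one of the refs returned by the search.
// Hypothetical helper: the NDJSON layout is an assumption, not this article's format.
const pickDocuments = async (
  source: Readable,
  refField: string,
  refs: Set<string>,
): Promise<Doc[]> => {
  const found: Doc[] = []
  const lines = createInterface({ input: source, crlfDelay: Infinity })
  for await (const line of lines) {
    if (line.trim() === '') continue
    const doc = JSON.parse(line) as Doc
    if (refs.has(doc[refField])) {
      found.push(doc)
      // Stop reading once every requested document has been found
      // (assumes reference values are unique)
      if (found.length === refs.size) break
    }
  }
  // Release the underlying stream if we stopped early
  source.destroy()
  return found
}
```

With the AWS SDK v3 in Node.js, the `Body` of a `GetObjectCommand` response is a `Readable`, so it could be passed in as `source`. Memory stays low, but every search pays for a read of the documents file.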

The Solution

The application is split into two independent parts: indexing and searching.

Indexing and search are separate functions

Full Re-Index

Creating an index is a full re-index, meaning that you need to scan your entire source, for example a DynamoDB table, and re-index all documents. This could become costly if the process is triggered on every change. You can mitigate this by running the indexing on a schedule, at the price of leaving your source data and search results out of sync for a period.

Partial Indexing

With the help of the lunr-mutable-indexes extension, we can listen to DynamoDB Streams and update our index for every row change without the need to re-index the whole dataset.

Indexes generated with lunr-mutable-indexes are slightly bigger, but are directly usable by lunr.
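As a rough sketch of how stream records could be mapped to index updates: the `MutableIndex` interface below is a hypothetical stand-in for the mutable index (the extension's actual API is not shown in this article), and the record types are trimmed-down local versions of the DynamoDB Streams event shape, limited to string and number attributes.

```typescript
// Trimmed-down view of a DynamoDB Streams record (string and number attributes only)
type AttributeValue = { S: string } | { N: string }
type Image = Record<string, AttributeValue>

interface StreamRecord {
  eventName: 'INSERT' | 'MODIFY' | 'REMOVE'
  dynamodb: { Keys: Image; NewImage?: Image }
}

type Doc = Record<string, string>

// Hypothetical interface standing in for the mutable index
interface MutableIndex {
  add(doc: Doc): void
  update(doc: Doc): void
  remove(doc: Doc): void
}

// Flattens a DynamoDB image into a plain document of strings
const toDoc = (image: Image): Doc => {
  const doc: Doc = {}
  Object.keys(image).forEach((name) => {
    const value = image[name]
    doc[name] = 'S' in value ? value.S : value.N
  })
  return doc
}

// Applies each row change to the index instead of re-indexing the whole dataset
const applyRecords = (index: MutableIndex, records: StreamRecord[]): void => {
  for (const record of records) {
    if (record.eventName === 'REMOVE') {
      // Only the key attributes are needed to identify the document
      index.remove(toDoc(record.dynamodb.Keys))
    } else if (record.dynamodb.NewImage) {
      const doc = toDoc(record.dynamodb.NewImage)
      if (record.eventName === 'INSERT') {
        index.add(doc)
      } else {
        index.update(doc)
      }
    }
  }
}
```

`NewImage` is only present when the stream view type includes new images, so the table's stream would need to be configured accordingly. The updated index would then be serialized and written back to the shared storage.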

Let’s build it

We will use the Serverless Framework (serverless.com) to build our application. The source code is available on GitHub.

The searchable data comes from CSV source files uploaded to a bucket.

S3 Bucket to store sources and indexes

Using CloudFormation, we provision a bucket and enable EventBridge notifications so that we can listen to events for newly uploaded files.

Type: AWS::S3::Bucket
Properties:
  BucketName: serverless-search-${aws:accountId}-${aws:region}
  NotificationConfiguration:
    EventBridgeConfiguration:
      EventBridgeEnabled: true

Building Index

The Lambda function is invoked when a new CSV file is uploaded to S3 under the source/ prefix. Memory and timeout are set to high values (5120 MB and 900 seconds, the Lambda maximum).

handler: src/handlers/indexBuild.handler
name: ${self:service}-index-build
memorySize: 5120
timeout: 900
environment:
  BUCKET:
    Ref: S3BucketSearch
events:
  - eventBridge:
      pattern:
        source:
          - aws.s3
        detail-type:
          - Object Created
        detail:
          bucket:
            name:
              - Ref: S3BucketSearch
          object:
            key:
              - prefix: 'source/'

The function reads the source CSV file, converts it into JSON, builds and stores the index.

import path from 'path'
import { Readable } from 'stream'
import {
  GetObjectCommand,
  PutObjectCommand,
  PutObjectCommandInput,
  S3Client,
} from '@aws-sdk/client-s3'
import type { S3ObjectCreatedNotificationEvent } from 'aws-lambda'
import csv from 'csvtojson'
import lunr from 'lunr'

// BUCKET is injected through the function's environment configuration
const bucket = process.env.BUCKET as string
const indexPrefix = 'indexes/'
// Created once, outside the handler, so it is re-used by warm invocations
const client = new S3Client({})

interface Store {
  bucket: string
  content: string
  filename: string
  fileType: 'idx' | 'doc'
}

export const handler = async (event: S3ObjectCreatedNotificationEvent): Promise<void> => {
  const key = event.detail.object.key

  // 'source/movies.csv' becomes 'movies'
  const sourceName = path.basename(key, path.extname(key))

  const command = new GetObjectCommand({
    Bucket: bucket,
    Key: key,
  })
  const res = await client.send(command)
  if (!(res.Body instanceof Readable)) {
    return
  }

  const sourceStream = res.Body

  // Capture the CSV header row while converting the rows to JSON objects
  let fields: string[] = []
  const allData = await csv()
    .on('header', (header: string[]) => {
      fields = header
    })
    .fromStream(sourceStream)

  // The first column is the document reference, all others are searchable fields
  const idx: lunr.Index = lunr(function () {
    this.ref(fields[0])
    fields.slice(1).forEach((f) => {
      this.field(f)
    })

    allData.forEach((document) => {
      this.add(document)
    })
  })

  // Store the serialized index and the documents side by side
  await Promise.all([
    store({
      bucket,
      content: JSON.stringify(idx.toJSON()),
      filename: sourceName,
      fileType: 'idx',
    }),
    store({
      bucket,
      content: JSON.stringify(allData),
      filename: sourceName,
      fileType: 'doc',
    }),
  ])
}

const store = async (params: Store): Promise<void> => {
  const s3Params: PutObjectCommandInput = {
    Bucket: params.bucket,
    Key: `${indexPrefix}${params.filename}/${params.fileType}`,
    ContentType: 'application/json',
    Body: params.content,
  }
  const command = new PutObjectCommand(s3Params)
  await client.send(command)
}

Search and return results

To reduce storage access and improve latency, the index and the documents are cached in memory outside the handler, to be re-used on subsequent invocations.

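import { GetObjectCommand, S3Client } from '@aws-sdk/client-s3'
import type { APIGatewayProxyEventV2, APIGatewayProxyResultV2 } from 'aws-lambda'
import lunr from 'lunr'

const bucket = process.env.BUCKET as string
const client = new S3Client({})

interface SearchResponseItem {
  [field: string]: unknown
  matchInfo: { score: number; matchData: lunr.MatchData }
}
type SearchResponse = SearchResponseItem[]

// Cached outside the handler so that warm invocations skip the S3 round trip
let idx: lunr.Index | undefined
let documents: Record<string, unknown>[] = []
let loadedIndexName: string | undefined

// Reads a file written by the index builder (indexes/<name>/<idx|doc>)
const getFromS3 = async (params: { bucket: string; filename: string; fileType: string }): Promise<string> => {
  const key = `indexes/${params.filename}/${params.fileType}`
  const res = await client.send(new GetObjectCommand({ Bucket: params.bucket, Key: key }))
  if (!res.Body) {
    throw new Error(`Empty S3 object: ${key}`)
  }
  return res.Body.transformToString()
}

export const handler = async (
  event: APIGatewayProxyEventV2,
): Promise<APIGatewayProxyResultV2<SearchResponse>> => {
  const searchQuery = event.queryStringParameters?.search
  const indexName = event.queryStringParameters?.indexname
  if (!searchQuery || !indexName) {
    return { statusCode: 400, body: 'Missing "search" or "indexname" query parameter' }
  }

  // Load on a cold start, or when a different index is requested
  if (!idx || loadedIndexName !== indexName) {
    const [rawIdx, rawDocs] = await Promise.all([
      getFromS3({ bucket, filename: indexName, fileType: 'idx' }),
      getFromS3({ bucket, filename: indexName, fileType: 'doc' }),
    ])
    idx = lunr.Index.load(JSON.parse(rawIdx))
    documents = JSON.parse(rawDocs)
    loadedIndexName = indexName
  }

  const queryResult = idx.search(searchQuery)

  // set_id is the reference field of the Lego dataset (its first CSV column)
  const response: SearchResponse = []
  queryResult.forEach((item) => {
    const match = documents.find((doc) => item.ref === doc.set_id)
    if (match) {
      response.push({
        ...match,
        matchInfo: {
          score: item.score,
          matchData: item.matchData,
        },
      })
    }
  })

  return response
}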
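
The cache above holds a single index at a time, which suits a function serving one dataset. Below is a minimal sketch of a cache keyed by index name, assuming one function serves several indexes. `createCache` and its `load` callback are hypothetical names, with `load` standing in for the fetch-and-parse step.

```typescript
// Memoizes one in-flight or completed load per key. Declared at module scope,
// the returned function keeps its entries between warm invocations.
const createCache = <T>(load: (name: string) => Promise<T>) => {
  const entries = new Map<string, Promise<T>>()
  return (name: string): Promise<T> => {
    let entry = entries.get(name)
    if (!entry) {
      entry = load(name).catch((err) => {
        // Do not keep a failed load cached, so that the next request can retry
        entries.delete(name)
        throw err
      })
      entries.set(name, entry)
    }
    return entry
  }
}
```

Storing the promise rather than the resolved value means that concurrent lookups of the same name share a single load. Every cached index counts against the function's memory, so this only makes sense while all indexes fit together.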

Let’s run it

We will be using 2 datasets to showcase the solution:

  • A list of Lego Sets
    • 14 fields
    • 18459 records
    • 3.8 MB
  • A list of Movies
    • 12 fields
    • 10178 records
    • 6.5 MB
    • Contains a list of actors and a synopsis

Indexing performance

| | Lego Sets | Movies |
|---|---|---|
| Items | 18459 | 10178 |
| CSV size | 3.8 MB | 6.5 MB |
| Load source and convert | 306 ms | 502 ms |
| Build index | 5629 ms | 7685 ms |
| Store indexes and documents | 1062 ms | 1522 ms |
| Memory used | 984 MB | 1319 MB |
| Resulting index size | 25.8 MB | 39.0 MB |
| Resulting document size | 7.0 MB | 7.7 MB |

Query performance

  • Search Lego sets produced in 1984 in the Duplo theme
    • query: search=+year:1984 +theme:duplo
    • amount of results: 17
    • loading index and documents: 1927 ms (only on first load)
    • query duration: 10 ms
    • memory used: 562 MB
  • Search Movies with Ryan Reynolds
    • query: search=+ryan +reynolds
    • amount of results: 37
    • loading index and documents: 3924 ms (only on first load)
    • query duration: 5 ms
    • memory used: 690 MB
  • Search Movies with the word “extra-terrestrial” in the synopsis
    • query: search=overview:extra-terrestrial
    • amount of results: 17
    • loading index and documents: 3411 ms (only on first load)
    • query duration: 4 ms
    • memory used: 690 MB
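
The queries above use Lunr's query syntax, where a `+` prefix marks a term as required and `field:term` restricts a term to one field. Here is a small sketch of building such a query and encoding it for a URL. `buildQuery` and `toQueryString` are hypothetical helpers. The encoding step matters because an unencoded `+` in a query string is commonly decoded as a space.

```typescript
// Builds a Lunr query in which every given field must match its term,
// e.g. { year: '1984', theme: 'duplo' } gives '+year:1984 +theme:duplo'
const buildQuery = (required: Record<string, string>): string =>
  Object.keys(required)
    .map((field) => `+${field}:${required[field]}`)
    .join(' ')

// Percent-encodes both parameters so that '+' and ':' survive the round trip
const toQueryString = (indexName: string, query: string): string =>
  `indexname=${encodeURIComponent(indexName)}&search=${encodeURIComponent(query)}`

const query = buildQuery({ year: '1984', theme: 'duplo' })
console.log(toQueryString('lego', query))
// indexname=lego&search=%2Byear%3A1984%20%2Btheme%3Aduplo
```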

As we can see, loading the index and documents on a fresh Lambda is slow (2 to 4 seconds), although still roughly two to three times faster than building the index. Requests served by a function that has already loaded them don’t incur this penalty.

Cost analysis

Let’s consider the movie dataset and make the following assumptions:

  • Source file updated daily (30 times a month)
  • 1M search requests per month
    • 25% are fresh: 250,000
    • 75% are re-using an already loaded Lambda: 750,000

| Item | Details | Volume | Monthly Cost |
|---|---|---|---|
| S3 Storage | CSV + Index + Docs | 53.2 MB | $0.0012 |
| S3 PUT | Daily CSV + Index + Docs | 90 | $0.0005 |
| S3 GET | 250k Indexes + 250k Docs | 500000 | $0.2150 |
| Lambda Indexing | 2048 MB Memory, 10s | 30 | $0.01 |
| Lambda Search Cold | 1024 MB Memory, 5s | 250000 | $16.72 |
| Lambda Search Hot | 1024 MB Memory, 0.5s | 750000 | $1.40 |
| Total | | | $18.13 |
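
As a rough illustration of how a Lambda row is derived, the sketch below computes a monthly cost from memory size, duration and invocation count. The prices are assumptions that the article does not state: $0.0000133334 per GB-second (arm64 architecture) and $0.20 per million requests, with no free tier applied. Under those assumptions it reproduces the indexing and cold search rows.

```typescript
// Assumed prices (not stated in the article): arm64 architecture, no free tier
const PRICE_PER_GB_SECOND = 0.0000133334
const PRICE_PER_MILLION_REQUESTS = 0.2

// Monthly Lambda cost in USD for a given memory size, duration and volume
const lambdaMonthlyCost = (memoryMb: number, seconds: number, invocations: number): number => {
  const gbSeconds = (memoryMb / 1024) * seconds * invocations
  const compute = gbSeconds * PRICE_PER_GB_SECOND
  const requests = (invocations / 1000000) * PRICE_PER_MILLION_REQUESTS
  return compute + requests
}

console.log(lambdaMonthlyCost(2048, 10, 30).toFixed(2)) // 0.01
console.log(lambdaMonthlyCost(1024, 5, 250000).toFixed(2)) // 16.72
```

The cold search row dominates the bill, which is why reducing the load time of the index has the largest effect on cost.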

Comparing this to other available solutions from AWS:

Amazon OpenSearch Serverless

Minimum settings: 2 indexing OCU, 2 Read OCU, 1GB storage

Total monthly cost: $989.91

Amazon CloudSearch

This solution is not serverless, but it is managed enough to be considered an alternative.

We need 2x search.small instances to provide a multi-AZ setup.

Total monthly cost: $99.28

Conclusion

We successfully built a serverless search API, but not without some drawbacks.

Pros

  • The solution is able to scale to match an increase in incoming requests.
  • We are only billed when a request is made.
  • We don’t need to manage any fleet of servers or additional services.
  • Very good solution for full-text search. LunrJS supports languages other than English.
  • We can improve the latency by reducing the index sizes. Indexing only the fields we want to search on will achieve that.

Cons

  • The solution is very primitive and can only do as much as LunrJS can.
    • There is no numeric range search, such as all Lego sets built between 1995 and 2000.
    • Searching for “+ryan +reynolds” actually searches for documents containing both ryan and reynolds, so it also returns a movie featuring “Ryan Gosling” and “Burt Reynolds”.
  • Index schemas should be defined depending on the source. In the case of the movies dataset, we should split crew members and genres into arrays.
  • Latency for fresh Lambdas is still high. It can be improved by reducing the index and document sizes, or by using faster storage like EFS.
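
The two search limitations listed above can be partly worked around by post-filtering the matched documents in memory, since they are already loaded alongside the index. This is a minimal sketch, with hypothetical helper names and with field names (`year`, `cast`) chosen for illustration:

```typescript
type Doc = Record<string, string>

// Keeps documents whose numeric field falls within [from, to]
const inRange = (docs: Doc[], field: string, from: number, to: number): Doc[] =>
  docs.filter((doc) => {
    const raw = doc[field] ?? ''
    if (raw.trim() === '') return false
    const value = Number(raw)
    return !Number.isNaN(value) && value >= from && value <= to
  })

// Keeps documents in which the words appear next to each other, as a phrase
const withPhrase = (docs: Doc[], field: string, phrase: string): Doc[] => {
  const needle = phrase.toLowerCase()
  return docs.filter((doc) => (doc[field] ?? '').toLowerCase().includes(needle))
}
```

For example, the results of “+ryan +reynolds” could be passed through `withPhrase(results, 'cast', 'ryan reynolds')` to drop documents where the two words belong to different people. This is a plain substring check without stemming, so it is stricter than Lunr's own matching.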