Playing with MongoDB

25 min. read

This is meant to be a small collection of notes that I wrote here and there while playing with MongoDB as my first NoSQL database. I also wrote a database comparison table recently, which compares and explains the differences between different types of databases, including relational and NoSQL.

There is also a talk by Martin Fowler called Introduction to NoSQL from 2012 which explains the reason for the existence of NoSQL databases, the difference with relational databases and when to use each (he also has a book that I will read at some point in the future). I found that talk very cool and it helped me to understand and put things in context a bit!

Introduction

NoSQL means, well, no SQL and no relational databases. NoSQL databases are good to store unstructured data, that doesn't follow a schema. It is usually stored as key value pairs, and it doesn't support joins. In the case of MongoDB, the name "mongo" comes from "humongous", refering to the fact that it was used mainly in Big Data applications.

Differences between relational and NoSQL

In a relational database, you have tables, made of rows, with columns/cell values, and a defined schema. In a NoSQL database, you have collections, made of documents, with key/value pairs, and a dynamic schema. Example:

  • Relational database: All rows have the same columns:
    
        Table: Creatures
        ------------------------------
        Id | Name    | Age | Specie
        ------------------------------
        1  | Coolio  |   4 | seahorse
        2  | Clownio |   6 | crab
    
  • NoSQL database: You can change the schema from one document to the other. Documents are JSON-like objects:
    
        "creatures": {
          {
            "id": 1,
            "name": "coolio",
            "age": 4,
            "race": "seahorse",
            "props": "sunglasses"
          },
          {
            "id": 2,
            "name": "clownio",
            "race": "crab"
          }
        }
    

    Here, creatures would be a collection and it has two documents, each with different "schemas".

The MondoDB shell

The Mongo shell is a JavaScript interpreter, so you can write plain Javascript like you would do in a node prompt. If you want to use MongoDB in your application, there are drivers for MongoDB in many languages.

But first, you have to install it! Choose the version for your operating system and make sure you have a server running before using the Mongo shell. To start the daemon:


$ sudo service mongod start

...and to stop it:


$ sudo service mongod stop

Then you can just type (in another terminal):


$ mongo
> use databasename

This command will switch to the databasename or create it, if it doesn't exist. To show all databases you can type show dbs which will show their names and sizes. The first time you start the MongoDB shell it will use a default database called test.

Inserting and deleting documents

Inserting

Let's create a database and insert some creatures:


$ mongo
> use ocean
switched to db ocean
> db
ocean
> db.creatures.insert(
   {
     "name": "coolio",
     "age": 4,
     "race": "seahorse",
     "props": "sunglasses"
   }
 )
WriteResult({ "nInserted" : 1 })
> db.creatures.insert(
   {
     "name": "clownio",
     "race": "crab",
   }
 )
WriteResult({ "nInserted" : 1 })
> show tables
creatures

If the creatures collection doesn't exist, it will create it. Also, when we create entries in a collection, we don't need to pass the id, it will generate one automatically, although we could pass a custom id if we wanted to. Child documents don't need an id either.

Deleting

The command remove() will remove all documents that match the query:


> db.creatures.remove({YOUR_QUERY})

Value types

BSON is kind of a wrapper over JSON, it allows you to save the same basic types but also ObjectId(...) and ISODate(...) types. You can check out some of the MongoDB BSON types or the BSON spec.

  • MongoDB preserves the precision of floating point numbers so if you save a 15.33 you will get a 15.33
  • Dates are added with the Date JavaScript function: "date": new Date(2017, 08, 14). Caveat: Javascript months begin at 0, so 8 is actually September. However, when dates are saved to the database, they will be saved as "date": ISODate("2017-08-14T03:25:00Z")
  • Array values are treated individually:

> db.creatures.insert(
...  {
    "name": "ninja",
    "race": "manta ray",
    "weapons": ["starfish", "sawshark", "trident"]
  }
 )
WriteResult({ "nInserted" : 1 })
> db.creatures.find({"weapons": "starfish"})
{ "_id" : ObjectId("5a4d0f54271798cd71c85b8e"), "name" : "ninja", "race" : "manta ray", "weapons" : [ "starfish", "sawshark", "trident" ] }
  • Subdocuments are found with dot notation:

> db.creatures.insert(
   {
     "name": "cop eye",
     "race": "dolphin",
     "pair": { "name": "dickens", "race": "tuna"}
   }
 )
WriteResult({ "nInserted" : 1 })
> db.creatures.find({ "pair.name": "dickens" })
{ "_id" : ObjectId("5a4d12ba271798cd71c85b8f"), "name" : "cop eye", "race" : "dolphin", "pair" : { "name" : "dickens", "race" : "tuna" } }
>

Validations

In MondoDB, you have to validate the data before it hits the database. For example, MongoDB let's you save a string in a number field:


> db.creatures.insert(
  {
    "name": "coolio",
    "age": "four",
    // ...
  }
)

MongoDB only performs these validations:

  • Unique _id
  • No syntax errors
  • Document is less than 16MB

Retrieving documents

To retrieve a single element in a collection:


> db.creatures.findOne()
{
  "_id" : ObjectId("5a4d099ae889d01cf188c510"),
  "name" : "coolio",
  "age" : 4,
  "race" : "seahorse",
  "props" : "sunglasses"
}

We can also find several documents, and use query arguments to select certain documents:


> db.creatures.find()
{ "_id" : ObjectId("5a4d099ae889d01cf188c510"), "name" : "coolio", "age" : 4, "race" : "seahorse", "props" : "sunglasses" }
{ "_id" : ObjectId("5a4d0a1be889d01cf188c511"), "name" : "clownio", "race" : "crab" }
> db.creatures.find({"name": "coolio"})
{ "_id" : ObjectId("5a4d099ae889d01cf188c510"), "name" : "coolio", "age" : 4, "race" : "seahorse", "props" : "sunglasses" }

We can use regex as well:


> db.creatures.find({"title": /foo/i })

We can make comparisons with $gt, $lt, $gte, $lte, and $ne


> db.creatures.find({"age": {"$gte": 2, "$lt": 5 }})
{ "_id" : ObjectId("5a4d099ae889d01cf188c510"), "name" : "coolio", "age" : 4, "race" : "seahorse", "props" : "ferrari" }
{ "_id" : ObjectId("5a4d0a1be889d01cf188c511"), "name" : "clownio", "race" : "crab", "age" : 2 }

We can use exclusions to remove sensitive data (whenever projecting, we either select or exclude the fields we want, we don't do both):


> db.creatures.find({"age": {"$gte": 2, "$lt": 5 }}, {"passport": false})

If you are doing a selection instead, then the only field that can be excluded is _id:


> db.creatures.find({"age": {"$gte": 2, "$lt": 5 }}, {"name": true, "_id": false})

Because find() returns a cursor, to iterate through the cursor we can type:


> it

We can call other methods on the cursor:


db.creatures.find().sort({ "age": 1}) // -1 for descending
db.creatures.find().limit(3) // for pagination
db.creatures.find().skip(3).limit(3) // for the second page
db.creatures.find().skip(6).limit(3) // for the third page

And we can pretty-print the results, count them or explain them:


> db.creatures.find().pretty()
{
  "_id" : ObjectId("5a4d099ae889d01cf188c510"),
  "name" : "coolio",
  "age" : 4,
  "race" : "seahorse",
  "props" : "sunglasses"
}
{
  "_id" : ObjectId("5a4d0a1be889d01cf188c511"),
  "name" : "clownio",
  "race" : "crab",
  "age" : 2
}
> db.creatures.find().count()
2
> db.creatures.find().explain()
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "ocean.creatures",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "$and" : [ ]
    },
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "$and" : [ ]
      },
      "direction" : "forward"
    },
    "rejectedPlans" : [ ]
  },
  "serverInfo" : {
    "host" : "nitro",
    "port" : 27017,
    "gitVersion" : "4c1bae566c0c00f996a2feb16febf84936ecaf6f"
  },
  "ok" : 1
}

Updating documents

We can use the update operator "$set":


> db.creatures.update({"name": "coolio"}, { "$set": {"props": "ferrari"} })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

This operator updates the first matching document only. So if we want to update several documents, we have to add {"multi": true}


> db.creatures.update({"race": "dolphin"}, { "$set": {"intelligence": 100} }, {"multi": true})
WriteResult({ "nMatched" : 2, "nUpserted" : 0, "nModified" : 2 })

This will update all matches.

There is also the increment operator, which can use a negative number. If the key you are incrementing doesn't exist, it will create it.


> db.creatures.find({"name": "clownio"})
{ "_id" : ObjectId("5a4d0a1be889d01cf188c511"), "name" : "clownio", "race" : "crab", "age" : 6 }
> db.creatures.update({"name": "clownio"}, { "$inc": {"age": 1} })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.creatures.find({"name": "clownio"})
{ "_id" : ObjectId("5a4d0a1be889d01cf188c511"), "name" : "clownio", "race" : "crab", "age" : 7 }

If we try to increment a document that doesn't exist, it will create it if we add { "upsert": true }


> db.creatures.update({"name": "I don't exist yet"}, { "$inc": {"age": 1} }, { "upsert": true })
WriteResult({ "nMatched" : 0, "nUpserted" : 1, "nModified" : 0 })

Remove a key from all matching documents:


> db.creatures.update({}, { "$unset": {"race": ""} }, { "multi": true})
WriteResult({ "nMatched" : 4, "nUpserted" : 0, "nModified" : 4 })

Rename a key for all matching documents:


> db.creatures.update({}, { "$rename": {"name": "alias"} }, { "multi": true})
WriteResult({ "nMatched" : 4, "nUpserted" : 0, "nModified" : 4 })

Update item of array: you have to provide the index with dot notation:


> db.creatures.update({"name": "ninja"}, { "$set": {"weapons.1": "giant sawshark"} })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

We can also update for several documents when the item is in a different position. For that we use the position operator:


> db.creatures.update({"weapons": "sawshark"}, { "$set": {"weapons.$": "giant sawshark"} }, {"multi": true})
WriteResult({ "nMatched" : 2, "nUpserted" : 0, "nModified" : 2 })

> db.creatures.update({"name": "cop eye"}, { "$set": {"pair.alias": "incognito"} })
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

You can get a complete list of operators in the docs.

Data consistency

In NoSQL databases, data consistency can get tricky. There are no joins, hence, there are two main ways of using subdocuments: as referenced documents or as embedded documents.

Referenced documents

Instead of embedding the subdocuments in all documents, we can separate them in two collections and just reference them by id, pretty much like foreign keys in relational databases. This normalizes data, reduces duplication and reduces inconsistencies.


> db.weapons.insert({
  "_id": "starfish",
  //... etc.
}, {
  "_id": "sawshark",
  //... etc.
})

> db.creatures.insert({
  "name": "ninja",
  "weapons": [{"weapon_id": "starfish"}, {"weapon_id": "sawshark"}, ... etc.]
})

Referenced documents have the advantage that you update them in one place; they are accessed directly, independently of the parent, which makes for less inconsistencies in the data. But there is no warranty that both write operations, to the creatures and the weapons collection, will happen.

This would happen in a "transaction" in relational databases. Transactions are not supported in MongoDB. There is no multi-document write support, no rollbacks, and you run the risk of ending up with an unreferenced weapon.

Another problem with this approach is that we need two queries now: first query the creatures, then the weapons:


db.creatures.find({"name": "ninja"})
db.weapons.find({"_id": "starfish"})

Embedded documents

This is what we've been doing, one query to get all the data. These are atomic write operations, if we are inserting a document in a collection, the write completelly happens or doesn't happen at all. If something bad happens, we don't end up with a half-saved bunch of data. Also, subdocuments are accessed through their parent, in a single query.

How to decide for one or the other

To decide for one or the other, think about how the data is going to be used and ask these three questions: how often is the data used together? What is the expected size of the data? How often does the data change?

How often is the data used together?
Data that is frequently used together will benefit from being embedded, while data that's rarely used can afford the cost of referencing and performing two queries:
  • Always: Use embed for faster access
  • Sometimes: Embed or reference
  • Rarely: Embed or reference
What is the expected size?
The size of the data in a relationship has significant impact on data modelling. You might start to see a decline in read performance when using embedded documents with big sets of data.
  • < 100: Embed
  • > 100+: Embed or reference
  • >= 1000+: Definitely reference
How often does the data change?
Depending on whether or not the duplicate data changes a lot, sometimes embedding data can lead to data duplication and inconsistencies.
  • never/rarely: Embed, data duplication is OK if we don't expect change
  • occasionally: Embed or reference, depends on whether or not we want the overhead of managing duplication
  • constantly: Definitely reference, it will prevent inconsistencies

Some examples:

  1. Post that has comments that have authors, three queries, so maybe it's better to embed.
  2. Post comments: if comments don't change often, they are usually less than 100 and they are used together often, embed
  3. Comments author: if authors change, and we only need the name and maybe the avatar, reference

Embed is usually a good first option; if you have a lot of data then you can use referencing. Just think how you will use them, and if you need complex references maybe what you need is a relational database.

Aggregates

Aggregates are combined data. You can use stage operators to deal with them.


db.creatures.aggregate([{ "$group": {"_id": "$weapon_id"} }])

Keys with $ are operators and values with $ are field paths.

  • $group: this is a stage operator
  • _id the group key, required
  • $weapon_id a field path

Accumulators

Anything specified after the group key is considered an accumulator. Accumulators take a single expression and compute the expression for grouped documents


db.creatures.aggregate([{ "$group": {"_id": "$weapon_id", "total": {"$sum": 1}} }])
db.creatures.aggregate([{ "$group": {"_id": "$weapon_id", "average": {"$avg": "$age"}} }])
db.creatures.aggregate([{ "$group": {"_id": "$weapon_id", "max_age": {"$max": "$age"}} }])

Pipelines

The aggregate method acts like a pipeline, where we can pass data through many stages in order to change it along the way. Each stage modifies the working dataset and then passes the altered documents to the next stage until we get our desired results.


db.creatures.aggregate([
  { "$match": { "weapons": "starfish" }},
  { "$project": {"name": true} },
  { "$group": {"_id": "$weapon_id", "total": {"$sum": 1}} },
  { "$sort": {"total": -1} },
  { "$limit": 3 }
])

A good practice is to use $match and $project early to reduce the number of documents to handle and improve performance. The queries passed are like those you would use with find().


And that is all I have! Hope you enjoyed it.

Comments