With the release of MongoDB 4.4 comes a new aggregation pipeline stage called
$unionWith
. This stage lets you combine multiple collections into a
single result set!
Here's how you'd use it:
Simplified syntax, with no additional processing on the specified collection
1 db.collection.aggregate([ 2 { $unionWith: "<anotherCollection>" } 3 ])
Extended syntax, using optional pipeline field
1 db.collection.aggregate([ 2 { $unionWith: { coll: "<anotherCollection>", pipeline: [ <stage1>, etc. ] } } 3 ])
⚠ If you use the pipeline field to process your collection before combining, keep in mind that stages that write data, like
$out
and$merge
, can't be used!
Your resulting documents will merge your current collection's (or pipeline's) stream of documents with the documents from the collection/pipeline you specify. Keep in mind that this can include duplicates!
#This sounds kinda familiar..
If you've used the UNION ALL
operation in SQL before, the
$unionWith
stage's functionality may sound familiar to you, and you
wouldn't be wrong! Both combine the result sets from multiple queries
and return the merged rows, some of which may be duplicates. However,
that's where the similarities end. Unlike MongoDB's $unionWith
stage, you have to follow a few
rules
in order to run a valid UNION ALL
operation in SQL:
- Make sure your two queries have the same number of columns
- Make sure the order of columns are the same
- Make sure the matching columns are compatible data types.
It'd look something like this in SQL:
1 SELECT column1, expression1, column2 2 FROM table1 3 UNION ALL 4 SELECT column1, expression1, column2 5 FROM table2 6 WHERE [conditions]
With the $unionWith
stage in MongoDB, you don't have to worry about
these stringent constraints.
#So how is MongoDB's $unionWith
stage different?
The most convenient difference between the $unionWith
stage and
other UNION operations is that there's no matching schema restriction.
This flexible schema support means you can combine documents that may
not have the same type or number of fields. This is common in certain
scenarios, where the data we need to use comes from different sources:
- TimeSeries data that's stored by month/quarter/some other unit of time
- IoT device data, per fleet or version
- Archival and Recent data, stored in a Data Lake
- Regional data
With MongoDB's $unionWith
stage, combining these data sources is
possible.
Ready to try the new $unionWith
stage? Follow along by completing a
few setup steps first. Or, you can skip to the code
samples. 😉
#Prerequisites
First, a general understanding of what the aggregation framework is and how to use it will be important for the rest of this tutorial. If you are unfamiliar with the aggregation framework, check out this great Introduction to the MongoDB Aggregation Framework, written by fellow dev advocate Ken Alger!
Next, based on your situation, you may already have a few prerequisites setup or need to start from scratch. Either way, choose your scenario to configure the things you need so that you can follow the rest of this tutorial!
Choose your scenario:
I don't have an Atlas cluster set up yet:
- You'll need an Atlas account to play around with MongoDB Atlas! Create one if you haven't already done so. Otherwise, log into your Atlas account.
- Setup a free Atlas cluster (no credit card needed!). Be sure to select MongoDB 4.4 (may be Beta, which is OK) as your version in Additional Settings!
💡 If you don't see the prompt to create a cluster: You may be prompted to create a project first before you see the prompt to create your first cluster. In this case, go ahead and create a project first (leaving all the default settings). Then continue with the instructions to deploy your first free cluster!
- Once your cluster is set up, add your IP address to your cluster's connection settings. This tells your cluster who's allowed to connect to it.
- Finally, create a database user for your cluster. Atlas requires anyone or anything accessing its clusters to authenticate as MongoDB database users for security purposes! Keep these credentials handy as you'll need them later on.
- Continue with the steps in Connecting to your cluster.
I have an Atlas cluster set up:
Great! You can skip ahead to Connecting to your cluster.
Connecting to your cluster
To connect to your cluster, we'll use the MongoDB for Visual Studio Code extension (VS Code for short 😊). You can view your data directly, interact with your collections, and much more with this helpful extension! Using this also consolidates our workspace into a single window, removing the need for us to jump back and forth between our code and MongoDB Atlas!
💡 Though we'll be using the VS Code Extension and VS Code for the rest of this tutorial, it's not a requirement to use the
$unionWith
pipeline stage! You can also use the CLI, language-specific drivers, or Compass if you prefer!
- Install the MongoDB for VS Code extension (or install VS Code first, if you don't already have it 😉).
- To connect to your cluster, you'll need a connection string. You can get this connection string from your cluster connection settings. Go to your cluster and select the "Connect" option:
data:image/s3,"s3://crabby-images/424c8/424c87e108bff9cd7e5a6f17f2137a7e0a44ecbe" alt="Connecting to your Atlas cluster"
- Select the "Connect using MongoDB Compass" option. This will give us a connection string in the DNS Seedlist Connection format that we can use with the MongoDB extension.
data:image/s3,"s3://crabby-images/5eeef/5eeefe713fb82f70e4233905a4cc7c02eec56d5d" alt="Choosing the "Connect with MongoDB Compass" option"
💡 The MongoDB for VS Code extension also supports the standard connection string format. Using the DNS seedlist connection format is purely preference.
- Skip to the second step and copy the connection string (don't worry about the other settings, you won't need them):
data:image/s3,"s3://crabby-images/41a2c/41a2c5ce20661de180949b61bce0d972233ec9b2" alt="Copy the connection string in MongoDB Atlas connection settings"
- Switch back to VS Code. Press
Ctrl
+Shift
+P
(on Windows) orShift
+Command
+P
(on Mac) to bring up the command palette. This shows a list of all VS Code commands.
data:image/s3,"s3://crabby-images/568da/568da982a5c6422de2ff7333d159ede1b3c5ba47" alt="Showing the command palette in VS Code"
- Start typing "MongoDB" until you see the MongoDB extension's list of available commands. Select the "MongoDB: Connect with Connection String" option.
data:image/s3,"s3://crabby-images/62f24/62f243e8ec2d096ec71b64524893156dae8cbe22" alt="Searching for MongoDB extension commands"
- Paste in your copied connection string. 💡 Don't forget! You have to replace the placeholder password with your actual password!
data:image/s3,"s3://crabby-images/9a788/9a788b1a7aac59161a8f6227d4360c2eafaaa763" alt="Pasting and modifying our MongoDB Atlas cluster connection string"
- Press enter to connect! You'll know the connection was successful if you see a confirmation message on the bottom right. You'll also see your cluster listed when you expand the MongoDB extension pane.
With the MongoDB extension installed and your cluster connected, you can
now use MongoDB Playgrounds to test out the $unionWith
examples! MongoDB
Playgrounds give us a nice sandbox to easily write and test Mongo
queries. I love using it when prototying or trying something new because
it has query auto-completion and syntax highlighting, something that you
don't get in most terminals.
Let's finally dive into some examples!
#Examples
To follow along, you can use these MongoDB Playground files I have created to accompany this blog post or create your own!
💡 If you create your own playground, remember to change the database name and delete the default template's code first!
#$unionWith
using a pipeline
📃 Use this playground if you'd like follow along with pre-written code for this example.
Right at the top, specify the database you'll be using. In this example, I'm using a database also called union-walkthrough
:
1 use('union-walkthrough');
💡 I haven't actually created a database called
union-walkthrough
in Atlas yet, but that's no problem! When the playground runs, it will see that it does not yet exist and create a database of the specified name!
Next, we need data! Particularly about some planets. And particularly about planets in a certain movie series. 😉
Using the awesome SWAPI API, I've collected such information on a few planets. Let's add them into two collections, separated by popularity.
Any planets that appear in at least 2 or more films are considered
popular. Otherwise, we'll add them into the lonely_planets
collection:
1 // Insert a few documents into the lonely_planets collection. 2 db.lonely_planets.insertMany([ 3 { 4 "name": "Endor", 5 "rotation_period": "18", 6 "orbital_period": "402", 7 "diameter": "4900", 8 "climate": "temperate", 9 "gravity": "0.85 standard", 10 "terrain": "forests, mountains, lakes", 11 "surface_water": "8", 12 "population": "30000000", 13 "residents": [ 14 "http://swapi.dev/api/people/30/" 15 ], 16 "films": [ 17 "http://swapi.dev/api/films/3/" 18 ], 19 "created": "2014-12-10T11:50:29.349000Z", 20 "edited": "2014-12-20T20:58:18.429000Z", 21 "url": "http://swapi.dev/api/planets/7/" 22 }, 23 { 24 "name": "Kamino", 25 "rotation_period": "27", 26 "orbital_period": "463", 27 "diameter": "19720", 28 "climate": "temperate", 29 "gravity": "1 standard", 30 "terrain": "ocean", 31 "surface_water": "100", 32 "population": "1000000000", 33 "residents": [ 34 "http://swapi.dev/api/people/22/", 35 "http://swapi.dev/api/people/72/", 36 "http://swapi.dev/api/people/73/" 37 ], 38 "films": [ 39 "http://swapi.dev/api/films/5/" 40 ], 41 "created": "2014-12-10T12:45:06.577000Z", 42 "edited": "2014-12-20T20:58:18.434000Z", 43 "url": "http://swapi.dev/api/planets/10/" 44 }, 45 { 46 "name": "Yavin IV", 47 "rotation_period": "24", 48 "orbital_period": "4818", 49 "diameter": "10200", 50 "climate": "temperate, tropical", 51 "gravity": "1 standard", 52 "terrain": "jungle, rainforests", 53 "surface_water": "8", 54 "population": "1000", 55 "residents": [], 56 "films": [ 57 "http://swapi.dev/api/films/1/" 58 ], 59 "created": "2014-12-10T11:37:19.144000Z", 60 "edited": "2014-12-20T20:58:18.421000Z", 61 "url": "http://swapi.dev/api/planets/3/" 62 }, 63 { 64 "name": "Hoth", 65 "rotation_period": "23", 66 "orbital_period": "549", 67 "diameter": "7200", 68 "climate": "frozen", 69 "gravity": "1.1 standard", 70 "terrain": "tundra, ice caves, mountain ranges", 71 "surface_water": "100", 72 "population": "unknown", 73 "residents": [], 74 "films": [ 75 "http://swapi.dev/api/films/2/" 76 ], 77 "created": "2014-12-10T11:39:13.934000Z", 78 "edited": "2014-12-20T20:58:18.423000Z", 79 "url": "http://swapi.dev/api/planets/4/" 80 }, 81 { 82 "name": "Bespin", 83 "rotation_period": "12", 84 "orbital_period": "5110", 85 "diameter": "118000", 86 "climate": "temperate", 87 "gravity": "1.5 (surface), 1 standard (Cloud City)", 88 "terrain": "gas giant", 89 "surface_water": "0", 90 "population": "6000000", 91 "residents": [ 92 "http://swapi.dev/api/people/26/" 93 ], 94 "films": [ 95 "http://swapi.dev/api/films/2/" 96 ], 97 "created": "2014-12-10T11:43:55.240000Z", 98 "edited": "2014-12-20T20:58:18.427000Z", 99 "url": "http://swapi.dev/api/planets/6/" 100 } 101 ]); 102 103 // Insert a few documents into the popular_planets collection. 104 db.popular_planets.insertMany([ 105 { 106 "name": "Tatooine", 107 "rotation_period": "23", 108 "orbital_period": "304", 109 "diameter": "10465", 110 "climate": "arid", 111 "gravity": "1 standard", 112 "terrain": "desert", 113 "surface_water": "1", 114 "population": "200000", 115 "residents": [ 116 "http://swapi.dev/api/people/1/", 117 "http://swapi.dev/api/people/2/", 118 "http://swapi.dev/api/people/4/", 119 "http://swapi.dev/api/people/6/", 120 "http://swapi.dev/api/people/7/", 121 "http://swapi.dev/api/people/8/", 122 "http://swapi.dev/api/people/9/", 123 "http://swapi.dev/api/people/11/", 124 "http://swapi.dev/api/people/43/", 125 "http://swapi.dev/api/people/62/" 126 ], 127 "films": [ 128 "http://swapi.dev/api/films/1/", 129 "http://swapi.dev/api/films/3/", 130 "http://swapi.dev/api/films/4/", 131 "http://swapi.dev/api/films/5/", 132 "http://swapi.dev/api/films/6/" 133 ], 134 "created": "2014-12-09T13:50:49.641000Z", 135 "edited": "2014-12-20T20:58:18.411000Z", 136 "url": "http://swapi.dev/api/planets/1/" 137 }, 138 { 139 "name": "Alderaan", 140 "rotation_period": "24", 141 "orbital_period": "364", 142 "diameter": "12500", 143 "climate": "temperate", 144 "gravity": "1 standard", 145 "terrain": "grasslands, mountains", 146 "surface_water": "40", 147 "population": "2000000000", 148 "residents": [ 149 "http://swapi.dev/api/people/5/", 150 "http://swapi.dev/api/people/68/", 151 "http://swapi.dev/api/people/81/" 152 ], 153 "films": [ 154 "http://swapi.dev/api/films/1/", 155 "http://swapi.dev/api/films/6/" 156 ], 157 "created": "2014-12-10T11:35:48.479000Z", 158 "edited": "2014-12-20T20:58:18.420000Z", 159 "url": "http://swapi.dev/api/planets/2/" 160 }, 161 { 162 "name": "Naboo", 163 "rotation_period": "26", 164 "orbital_period": "312", 165 "diameter": "12120", 166 "climate": "temperate", 167 "gravity": "1 standard", 168 "terrain": "grassy hills, swamps, forests, mountains", 169 "surface_water": "12", 170 "population": "4500000000", 171 "residents": [ 172 "http://swapi.dev/api/people/3/", 173 "http://swapi.dev/api/people/21/", 174 "http://swapi.dev/api/people/35/", 175 "http://swapi.dev/api/people/36/", 176 "http://swapi.dev/api/people/37/", 177 "http://swapi.dev/api/people/38/", 178 "http://swapi.dev/api/people/39/", 179 "http://swapi.dev/api/people/42/", 180 "http://swapi.dev/api/people/60/", 181 "http://swapi.dev/api/people/61/", 182 "http://swapi.dev/api/people/66/" 183 ], 184 "films": [ 185 "http://swapi.dev/api/films/3/", 186 "http://swapi.dev/api/films/4/", 187 "http://swapi.dev/api/films/5/", 188 "http://swapi.dev/api/films/6/" 189 ], 190 "created": "2014-12-10T11:52:31.066000Z", 191 "edited": "2014-12-20T20:58:18.430000Z", 192 "url": "http://swapi.dev/api/planets/8/" 193 }, 194 { 195 "name": "Coruscant", 196 "rotation_period": "24", 197 "orbital_period": "368", 198 "diameter": "12240", 199 "climate": "temperate", 200 "gravity": "1 standard", 201 "terrain": "cityscape, mountains", 202 "surface_water": "unknown", 203 "population": "1000000000000", 204 "residents": [ 205 "http://swapi.dev/api/people/34/", 206 "http://swapi.dev/api/people/55/", 207 "http://swapi.dev/api/people/74/" 208 ], 209 "films": [ 210 "http://swapi.dev/api/films/3/", 211 "http://swapi.dev/api/films/4/", 212 "http://swapi.dev/api/films/5/", 213 "http://swapi.dev/api/films/6/" 214 ], 215 "created": "2014-12-10T11:54:13.921000Z", 216 "edited": "2014-12-20T20:58:18.432000Z", 217 "url": "http://swapi.dev/api/planets/9/" 218 }, 219 { 220 "name": "Dagobah", 221 "rotation_period": "23", 222 "orbital_period": "341", 223 "diameter": "8900", 224 "climate": "murky", 225 "gravity": "N/A", 226 "terrain": "swamp, jungles", 227 "surface_water": "8", 228 "population": "unknown", 229 "residents": [], 230 "films": [ 231 "http://swapi.dev/api/films/2/", 232 "http://swapi.dev/api/films/3/", 233 "http://swapi.dev/api/films/6/" 234 ], 235 "created": "2014-12-10T11:42:22.590000Z", 236 "edited": "2014-12-20T20:58:18.425000Z", 237 "url": "http://swapi.dev/api/planets/5/" 238 } 239 ]);
This separation is indicative of how our data may be grouped. Despite
the separation, we can use the $unionWith
stage to combine these two
collections if we ever needed to analyze them as a single result set!
Let's say that we needed to find out the total population of planets, grouped by climate. Additionally, we'd like to leave out any planets that don't have population data from our calculation. We can do this using an aggregation:
1 // Run an aggregation to view total planet populations, grouped by climate type. 2 use('union-walkthrough'); 3 4 db.lonely_planets.aggregate([ 5 { 6 $match: { 7 population: { $ne: 'unknown' } 8 } 9 }, 10 { 11 $unionWith: { 12 coll: 'popular_planets', 13 pipeline: [{ 14 $match: { 15 population: { $ne: 'unknown' } 16 } 17 }] 18 } 19 }, 20 { 21 $group: { 22 _id: '$climate', totalPopulation: { $sum: { $toLong: '$population' } } 23 } 24 } 25 ]);
If you've followed along in your own MongoDB playground and have copied the code so far, try running the aggregation!
And if you're using the provided MongoDB playground I created, highlight lines 264 - 290 and then run the selected code.
💡 You'll notice in the code snippet above that I've added another
use('union-walkthrough');
method right above the aggregation code. I do this to make the selection of relevant code within the playground easier. It's also required so that the aggregation code can run against the correct database. However, the same thing can be achieved by selecting multiple lines, namely the originaluse('union-walkthrough')
line at the top and whatever additional example you'd like to run!
You should see the results like so:
1 [ 2 { 3 _id: 'arid', 4 totalPopulation: 200000 5 }, 6 { 7 _id: 'temperate', 8 totalPopulation: 1007536000000 9 }, 10 { 11 _id: 'temperate, tropical', 12 totalPopulation: 1000 13 } 14 ]
Unsurprisingly, planets with "temperate" climates seem to have more inhabitants. Something about that cool 75 F / 23.8 C, I guess 🌞
Let's break down this aggregation:
The first object we pass into our aggregation is also our first stage, used here as our filter criteria. Specifically, we use the $match pipeline stage:
1 { 2 $match: { 3 population: { $ne: 'unknown' } 4 } 5 },
In this example, we filter out any documents that have unknown
as
their population
value using the
$ne
(not equal) operator.
The next object (and next stage) in our aggregation is our
$unionWith
stage. Here, we specifiy what collection we'd like to
perform a union with (including any duplicates). We also make use of the
pipeline field to similarly filter out any documents in our
popular_planets
collection that have an unknown population:
1 { 2 $unionWith: { 3 coll: 'popular_planets', 4 pipeline: [ 5 { 6 $match: { 7 population: { $ne: 'unknown' } 8 } 9 } 10 ] 11 } 12 },
Finally, we have our last stage in our aggregation. After combining our
lonely_planets
and popular_planets
collections (both filtering
out documents with no population data), we group the resulting documents
using a $group stage:
1 { 2 $group: { 3 _id: '$climate', 4 totalPopulation: { $sum: { $toLong: '$population' } } 5 } 6 }
Since we want to know the total population per climate type, we first
specify _id
to be the $climate
field from our combined
result set. Then, we calculate a new field called totalPopulation
by
using a
$sum
operator to add each matching document's population values together.
You'll also notice that based on the data we have, we needed to use a
$toLong
operator to first convert our $population
field into a calculable
value!
#$unionWith
without a pipeline
📃 Use this playground if you'd like follow along with pre-written code for this example.
Now, if you don't need to run some additional processing on the
collection you're combining with, you don't have to! The pipeline
field is optional and is only there if you need it.
So, if you just need to work with the planet data as a unified set, you can do that too:
1 // Run an aggregation with no pipeline 2 use('union-walkthrough'); 3 4 db.lonely_planets.aggregate([ 5 { $unionWith: 'popular_planets' } 6 ]);
Copy this aggregation into your own playground and run it! Alternatively, select and run lines 293 - 297 if using the provided MongoDB playground!
Tada! Now you can use this unified dataset for analysis or further processing.
#Different Schemas
Combining the same schemas is great, but we can do
that in regular SQL too! The real convenience of the $unionWith
pipeline stage is that it can also combine collections with different
schemas. Let's take a look!
#$unionWith
using collections with different schemas
📃 Use this playground if you'd like follow along with pre-written code for this example.
As before, we'll specifiy the database we want to use:
1 use('union-walkthrough');
This time, we'll use some acquired information about certain starships and vehicles that are used in this same movie series. Let's add them to their respective collections:
1 // Insert a few documents into the starships collection 2 db.starships.insertMany([ 3 { 4 "name": "Death Star", 5 "model": "DS-1 Orbital Battle Station", 6 "manufacturer": "Imperial Department of Military Research, Sienar Fleet Systems", 7 "cost_in_credits": "1000000000000", 8 "length": "120000", 9 "max_atmosphering_speed": "n/a", 10 "crew": 342953, 11 "passengers": 843342, 12 "cargo_capacity": "1000000000000", 13 "consumables": "3 years", 14 "hyperdrive_rating": 4.0, 15 "MGLT": 10, 16 "starship_class": "Deep Space Mobile Battlestation", 17 "pilots": [] 18 }, 19 { 20 "name": "Millennium Falcon", 21 "model": "YT-1300 light freighter", 22 "manufacturer": "Corellian Engineering Corporation", 23 "cost_in_credits": "100000", 24 "length": "34.37", 25 "max_atmosphering_speed": "1050", 26 "crew": 4, 27 "passengers": 6, 28 "cargo_capacity": 100000, 29 "consumables": "2 months", 30 "hyperdrive_rating": 0.5, 31 "MGLT": 75, 32 "starship_class": "Light freighter", 33 "pilots": [ 34 "http://swapi.dev/api/people/13/", 35 "http://swapi.dev/api/people/14/", 36 "http://swapi.dev/api/people/25/", 37 "http://swapi.dev/api/people/31/" 38 ] 39 }, 40 { 41 "name": "Y-wing", 42 "model": "BTL Y-wing", 43 "manufacturer": "Koensayr Manufacturing", 44 "cost_in_credits": "134999", 45 "length": "14", 46 "max_atmosphering_speed": "1000km", 47 "crew": 2, 48 "passengers": 0, 49 "cargo_capacity": 110, 50 "consumables": "1 week", 51 "hyperdrive_rating": 1.0, 52 "MGLT": 80, 53 "starship_class": "assault starfighter", 54 "pilots": [] 55 }, 56 { 57 "name": "X-wing", 58 "model": "T-65 X-wing", 59 "manufacturer": "Incom Corporation", 60 "cost_in_credits": "149999", 61 "length": "12.5", 62 "max_atmosphering_speed": "1050", 63 "crew": 1, 64 "passengers": 0, 65 "cargo_capacity": 110, 66 "consumables": "1 week", 67 "hyperdrive_rating": 1.0, 68 "MGLT": 100, 69 "starship_class": "Starfighter", 70 "pilots": [ 71 "http://swapi.dev/api/people/1/", 72 "http://swapi.dev/api/people/9/", 73 "http://swapi.dev/api/people/18/", 74 "http://swapi.dev/api/people/19/" 75 ] 76 }, 77 ]); 78 79 // Insert a few documents into the vehicles collection 80 db.vehicles.insertMany([ 81 { 82 "name": "Sand Crawler", 83 "model": "Digger Crawler", 84 "manufacturer": "Corellia Mining Corporation", 85 "cost_in_credits": "150000", 86 "length": "36.8 ", 87 "max_atmosphering_speed": 30, 88 "crew": 46, 89 "passengers": 30, 90 "cargo_capacity": 50000, 91 "consumables": "2 months", 92 "vehicle_class": "wheeled", 93 "pilots": [] 94 }, 95 { 96 "name": "X-34 landspeeder", 97 "model": "X-34 landspeeder", 98 "manufacturer": "SoroSuub Corporation", 99 "cost_in_credits": "10550", 100 "length": "3.4 ", 101 "max_atmosphering_speed": 250, 102 "crew": 1, 103 "passengers": 1, 104 "cargo_capacity": 5, 105 "consumables": "unknown", 106 "vehicle_class": "repulsorcraft", 107 "pilots": [], 108 }, 109 { 110 "name": "AT-AT", 111 "model": "All Terrain Armored Transport", 112 "manufacturer": "Kuat Drive Yards, Imperial Department of Military Research", 113 "cost_in_credits": "unknown", 114 "length": "20", 115 "max_atmosphering_speed": 60, 116 "crew": 5, 117 "passengers": 40, 118 "cargo_capacity": 1000, 119 "consumables": "unknown", 120 "vehicle_class": "assault walker", 121 "pilots": [], 122 "films": [ 123 "http://swapi.dev/api/films/2/", 124 "http://swapi.dev/api/films/3/" 125 ], 126 "created": "2014-12-15T12:38:25.937000Z", 127 "edited": "2014-12-20T21:30:21.677000Z", 128 "url": "http://swapi.dev/api/vehicles/18/" 129 }, 130 { 131 "name": "AT-ST", 132 "model": "All Terrain Scout Transport", 133 "manufacturer": "Kuat Drive Yards, Imperial Department of Military Research", 134 "cost_in_credits": "unknown", 135 "length": "2", 136 "max_atmosphering_speed": 90, 137 "crew": 2, 138 "passengers": 0, 139 "cargo_capacity": 200, 140 "consumables": "none", 141 "vehicle_class": "walker", 142 "pilots": [ 143 "http://swapi.dev/api/people/13/" 144 ] 145 }, 146 { 147 "name": "Storm IV Twin-Pod cloud car", 148 "model": "Storm IV Twin-Pod", 149 "manufacturer": "Bespin Motors", 150 "cost_in_credits": "75000", 151 "length": "7", 152 "max_atmosphering_speed": 1500, 153 "crew": 2, 154 "passengers": 0, 155 "cargo_capacity": 10, 156 "consumables": "1 day", 157 "vehicle_class": "repulsorcraft", 158 "pilots": [], 159 } 160 ]);
You may be thinking (as I first did), what's the difference between starships and vehicles? You'll be pleased to know that starships are defined as any "single transport craft that has hyperdrive capability". Any other single transport craft that does not have hyperdrive capability is considered a vehicle. The more you know! 😮
If you look at the two collections, you'll see that they have two key differences:
- The
max_atmosphering_speed
field is present in both collections, but is astring
in thestarships
collection and anint
in thevehicles
collection. - The
starships
collection has two fields (hyperdrive_rating
,MGLT
) that are not present in thevehicles
collection, as it only relates to starships.
But you know what? That's not a problem for the $unionWith
stage!
You can combine them just as before:
1 // Run an aggregation with no pipeline and differing schemas 2 use('union-walkthrough'); 3 4 db.starships.aggregate([ 5 { $unionWith: 'vehicles' } 6 ]);
Try running the aggregation in your playground! Or if you're following along in the MongoDB playground I've provided, select and run lines 185 - 189! You should get the following combined result set as your output:
1 [ 2 { 3 _id: 5f306ddca3ee8339643f137e, 4 name: 'Death Star', 5 model: 'DS-1 Orbital Battle Station', 6 manufacturer: 'Imperial Department of Military Research, Sienar Fleet Systems', 7 cost_in_credits: '1000000000000', 8 length: '120000', 9 max_atmosphering_speed: 'n/a', 10 crew: 342953, 11 passengers: 843342, 12 cargo_capacity: '1000000000000', 13 consumables: '3 years', 14 hyperdrive_rating: 4, 15 MGLT: 10, 16 starship_class: 'Deep Space Mobile Battlestation', 17 pilots: [] 18 }, 19 { 20 _id: 5f306ddca3ee8339643f137f, 21 name: 'Millennium Falcon', 22 model: 'YT-1300 light freighter', 23 manufacturer: 'Corellian Engineering Corporation', 24 cost_in_credits: '100000', 25 length: '34.37', 26 max_atmosphering_speed: '1050', 27 crew: 4, 28 passengers: 6, 29 cargo_capacity: 100000, 30 consumables: '2 months', 31 hyperdrive_rating: 0.5, 32 MGLT: 75, 33 starship_class: 'Light freighter', 34 pilots: [ 35 'http://swapi.dev/api/people/13/', 36 'http://swapi.dev/api/people/14/', 37 'http://swapi.dev/api/people/25/', 38 'http://swapi.dev/api/people/31/' 39 ] 40 }, 41 // + 7 other results, omitted for brevity 42 ]
Can you imagine doing that in SQL? Hint: You can't! That kind of schema restriction is something you don't need to worry about with MongoDB, though!
#$unionWith using collections with different schemas and a pipeline
📃 Use this playground if you'd like follow along with pre-written code for this example.
So we can combine different schemas no problem. What if we need to do a
little extra work on our collection before combining it? That's where
the pipeline
field comes in!
Let's say that there's some classified information in our data about the vehicles. Namely, any vehicles manufactured by Kuat Drive Yards (AKA a division of the Imperial Department of Military Research).
By direct orders, you are instructed not to give out this information under any circumstances. In fact, you need to intercept any requests for vehicle information and remove these classified vehicles from the list!
We can do that like so:
1 use('union-walkthrough'); 2 3 db.starships.aggregate([ 4 { 5 $unionWith: { 6 coll: 'vehicles', 7 pipeline: [ 8 { 9 $redact: { 10 $cond: { 11 if: { $eq: [ "$manufacturer", "Kuat Drive Yards, Imperial Department of Military Research"] }, 12 then: "$$PRUNE", 13 else: "$$DESCEND" 14 } 15 } 16 } 17 ] 18 } 19 } 20 ]);
In this example, we're combining the starships
and vehicles
collections as before, using the $unionWith
pipeline stage. We
also process the vehicle
data a bit more, using the $unionWith
's optional pipeline
field:
1 // Pipeline used with the vehicle collection 2 { 3 $redact: { 4 $cond: { 5 if: { $eq: [ "$manufacturer", "Kuat Drive Yards, Imperial Department of Military Research"] }, 6 then: "$$PRUNE", 7 else: "$$DESCEND" 8 } 9 } 10 }
Inside the $unionWith
's pipeline, we use a
$redact
stage to restrict the contents of our documents based on a condition. The condition is specified using the
$cond
operator, which acts like an if/else
statement.
In our case, we are evaluating whether or not the manufacturer
field holds a value of
"Kuat Drive Yards, Imperial Department of Military Research". If it does
(uh oh, that's classified!), we use a system variable called
$$PRUNE, which lets us exclude all fields at the current
document/embedded document level. If it doesn't, we use another system
variable called $$DESCEND, which will return all fields at the
current document level, except for any embedded documents.
This works perfectly for our use case. Try running the aggregation (lines 192 - 211, if using the provided MongoDB Playground). You should see a combined result set, minus any Imperial manufactured vehicles:
1 [ 2 { 3 _id: 5f306ddca3ee8339643f137e, 4 name: 'Death Star', 5 model: 'DS-1 Orbital Battle Station', 6 manufacturer: 'Imperial Department of Military Research, Sienar Fleet Systems', 7 cost_in_credits: '1000000000000', 8 length: '120000', 9 max_atmosphering_speed: 'n/a', 10 crew: 342953, 11 passengers: 843342, 12 cargo_capacity: '1000000000000', 13 consumables: '3 years', 14 hyperdrive_rating: 4, 15 MGLT: 10, 16 starship_class: 'Deep Space Mobile Battlestation', 17 pilots: [] 18 }, 19 { 20 _id: 5f306ddda3ee8339643f1383, 21 name: 'X-34 landspeeder', 22 model: 'X-34 landspeeder', 23 manufacturer: 'SoroSuub Corporation', 24 cost_in_credits: '10550', 25 length: '3.4 ', 26 max_atmosphering_speed: 250, 27 crew: 1, 28 passengers: 1, 29 cargo_capacity: 5, 30 consumables: 'unknown', 31 vehicle_class: 'repulsorcraft', 32 pilots: [] 33 }, 34 // + 5 more non-Imperial manufactured results, omitted for brevity 35 ]
We did our part to restrict classified information! 🎶 Hums Imperial March 🎶
#Restrictions for UNION ALL
Now that we know how the $unionWith
stage works, it's important to
discuss its limits and restrictions.
#Duplicates
We've mentioned it already, but it's important to reiterate: using the
$unionWith
stage will give you a combined result set which may
include duplicates! This is equivalent to how the UNION ALL
operator
works in SQL
as well. As a workaround, using a $group
stage at the
end of your pipeline to remove duplicates is advised, but only when possible
and if the resulting data does not get inaccurately skewed.
There are plans to add similar fuctionality to
UNION
(which combines result sets but removes duplicates), but
that may be in a future release.
#Sharded Collections
If you use a $unionWith
stage as part of a
$lookup
pipeline, the collection you specify for the $unionWith
cannot be
sharded. As an example, take a look at this aggregation:
1 // Invalid aggregation (tried to use sharded collection with $unionWith) 2 db.lonely_planets.aggregate([ 3 { 4 $lookup: { 5 from: "extinct_planets", 6 let: { last_known_population: "$population", years_extinct: "$time_extinct" }, 7 pipeline: [ 8 // Filter criteria 9 { $unionWith: { coll: "questionable_planets", pipeline: [ { pipeline } ] } }, 10 // Other pipeline stages 11 ], 12 as: "planetdata" 13 } 14 } 15 ])
The coll questionable_planets
(located within the $unionWith
stage) cannot be sharded.
This is enforced to prevent
a significant decrease in performance due to the shuffling of data
around the cluster as it determines the best execution plan.
#Transactions
Aggregation pipelines can't use the $unionWith
stage inside
transactions because a rare, but possible 3-thread deadlock can occur in very
niche scenarios. Additionally, in MongoDB 4.4, there is a first-time
definition of a view that would restrict its reading from within a
transaction.
#$out
and $merge
The
$out
and
$merge
stages cannot be used in a $unionWith
pipeline. Since both $out
and $merge
are
stages that write data to a collection, they need to be the last stage in
a pipeline. This conflicts with the usage of the $unionWith
stage
as it outputs its combined result set onto the next stage, which
can be used at any point in an aggregation pipeline.
#Collations
If your aggregation includes a collation, that collation is used for the operation, ignoring any other collations.
However, if your aggregation doesn't include a collation, it will use the collation for the top-level collection/view on which the aggregation is run:
- If the
$unionWith
coll is a collection, its collation is ignored. - If the
$unionWith
coll is a view, then its collation must match that of the top-level collection/view. Otherwise, the operation errors.
#You've made it to the end!
We've discussed what the $unionWith
pipeline stage is and how you can
use it in your aggregations to combine data from multiple collections.
Though similar to SQL's UNION ALL
operation, MongoDB's $unionWith
stage distinguishes itself
through some convenient and much-needed characteristics. Most notable is the ability to combine
collections with different schemas! And as a much needed improvement, using a $unionWith
stage
eliminates the need to write additional code, code that was required because we had no other way to
combine our data!
If you have any questions about the $unionWith
pipeline stage or this blog post,
head over to the MongoDB Community forums
or Tweet me!