How to use the Aggregation Pipeline using the MongoDB Java Driver.

#Updates

#October 21th, 2020

Update Java Driver to 4.1.1.
The Java Driver logging is now enabled via the popular SLF4J API so I added logback in the pom.xml and a configuration file logback.xml.

#What's the Aggregation Pipeline?

The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines, just like the "pipe" in the Linux Shell. Documents enter a multi-stage pipeline that transforms the documents into aggregated results.

It's the most powerful way to work with your data in MongoDB. It will allow us to make advanced queries like grouping documents, manipulate arrays, reshape document models, etc.

Let's see how we can harvest this power using Java.

#Getting Set Up

I will use the same repository as usual in this series. If you don't have a copy of it yet, you can clone it or just update it if you already have it:

1 git clone https://github.com/mongodb-developer/java-quick-start

If you didn't set up your free cluster on MongoDB Atlas, now is great time to do so. You have all the instructions in this blog post.

#First Example with Zips

In the MongoDB Sample Dataset in MongoDB Atlas, let's explore a bit the zips collection in the sample_training database.

1 MongoDB Enterprise Cluster0-shard-0:PRIMARY> db.zips.find({city:"NEW YORK"}).limit(2).pretty()
2 {
3     "_id" : ObjectId("5c8eccc1caa187d17ca72f8a"),
4     "city" : "NEW YORK",
5     "zip" : "10001",
6     "loc" : {
7         "y" : 40.74838,
8         "x" : 73.996705
9     },
10     "pop" : 18913,
11     "state" : "NY"
12 }
13 {
14     "_id" : ObjectId("5c8eccc1caa187d17ca72f8b"),
15     "city" : "NEW YORK",
16     "zip" : "10003",
17     "loc" : {
18         "y" : 40.731253,
19         "x" : 73.989223
20     },
21     "pop" : 51224,
22     "state" : "NY"
23 }

As you can see, we have one document for each zip code in the USA and for each, we have the associated population.

To calculate the population of New York, I would have to sum the population of each zip code to get the population of the entire city.

Let's try to find the 3 biggest cities in the state of Texas. Let's design this on paper first.

I don't need to work with the entire collection. I need to filter only the cities in Texas.
Once this is done, I can regroup all the zip code from a same city together to get the total population.
Then I can order my cities by descending order or population.
Finally I can keep the first 3 cities of my list.

The easiest way to build this pipeline in MongoDB is to use the aggregation pipeline builder that is available in MongoDB Compass or in MongoDB Atlas in the Collections tab.

Once this is done, you can export your pipeline to Java using the export button.

After a little code refactoring, here is what I have:

1 /**
 * find the 3 most densely populated cities in Texas.
 * @param zips sample_training.zips collection from the MongoDB Sample Dataset in MongoDB Atlas.
 */
2 private static void threeMostPopulatedCitiesInTexas(MongoCollection<Document> zips) {
3     Bson match = match(eq("state", "TX"));
4     Bson group = group("$city", sum("totalPop", "$pop"));
5     Bson project = project(fields(excludeId(), include("totalPop"), computed("city", "$_id")));
6     Bson sort = sort(descending("totalPop"));
7     Bson limit = limit(3);
8 
9     List<Document> results = zips.aggregate(Arrays.asList(match, group, project, sort, limit))
10                                  .into(new ArrayList<>());
11     System.out.println("==> 3 most densely populated cities in Texas");
12     results.forEach(printDocuments());
13 }

The MongoDB driver provides a lot of helpers to make the code easy to write and to read.

As you can see, I solved this problem with:

A $match stage to filter my documents and keep only the zip code in Texas,
A $group stage to regroup my zip codes in cities,
A $project stage to rename the field _id in city for a clean output (not mandatory but I'm classy),
A $sort stage to sort by population descending,
A $limit stage to keep only the 3 most populated cities.

Here is the output we get:

1 ==> 3 most densely populated cities in Texas
2 {
3   "totalPop": 2095918,
4   "city": "HOUSTON"
5 }
6 {
7   "totalPop": 940191,
8   "city": "DALLAS"
9 }
10 {
11   "totalPop": 811792,
12   "city": "SAN ANTONIO"
13 }

In MongoDB 4.2, there are 30 different aggregation pipeline stages that you can use to manipulate your documents. If you want to know more, I encourage you to follow this course on MongoDB University: M121: The MongoDB Aggregation Framework.

#Second Example with Posts

This time, I'm using the collection posts in the same database.

1 MongoDB Enterprise Cluster0-shard-0:PRIMARY> db.posts.findOne()
2 {
3     "_id" : ObjectId("50ab0f8bbcf1bfe2536dc3f9"),
4     "body" : "Amendment I\n<p>Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.\n<p>\nAmendment II\n<p>\nA well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.\n<p>\nAmendment III\n<p>\nNo Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.\n<p>\nAmendment IV\n<p>\nThe right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.\n<p>\nAmendment V\n<p>\nNo person shall be held to answer for a capital, or otherwise infamous crime, unless on a presentment or indictment of a Grand Jury, except in cases arising in the land or naval forces, or in the Militia, when in actual service in time of War or public danger; nor shall any person be subject for the same offence to be twice put in jeopardy of life or limb; nor shall be compelled in any criminal case to be a witness against himself, nor be deprived of life, liberty, or property, without due process of law; nor shall private property be taken for public use, without just compensation.\n<p>\n\nAmendment VI\n<p>\nIn all criminal prosecutions, the accused shall enjoy the right to a speedy and public trial, by an impartial jury of the State and district wherein the crime shall have been committed, which district shall have been previously ascertained by law, and to be informed of the nature and cause of the accusation; to be confronted with the witnesses against him; to have compulsory process for obtaining witnesses in his favor, and to have the Assistance of Counsel for his defence.\n<p>\nAmendment VII\n<p>\nIn Suits at common law, where the value in controversy shall exceed twenty dollars, the right of trial by jury shall be preserved, and no fact tried by a jury, shall be otherwise re-examined in any Court of the United States, than according to the rules of the common law.\n<p>\nAmendment VIII\n<p>\nExcessive bail shall not be required, nor excessive fines imposed, nor cruel and unusual punishments inflicted.\n<p>\nAmendment IX\n<p>\nThe enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.\n<p>\nAmendment X\n<p>\nThe powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.\"\n<p>\n",
5     "permalink" : "aRjNnLZkJkTyspAIoRGe",
6     "author" : "machine",
7     "title" : "Bill of Rights",
8     "tags" : [
9         "watchmaker",
10         "santa",
11         "xylophone",
12         "math",
13         "handsaw",
14         "dream",
15         "undershirt",
16         "dolphin",
17         "tanker",
18         "action"
19     ],
20     "comments" : [
21         {
22             "body" : "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum",
23             "email" : "HvizfYVx@pKvLaagH.com",
24             "author" : "Santiago Dollins"
25         },
26         {
27             "body" : "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum",
28             "email" : "glbeRCMi@KwnNwhzl.com",
29             "author" : "Omar Bowdoin"
30         }
31     ],
32     "date" : ISODate("2012-11-20T05:05:15.231Z")
33 }

This collection of 500 posts has been generated artificially but it contains arrays and I want to show you how we can manipulate arrays in a pipeline.

Let's try to find the three most popular tags and for each tag, I also want the list of post titles they are tagging.

Here is my solution in Java.

1 /**
 * find the 3 most popular tags and their post titles
 * @param posts sample_training.posts collection from the MongoDB Sample Dataset in MongoDB Atlas.
 */
2 private static void threeMostPopularTags(MongoCollection<Document> posts) {
3     Bson unwind = unwind("$tags");
4     Bson group = group("$tags", sum("count", 1L), push("titles", "$title"));
5     Bson sort = sort(descending("count"));
6     Bson limit = limit(3);
7     Bson project = project(fields(excludeId(), computed("tag", "$_id"), include("count", "titles")));
8     List<Document> results = posts.aggregate(Arrays.asList(unwind, group, sort, limit, project)).into(new ArrayList<>());
9     System.out.println("==> 3 most popular tags and their posts titles");
10     results.forEach(printDocuments());
11 }

Here I'm using the very useful $unwind stage to break down my array of tags.

It allows me in the following $group stage to group my tags, count the posts and collect the titles in a new array titles.

Here is the final output I get.

1 ==> 3 most popular tags and their posts titles
2 {
3   "count": 8,
4   "titles": [
5     "Gettysburg Address",
6     "US Constitution",
7     "Bill of Rights",
8     "Gettysburg Address",
9     "Gettysburg Address",
10     "Declaration of Independence",
11     "Bill of Rights",
12     "Declaration of Independence"
13   ],
14   "tag": "toad"
15 }
16 {
17   "count": 8,
18   "titles": [
19     "Bill of Rights",
20     "Gettysburg Address",
21     "Bill of Rights",
22     "Bill of Rights",
23     "Declaration of Independence",
24     "Declaration of Independence",
25     "Bill of Rights",
26     "US Constitution"
27   ],
28   "tag": "forest"
29 }
30 {
31   "count": 8,
32   "titles": [
33     "Bill of Rights",
34     "Declaration of Independence",
35     "Declaration of Independence",
36     "Gettysburg Address",
37     "US Constitution",
38     "Bill of Rights",
39     "US Constitution",
40     "US Constitution"
41   ],
42   "tag": "hair"
43 }

As you can see, some titles are repeated. As I said earlier, the collection was generated so the post titles are not uniq. I could solve this "problem" by using the $addToSet operator instead of the $push one if this was really an issue.

#Final Code

1 package com.mongodb.quickstart;
2 
3 import com.mongodb.client.MongoClient;
4 import com.mongodb.client.MongoClients;
5 import com.mongodb.client.MongoCollection;
6 import com.mongodb.client.MongoDatabase;
7 import org.bson.Document;
8 import org.bson.conversions.Bson;
9 import org.bson.json.JsonWriterSettings;
10 
11 import java.util.ArrayList;
12 import java.util.Arrays;
13 import java.util.List;
14 import java.util.function.Consumer;
15 
16 import static com.mongodb.client.model.Accumulators.push;
17 import static com.mongodb.client.model.Accumulators.sum;
18 import static com.mongodb.client.model.Aggregates.*;
19 import static com.mongodb.client.model.Filters.eq;
20 import static com.mongodb.client.model.Projections.*;
21 import static com.mongodb.client.model.Sorts.descending;
22 
23 public class AggregationFramework {
24 
25     public static void main(String[] args) {
26         String connectionString = System.getProperty("mongodb.uri");
27         try (MongoClient mongoClient = MongoClients.create(connectionString)) {
28             MongoDatabase db = mongoClient.getDatabase("sample_training");
29             MongoCollection<Document> zips = db.getCollection("zips");
30             MongoCollection<Document> posts = db.getCollection("posts");
31             threeMostPopulatedCitiesInTexas(zips);
32             threeMostPopularTags(posts);
33         }
34     }
35 
36     /**
     * find the 3 most densely populated cities in Texas.
     * @param zips sample_training.zips collection from the MongoDB Sample Dataset in MongoDB Atlas.
     */
37     private static void threeMostPopulatedCitiesInTexas(MongoCollection<Document> zips) {
38         Bson match = match(eq("state", "TX"));
39         Bson group = group("$city", sum("totalPop", "$pop"));
40         Bson project = project(fields(excludeId(), include("totalPop"), computed("city", "$_id")));
41         Bson sort = sort(descending("totalPop"));
42         Bson limit = limit(3);
43 
44         List<Document> results = zips.aggregate(Arrays.asList(match, group, project, sort, limit))
45                                      .into(new ArrayList<>());
46         System.out.println("==> 3 most densely populated cities in Texas");
47         results.forEach(printDocuments());
48     }
49 
50     /**
     * find the 3 most popular tags and their post titles
     * @param posts sample_training.posts collection from the MongoDB Sample Dataset in MongoDB Atlas.
     */
51     private static void threeMostPopularTags(MongoCollection<Document> posts) {
52         Bson unwind = unwind("$tags");
53         Bson group = group("$tags", sum("count", 1L), push("titles", "$title"));
54         Bson sort = sort(descending("count"));
55         Bson limit = limit(3);
56         Bson project = project(fields(excludeId(), computed("tag", "$_id"), include("count", "titles")));
57 
58         List<Document> results = posts.aggregate(Arrays.asList(unwind, group, sort, limit, project)).into(new ArrayList<>());
59         System.out.println("==> 3 most popular tags and their posts titles");
60         results.forEach(printDocuments());
61     }
62 
63     private static Consumer<Document> printDocuments() {
64         return doc -> System.out.println(doc.toJson(JsonWriterSettings.builder().indent(true).build()));
65     }
66 }

#Wrapping Up

The aggregation pipeline is very powerful. We have just scratched the surface with these two examples but trust me if I tell you that it's your best ally if you can master it.

I encourage you to follow the M121 course on MongoDB University to become an aggregation pipeline jedi.
If you want to learn more and deepen your knowledge faster, I recommend you check out the M220J: MongoDB for Java Developers training available for free on MongoDB University.

In the next blog post, I will explain to you the Change Streams in Java.

Java - Aggregation Pipeline