Duration: 2.5 hours
Level: Any Level
Publisher: Packt Publishing

Gain valuable insights from your data by streamlining unstructured data pipelines with Python, Spark, and MongoDB

Expected learning & outcomes

  • Understand MongoDB as a non-relational database built on JSON documents
  • Set up cursors in pyMongo to connect to a MongoDB database
  • Run complex chained and aggregation queries
  • Connect to MongoDB from pySpark
  • Write MongoDB queries using operators and chain them into aggregation pipelines
  • Work through real-world examples of using Python and MongoDB in a data pipeline
  • Use the MongoDB connector in pySpark for high-performance processing
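
    As a taste of the pyMongo style listed above, the sketch below builds a filter query and an aggregation pipeline as plain Python dictionaries, which is exactly the form pyMongo sends to the server. The collection and field names (shop.orders, total, customer_id) are hypothetical, and executing the queries requires a running MongoDB instance plus `pip install pymongo`:

    ```python
    # MongoDB queries in pyMongo are ordinary Python dicts and lists.
    # Filter document: orders over $100 placed in 2023, using a
    # comparison operator ($gt) alongside an exact match.
    big_orders = {
        "total": {"$gt": 100},
        "year": 2023,
    }

    # Aggregation pipeline: filter, group by customer summing totals,
    # then sort descending. Stages chain together in a Python list.
    pipeline = [
        {"$match": big_orders},
        {"$group": {"_id": "$customer_id", "spend": {"$sum": "$total"}}},
        {"$sort": {"spend": -1}},
    ]

    # Against a live MongoDB this would run as:
    #   from pymongo import MongoClient
    #   client = MongoClient("mongodb://localhost:27017")
    #   cursor = client.shop.orders.aggregate(pipeline)  # cursor yields dicts
    #   for doc in cursor:
    #       print(doc["_id"], doc["spend"])
    ```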

    Skills you will learn

    Big Data, Data Analysis, Data Science, Data Visualization, Machine Learning, MongoDB, NoSQL, Programming

    About this course

    This course is a comprehensive, practical guide to using MongoDB and Spark from Python: you will learn how to store and make sense of huge datasets and perform basic machine learning tasks to make predictions.

    MongoDB is one of the most powerful non-relational database systems available, offering robust scalability and expressive operations that, combined with Python's data analysis libraries and distributed computing, form a valuable toolkit for the modern data scientist. NoSQL databases require a new way of thinking about data and scalable queries.

    This course covers how to use MongoDB, particularly if you are used to SQL databases, with a focus on scaling to large datasets. pyMongo is introduced as the means of interacting with a MongoDB database from within Python code, and the data structures it uses are explored. Once Mongo queries have been mastered, you will see how to leverage them from Python's rich analysis and visualization ecosystem. MongoDB also allows complex operations and aggregations to be run within the query itself, and we cover how to use these operators.

    While MongoDB is built to scale easily across many nodes as datasets grow, Python is not. We therefore cover how to use Spark with MongoDB to apply more demanding machine learning techniques to extremely large datasets. These skills are put to work on several real-world datasets and analyses that can form the basis of your own pipelines, letting you get up and running quickly with a powerful data science toolkit.
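
    To illustrate the Spark side of the pipeline described above, here is a minimal sketch of wiring Spark to MongoDB via the MongoDB Spark connector. The package coordinates and option names match the 10.x connector and are assumptions (check the connector documentation for the release matching your Spark version); the database and collection names (shop.orders) and fields are hypothetical:

    ```python
    # Connection settings for the MongoDB Spark connector (10.x naming).
    # The URI points at a hypothetical local database and collection.
    mongo_conf = {
        "spark.mongodb.read.connection.uri":
            "mongodb://localhost:27017/shop.orders",
    }

    # With pyspark installed and MongoDB running, this would execute as:
    #   from pyspark.sql import SparkSession
    #   spark = (
    #       SparkSession.builder
    #       .appName("mongo-pipeline")
    #       .config("spark.jars.packages",
    #               "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
    #       .config("spark.mongodb.read.connection.uri",
    #               mongo_conf["spark.mongodb.read.connection.uri"])
    #       .getOrCreate()
    #   )
    #   df = spark.read.format("mongodb").load()   # collection -> DataFrame
    #   df.groupBy("customer_id").sum("total").show()
    ```

    The key design point is that the heavy lifting (filtering, grouping) is pushed to the Spark cluster rather than a single Python process, which is what makes the approach scale.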

    About the Author

    Alex Rutherford is currently a Research Scientist at the Massachusetts Institute of Technology, researching scalable and cooperative human-computer systems. He has over 10 years of Python programming experience, from his PhD in computational physics at University College London through post-doctoral research into modeling social systems and data science for the United Nations and Facebook. Alex has worked on numerous end-to-end data science projects using a wide array of data sources, methods, and technologies, from social media to constitutional documents. He is also director of Data Apparel, a social enterprise that fuses graphic design with data visualization to promote empathy with vulnerable populations.

