Please consider making a donation (https://pushshift.io/donations) if you download a lot of data. This helps offset the costs of my time collecting data and providing bandwidth to make these files available to the public. Thank you!
If you have questions about the data formats of the files, or about anything else, please feel free to contact me at jason@pushshift.io
In this directory, you will notice that some months are available as both an .xz file and a .bz2 file. The data in both is the same, but the .xz files offer a higher compression ratio. If an .xz file is available for a specific month, I recommend downloading that instead of the .bz2 file. Eventually all files will be available in .xz format.
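Since the same month may come as either .xz or .bz2, a reader that picks the decompressor from the file extension can be handy. The following is only a minimal sketch: it assumes each dump is UTF-8 text with one JSON object per line (that layout is an assumption here, not something stated above), and the names open_dump and iter_comments are made up for illustration.

```python
import bz2
import json
import lzma
from pathlib import Path


def open_dump(path):
    """Open a .xz or .bz2 dump as a text stream, chosen by file extension."""
    suffix = Path(path).suffix
    if suffix == ".xz":
        return lzma.open(path, mode="rt", encoding="utf-8")
    if suffix == ".bz2":
        return bz2.open(path, mode="rt", encoding="utf-8")
    raise ValueError(f"unsupported extension: {suffix!r}")


def iter_comments(path):
    """Yield one dict per non-empty line.

    Assumption: the dump is newline-delimited JSON.
    """
    with open_dump(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Reading line by line keeps memory use flat, so a large month never has to be decompressed in full. For example, `for comment in iter_comments("comments_dump.xz"): ...` would stream a file, where the file name is a placeholder.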
Starting with the October 2017 data dumps, file sizes are approximately 10-15% larger. This is because Reddit started offering a new field for comment objects called "permalink". The value of this new field is the direct link to the comment itself.
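Code that processes months from both before and after October 2017 has to cope with comment objects that lack "permalink". A small sketch, assuming each comment has already been parsed into a Python dict (the helper name get_permalink and the sample values are hypothetical):

```python
def get_permalink(comment):
    """Return the comment's "permalink" value, or None if the field is absent.

    Comment objects from dumps before October 2017 are expected to lack it.
    """
    return comment.get("permalink")


# Hypothetical sample objects, for illustration only.
older_comment = {"id": "abc", "body": "no permalink here"}
newer_comment = {"id": "def", "body": "has one", "permalink": "example-link"}

print(get_permalink(older_comment))
print(get_permalink(newer_comment))
```

Using dict.get rather than indexing avoids a KeyError on the older months.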
Reddit banned the subreddit /r/incels in early November of 2017. This happened as I was re-ingesting data for the month of October 2017. Although the data was no longer available via the Reddit API, I still had it in my real-time ingest database. In the interest of research, I included these comments in the October 2017 dump. The comments from the real-time database will have a score of "null". This only affects a subset of /r/incels comments for the months of October and November 2017.
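Because these comments carry a score of "null", any code that sums or sorts by score should set them aside first. A minimal sketch, assuming comments are parsed into Python dicts so that a JSON null becomes None (the function name split_by_score and the sample data are hypothetical):

```python
def split_by_score(comments):
    """Separate comments with a numeric score from those without one.

    A comment counts as unscored when "score" is None (JSON null)
    or when the key is missing entirely.
    """
    scored, unscored = [], []
    for comment in comments:
        if comment.get("score") is None:
            unscored.append(comment)
        else:
            scored.append(comment)
    return scored, unscored


# Hypothetical sample data, for illustration only.
sample = [
    {"id": "a", "subreddit": "incels", "score": None},
    {"id": "b", "subreddit": "python", "score": 5},
    {"id": "c", "subreddit": "python", "score": 0},
]
scored, unscored = split_by_score(sample)
total = sum(c["score"] for c in scored)
```

Note the explicit `is None` check: a score of 0 is a real score and stays in the scored list.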