
Hi guys, and welcome to this session on web scraping by Intellipaat. Web scraping is a technique used to extract large amounts of data from websites, and in today's session we'll show you how to do web scraping with Python. Before we get into all of that, please subscribe to our channel so you never miss an update from Intellipaat.

Now let's talk about today's agenda. We start the session by discussing what web scraping is, followed by a brief introduction to web scraping libraries like BeautifulSoup and Scrapy. Then we cover how to install BeautifulSoup as well as the Python parser lxml on your system, how to create BeautifulSoup objects from input HTML, and finally we wrap up by searching the parse tree and getting the required output. Also, guys, we do provide end-to-end certification training on Python, so if you are interested you can check out the course details given in the description below. Now let's get into the session.

First of all, let's understand what web scraping is. Suppose you have a link to a website that contains some information which gets updated regularly, and you wish to store that information locally, or in a database, or to access it and perform some manipulation on it. Before scraping, you should usually check whether that website has an open API that provides the data, so that you can simply request the data from a URL endpoint. Sometimes, however, these websites either don't have a web API or don't expose the data you want through it. In that case, you fetch the entire HTML content of the page as a string and then extract the information you need from it. You can then store it or manipulate it however you wish, and that is what's known as web scraping.

Python has a lot of packages for web scraping. One of the packages we can use is called BeautifulSoup; it's now in its fourth version, and I believe BeautifulSoup 4.4 is the current release. Another one is Scrapy. Scrapy is less for simple scraping and more for building web spiders, or web crawlers; it can be considered an entire framework in itself, and it lets you create spiders by inheriting from its classes and telling them how to parse pages, as you can see in the example on screen. That is useful when you're creating a crawler and wish to crawl multiple pages. But if you just want a scraper that pulls information from a single page, run a few times a day — a stock market website that updates hourly, for example — then in my opinion BeautifulSoup is the way to go. Scrapy is very good and more powerful in the sense that it is built for creating full spiders and crawlers; if you wish to learn it, its documentation page has several tutorials and examples. In this session we'll be looking at BeautifulSoup, because it's very easy to get started with and quite easy to use.

So what can we do with BeautifulSoup? If we have an HTML page stored offline, we can parse it directly, or we can parse an HTML page from a website. Parsing a page from a web address requires you to first get the string content of the HTML page; as you can see on screen, this is the string content of the page, returned as a string, which you can then parse and use with BeautifulSoup.

Before we get started with BeautifulSoup, one thing you need to understand is that you need to install all of your project's dependencies. I'll be using pipenv to create an environment and then install all the packages that I need. Has anyone here worked with pipenv? If you have, you can message in the chat box. No? Okay, so let me give you a brief introduction to what pipenv is, why we use it, and how we use it. Suppose you are creating a project in Python using external libraries like BeautifulSoup 4 — there are many others, like pandas, NumPy, Django, and Django REST framework. None of these packages ship with the Python standard library.

So you need to download them from the internet and then use them. Just a quick info, guys: Intellipaat does provide end-to-end certification training on Python, so if you are interested you can check out the course details given in the description below. Now let's get back to the session.

Now, you can download these packages, but it may be the case that they have dependencies themselves — that is, they require certain other packages to be installed, and those packages require further packages, and each of those might need a specific version. For instance, say your project only works on Django 2.1, and there were changes in Django 2.2 that you cannot support at the moment. Now suppose another person, or you yourself, wants to use Django 2.3 to create a new website. You would either have to uninstall Django 2.1 and install Django 2.3, or you would have to create a new environment and install the Django version you want inside it. This is where pipenv comes in. It creates a virtual environment, and all the packages you install there can only be used when you enter that environment. Think of it as a container, a virtualized environment for your Python packages.

So first of all you need to create an environment. For that you can type pipenv shell and press Enter, and it says here "creating a virtual environment". This is important, because everything we install in this virtual environment will only be usable inside it; we won't be able to use it outside. It's creating the virtual environment and fetching all the files required for it. Before pipenv there was virtualenv; virtualenv is what is used underneath pipenv, and pipenv can be thought of as an abstraction over it. You can see it is also creating a Pipfile — this file lists all the packages your project needs. So, as you can see, we have created an environment with pipenv.

"From wherever I run this command, it says the command is not recognized." Okay, so if you wish to run this command, first you need to install pipenv. On your PC or your Mac, type pip install pipenv and press Enter. Mine is already installed, so it won't show much, but it will install for you; it may take some time — it looks stuck at 65% while it executes something, but it will finish — and then you can use the pipenv command.

"Okay Anirudh, I have a question here. So far we have generally used pip install — we've installed so many libraries that way, scikit-learn, TensorFlow, pip install tensorflow. What's the difference between pip install tensorflow and pipenv, and what is this environment?" Okay, so the difference is that if I wish to use TensorFlow 2.1 for one project and TensorFlow 2.3 for another project, with pip I cannot do that, because with pip packages are installed globally. For instance, let's say I've been using BeautifulSoup version 3 for one project, and now I have a new project and I wish to use BeautifulSoup version 4. Because the versions differ, we would need to either uninstall BeautifulSoup 3 and install BeautifulSoup 4, and in that case the previous project built with BeautifulSoup 3, which was running fine, would stop working. To avoid this problem, we install these packages on a per-project basis. For instance, I wish to use BeautifulSoup 4 for this project, so when I install it using pipenv, only this project will be able to use that version. Let me show you — first, let me enter the environment.

As you can see, it didn't create a new environment, because one had already been created, and it shows the label — the unique identifier — of that environment. Now I type pipenv install beautifulsoup4 and press Enter. It starts installing; once it's installed, it begins locking, which checks all the dependency files, and then it's done. Let me show you one thing — this is the Pipfile. As you can see, it records the version of Python I'm using and shows that I'm using beautifulsoup4; since I didn't specify any version, it will install the latest, which is what the asterisk (*) means. Now, if I wish to take this project to any other computer and use it there, all I have to do is install pipenv on that computer and run pipenv install. It will recreate the same environment I have right now, with the same packages — BeautifulSoup at the latest version — and it will use Python 3.7 and work on its own. If I didn't do this, the problem would be that the other PC might not have the same versions of all the dependencies that I have: they might have BeautifulSoup 3 or 2, or not have it at all, or have a different version of Python, and so on. So you can think of pipenv as a way of creating a container in which your project can be developed. I hope that clears it up. "Thank you." Sure. Just a quick info, guys: Intellipaat does provide end-to-end certification training on Python, so if you are interested you can check out the course details given in the description below. Now let's get back to the session.

Okay, now that we have installed the BeautifulSoup library, the other thing we need to worry about is parsers. A parser is basically a package used by the BeautifulSoup library to look at the HTML content and build an HTML tree, so that it understands the structure of the HTML page. Whenever you open an HTML page in Chrome or any other browser, you can right-click and choose Inspect, and you'll see this tree-like structure — this is how HTML works: it is a tree of nodes nested inside one another in a hierarchy. To understand these relationships, we use parsers. BeautifulSoup supports several parsers; its documentation page lists them. There is html.parser, Python's built-in parser, so you don't have to install anything if you wish to use it. You can also use lxml's HTML parser, which is what we will be using in this tutorial, because it's very fast and very lenient. When I say lenient, I mean that if the HTML content we get is not properly formatted — some tags have been opened but not closed, so it's not completely valid HTML and there are some bugs or errors in it — the lxml parser will try to rectify those mistakes and still present a tree, instead of throwing an error saying the HTML is not well formed. That matters because the HTML content we get may very well be out of our hands, since we're fetching it from the internet. There are also html5lib and lxml's XML parser. The reason we won't be using html5lib, although it's very useful, is that it can be very slow, as you can see here, and that could be a problem. lxml's XML parser, on the other hand, is for parsing XML, and right now we'll be parsing HTML. Python's built-in HTML parser is also quite good, but I suggest you use lxml whenever you can: first of all it's fast, and speed matters when you're building a project, and it's lenient, so it can rectify some of the mistakes in the markup we receive. So the first thing we need to do is install the lxml parser, and I'll also install another package called requests — this is the package used to send requests to a website or a web server and get the response back as a string, or in whatever form you wish.
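To make the parser comparison above concrete, here is a minimal sketch of how the parser is selected when constructing a BeautifulSoup object. The HTML string is a deliberately sloppy stand-in for whatever page you fetch; lxml and html5lib each require a separate install, while html.parser ships with Python.

```python
from bs4 import BeautifulSoup

# A deliberately sloppy snippet: the <li> tags are never closed.
html = "<html><body><ul><li>one<li>two</ul></body></html>"

# Python's built-in parser: nothing extra to install.
soup_builtin = BeautifulSoup(html, "html.parser")

# lxml's HTML parser: fast and lenient, but needs `pipenv install lxml`.
soup_lxml = BeautifulSoup(html, "lxml")

# html5lib: parses the way a browser would, but is noticeably slower
# (needs `pipenv install html5lib`).
soup_html5 = BeautifulSoup(html, "html5lib")

print(soup_lxml.prettify())
```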

So what I'll be doing first is installing these with pipenv. You can install multiple packages in the same command, so I'll install lxml and requests — note it is requests with an s at the end; don't forget the s, because the package name really is requests. When I press Enter, lxml is installed, then requests is installed, and then it starts locking. Locking basically means checking that the dependencies are satisfied, creating a hash out of them, and pinning the versions; it also creates a Pipfile.lock file, as you can see mentioned right here, which records the exact dependency versions pipenv resolved. Locking is a process that might take some time, and after it finishes we can start with the program.

"Hi Anirudh, this is Hemanth — I joined around 7:10, so could you let me know why we use a web scraper?" Okay, sure. Many websites contain information that we might need in a project. For instance, let's say you're building a project for predicting the stock market tomorrow; for that you need data. You can get the data from an API — a web API. Many stock market websites provide web APIs, which you can think of as URL endpoints you send a request to, and you get the data back in a format you choose, like JSON or XML. However, you might come across a situation where the website that has the data you want does not have a web API you can request it from. In that case you usually use a web scraper. A web scraper is basically just a Python program that gets all the content from the website, parses it, and allows you to extract information from the web page. For instance, say I want to take the title from this website. If it has an API, I can ask the API to provide it directly; more generally, I can write a web scraper, get the HTML page that is returned when I visit the site, parse it, and extract just that piece of information. Then I can do whatever I wish with it: save it in a CSV file, a text file, or a database, manipulate it, anything. To continue with the stock market example, say there is a website that updates stock market information every second, so it is effectively instantaneous, and you wish to capture that information from the site — that is exactly the kind of case for a scraper. "Okay, so stay with me: suppose there is no web API available — then you can go directly to the website and collect the data you want?" Yes, you write a Python program, and that program will fetch the content from that website or web page and then extract the information you need. "Thank you."

"Anirudh, this is Sanjeev. My question is: take the website you're showing on screen right now — I can read the whole website content, but what if I want only some specific content? For example, somewhere on the page it's written in the Chinese language, and I need to read only that targeted content." Yes, we can do that. You wish to read just this content, am I right? "Yeah." Basically, just as we'll be doing in this session, you install the BeautifulSoup library, write the code to fetch this entire HTML content, parse it — which is not really difficult, because the library lets you do it quickly — and then traverse down the HTML tree to the element you want.
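Here is a minimal sketch of that whole flow — fetch the page, parse it, and walk down to the element you care about. The URL and the tag used at the end are placeholders for whatever site and element you actually target.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the page you actually want to scrape.
response = requests.get("https://example.com")

# Parse the raw HTML string with the lxml parser.
soup = BeautifulSoup(response.text, "lxml")

# Traverse down the tree to the piece of data you need,
# e.g. the text of the page's <title> tag.
print(soup.title.text)
```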

For instance, this Chinese content is inside a ul element with a class attribute of simple — ul stands for unordered list, which you can think of as a list with bullet points. So we just tell the BeautifulSoup library: there is a list on this page with a class of simple; extract all the information from it. It does that, and then you can do whatever you wish with the result. You can do this with any web page, and it doesn't have to be unordered lists — you can do it with images, videos, paragraphs, whatever, as I'll show you in this session.

"Sorry, I have another question. Say I'm reading the HTML content of some other website and it all shows as text or graphical images — how do we come to know that there are tables, or maybe reports, attached in there that I need to extract data from?" Sure. The first thing, whenever you're thinking of scraping a website, is to go to that website and look at its HTML content to figure out how to get the information you need. Since they haven't provided a web API, it's not as simple as taking a URL, sending a request, and getting the data back; you have to do a bit of research. So you go to the site — say this is the first time I'm visiting this documentation page and I decide I want to extract this content, as you suggested — and look at the HTML of the page, like this. Thanks to the developer tools in Chrome (or any other browser), you can click the button that says "select an element in the page to inspect it" and then hover over the content you want. I want to extract these Chinese links, so I just look: they're inside an unordered list with the class simple; inside it are list items; inside those are a (anchor) tags; and I want the text of those links. It's usually best to be very precise about what you wish to extract: if you just said "give me an unordered list", that would be a problem, because there may be other unordered lists on the page, and you'd have to wade through a lot of content to get the piece you want. So be precise, and study the page before scraping it — that way you understand how to begin, which elements you need, and which classes to reference in order to get the data you want. Does that answer your question? "Yes. And will you be showing us code, or at least a hint, so that once I choose the content I want, I can target my Python program to read it, including the link?" Yes, in this session I'll show you the code for extracting information from a website, so you don't have to worry about that. Just a quick info, guys: Intellipaat does provide end-to-end certification training on Python, so if you are interested you can check out the course details given in the description below. Now let's get back to the session.
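As a rough sketch of that targeted-extraction idea, assuming a page that really does contain a `<ul class="simple">` full of links (as on the documentation page shown on screen), the code could look like this:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: the documentation page being inspected in the demo.
page = requests.get("https://example.com/docs")
soup = BeautifulSoup(page.text, "lxml")

# Be precise: ask for the one <ul> whose class attribute is "simple",
# not just any unordered list on the page.
simple_list = soup.find("ul", class_="simple")

# Each bullet point is an <li> containing an <a> tag; grab the link text.
for item in simple_list.find_all("li"):
    print(item.a.text)
```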
Okay, so this is what we have right now; let's get into today's demo. For this session, since we need to pull some information from an HTML page, what I have for you is an HTML page — here it is. This is the page we will be scraping, and you can think of it as a blog. I'm sure you've all visited plenty of blogs, so this is like one of those: it contains posts.

Each post has a title that links to the full blog post, and below the title there is a summary; when you click the link, it takes you to the entire post. For this, the first thing I need to do is run a server. It's perfectly all right if you don't have a server — you can use any other website or page you wish — but I'm running one because it makes it easy for us to send the request, get the content back, parse it, and use it. If I didn't have a server, I would have to open this HTML file directly in Python, which isn't always possible, because if it's a website you don't own, you don't have the HTML file locally and need to fetch it over the network. For that I'll be using something called lite-server. It's checking the version of lite-server I have — yes, it's working — so this is the server I'll use. If you don't have lite-server installed, first install Node.js; go to the Node.js site and always remember to download the LTS build — LTS stands for long-term support. Install it like any other package you find online, and then run npm install -g lite-server; -g stands for global, so it installs system-wide and any project can use it. I already have it installed, so I won't run that command. Now I'll start lite-server in the current directory — make sure you are in your project directory; mine is the web-scraping folder on the D drive — and because the HTML file is named index.html, this is the page I get served. So this is the content I'll be working with. In this session I'll fetch this content, parse it, and pull out the title, the summary, and the link of each post — the links point to article_1.html, article_2.html, and so on — and after getting the content I'll store it all in a CSV file, because usually when you're scraping, you get the information from the web server as HTML, parse it, and then store it in some form or other: a CSV file, an Excel sheet, a database, a small SQLite database, JSON — whatever suits you.

So let's begin. The first thing to make sure of is that you have installed the packages I mentioned: beautifulsoup4, requests, and lxml. After installing them, we can work with this HTML file, because I want to show you how to parse an HTML page — and many websites online don't allow their pages to be crawled or scraped anyway, so a local file is convenient. One question here: "Does Node.js need to be installed in the same directory, or can it go into the default Program Files directory?" You can install it anywhere you wish; that's no problem. And Sunil asks, "Where can I get the HTML file?" I created this HTML file myself for this session; if you like, I can ask the trainer to upload it to Git so you can grab it, but you don't strictly need this file — you can use any HTML page you can get hold of. The point I'm trying to get across in this video is how you can use BeautifulSoup and parsers to scrape any website. Because I'm using a file I created myself, I'll demonstrate with it, but you can use the same approach on any website, provided they give you permission to scrape it.

Now, scraping can be a touchy subject, because when you scrape a website, the site's owners often don't want users scraping it.

If you've ever looked at the directory structure that holds the HTML pages for a website, you will have seen a robots.txt file. This file contains a list of the files and folders that the website owner, or the owner of the server, does not want a crawler to crawl. If you're creating a web crawler, this is something you need to respect; since we're creating a simple scraper here, it matters less, but you should still be aware of it. Another thing: when you run a scraper, you send requests to the web server, and that puts a certain load on the server's bandwidth. So whenever you scrape a website, make sure either that you have permission to scrape it, or that you're not doing anything so intensive that it could essentially bring the website to a halt. Most of the time it's fine, but there could even be legal action if the site goes down and it turns out to be your fault. So always make sure you're not doing something very heavy — like spinning up multiple instances of the scraper against the same site — that ends up hanging the website, and that you're not scraping something the owner does not want you to scrape. One more thing: before scraping, check whether the data can be obtained from an API, because if it can, it's much easier to request it from a URL endpoint than to scrape an entire web page.

"Sorry for the interruption — when I'm reading, or scraping, a website, how do I know whether I have permission to scrape the data from that website or not?" Many websites provide that through their robots.txt file. Whether you're creating a crawler with Scrapy or a simple parser, you can look at the robots.txt file — it's one of the easiest things to do. If they have mentioned the name of the page you are about to scrape, and stated that they do not wish for it to be scraped or crawled, then you simply don't do it, because that would be against the wishes of the creator or maintainer of the website. That's the first check. "Where can I check that?" It's usually at the root of the site. So if there were a robots.txt file — r-o-b-o-t-s dot txt — at the root of the directory, you would just replace localhost:3000 with the name of the website: for example, www.mywebsite.com/robots.txt. Our demo site doesn't have one, because it's just a local website that only I will be using, so you don't have to worry about it here — and this is exactly why I'm using a site I created myself for demonstration, instead of going to a website I don't have any permission to access.
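If you want to do that check from Python rather than by eye, the standard library's urllib.robotparser can read a site's robots.txt and tell you whether a given path may be fetched. This is just a sketch with a placeholder domain, not part of the demo code:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site: substitute the domain you actually intend to scrape.
robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

# "*" means "any user agent"; use your scraper's own agent string if it has one.
if robots.can_fetch("*", "https://www.example.com/some-page.html"):
    print("Allowed to fetch this page")
else:
    print("The site asks crawlers not to fetch this page")
```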
"Can the scraping be done on both the request and the response, or only on the response data?" When you send the request, you get the response back as an HTML page; that response is the data you receive as a string, and that is the data we parse. So scraping works on the response you get from the website. "Okay — the reason I'm asking is that there are situations where you submit form data to the server and then want to scrape the response to what you submitted." Right, so the first thing is that form data really has nothing to do with scraping; scraping only works with the information that's being shown on a website. You can submit form data as a user, but for that you would have to log in to the website, and only then would you see the resulting information.

Now, if you wish to automate that process, it can be done using other packages like Selenium and similar automation tools. For scraping, what we do is take a look at the website, look at the content that's already present there, send a request, get the HTML page's string representation in the response, parse it, and extract the information. So forms don't really fit into this picture; if you want a script that submits a form automatically for you, use Selenium or another web automation library, but BeautifulSoup is not really for that.

Okay, so let's begin scraping the website. The first thing you do is create a new Python file; I'm going to call it scrape.py — .py is the Python extension. Now I'll import BeautifulSoup from bs4, which is the name of the package. "Which editor are you using?" I'm using Visual Studio Code — VS Code. It's very good, and I suggest you use it if you're doing something serious. "What about Spyder?" Spyder and similar IDEs are also very good and you can use them as well, but they are more suited for data science and data analysis purposes; here we're just extracting information and storing it in a CSV file, so I'll be using Visual Studio Code, along with its Python extension — that's why I'm getting these code-completion hints. "Do you mean Visual Studio or VS Code?" VS Code is a standalone editor; the download is only around 30 megabytes, it's very small and quite useful. Visual Studio is more useful when you're developing an entire application using C# or the other Microsoft technologies. If you're interested, you can download VS Code for free from Microsoft's website; Visual Studio Code is the name. Thanks.

Okay, so BeautifulSoup is the class we imported from the bs4 package, and now we'll create a new instance of it. You can name the variable anything you want; I will name it soup. As you can see, there are two arguments I need to pass. The first is the markup, which is the string representation of our HTML page. For that I'll use the requests library we installed: requests.get(), and inside get() I paste the web address, http://localhost:3000, and on the response I take .text — text basically means "just give me the string representation of the entire page". The second argument is the parser's name, so I pass 'lxml'. If you are using any other parser, you need to look up the parameter to pass on the BeautifulSoup documentation page — somewhere down the page it shows the string for each parser: one for the built-in HTML parser, one for lxml, one for lxml's XML parser, and one for html5lib (after installing html5lib). If you use the built-in HTML parser you don't need to install anything, but it can be slower than lxml. After doing that, let me just print the soup and see what it gives us.
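Putting that together, the start of scrape.py looks roughly like this (localhost:3000 is the address lite-server is serving the demo index.html on):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the demo page served by lite-server and keep it as a string.
html = requests.get("http://localhost:3000").text

# Build the parse tree with the lxml parser.
soup = BeautifulSoup(html, "lxml")

# Dump the whole document to see what we fetched.
print(soup)
```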

let’s run this py and the name of the file scraped or even scrape double by now if you are using Linux or Macintosh you need to type PID h2 n entirely but on Windows after I think python version 3.4 3.6 you can type just py and it will run so i can do this and this is the entire string representation of the website now one thing that you need to notice is that it’s not formatted very nicely so it’s very hard to read so you don’t know where the body starts where the body ends and what are all the tags that are inside the body type so for that what you can do is there’s a method called pretty fun so after using pretty pipe or Chicken George spray and as you can see it’s formatted nicely now we can see where each tag starts where each tag and it’s indented nicely now after this since we don’t need this and now what you do is you go to the website and you look at the structure of the HTML page so as to figure out how to extract information that you wish to extract so what I want to do is what what I always do is usually I take the information from one of the one of if I’m if I wish to get information from multiple sites so for instance let’s say that I wish to get the information from like for instance I wish to get input from multiple HTML tags so here we have multiple posts and I need to get its tract information of all the posts what you can do is you can firstly just get the information from one post which I have here extract this summary if the title and the link and then you can just replace that one with multiple so what I’ll do here is this first I’ll show you how to extract information from a single blog post for article so if I wish to get the title and the summary and the link of the first blog post for the latest blog post if it’s ordered ascending Lee then what I can do is I can just like POS key s or you can also type articles a TI tle s in the custom soup not fine now we’ll be using find and not find all find and find all are a little bit so I will be using fine find all will return all the instances of this so I want a do now how do I know I want a do is when we look at the HTML content I want this entire post this is a div with a class of post now so if I wish to use it what I do is I pass in okay I want the information inside a div tag the first impact that you find that has the class and we’ll be using a keyword argument name class underscore the reason we are not using classes because class is already at people so then we name the class which is I think post let me check it post yeah and then print article so if I run this again now you can see we have extracted just the latest puts it is the latest polls because we are getting the first blog post title and it’s appearing at the top of the page now let’s see what the information that we need to extract is yeah so the first thing we wish is the title of the blog post second thing would be the summary and third thing would be the link so the link is basically this right now I don’t have an HTML page for article 1 in article 2 because they won’t be necessary here so okay let’s look at the information that we wish to extract so the title is inside in h3 with the title with the class title inside that there is an a tag that has the the text as the title that we wish to extract so to use that we need to to is TI tle titled is

So, to get the title, we write title = article.h3.a — you can access nested tags as properties on the object, so article is the entire post, h3 is the heading inside it, and a is the link whose text we want. We want the text only, not all the other information about the tag, so we add .text. Let me print it to check it's working correctly — whenever you're working with data like this, you should always make sure you're getting the right data — and we get "Blog post 1 title", which is exactly the information we wanted.

Next we need the summary, and we can get it the same way; looking at the HTML content, the summary we want is inside a p tag inside the post — p stands for paragraph. So, very similarly, it's article.p, and don't forget the .text attribute. Running py scrape.py again prints "Blog post 1 summary", so we're getting what we need.

Now we need the link, and the link is a little different. We get it much the same way as the title, but instead of the text we want the value of the href attribute — href stands for hypertext reference — which should give us /article_1.html. To get an attribute rather than the text, we go article.h3.a and then access href the way we'd access a key in a dictionary. Printing that gives /article_1.html, so we're getting the link as well.

Now we want this information — title, summary, and link — for all the posts. Remember, like I said, we first get the information for one post, and then we change the Python code so it gets the same information for all the articles; for that we can use a for loop. Just a quick info, guys: Intellipaat does provide end-to-end certification training on Python, so if you are interested you can check out the course details given in the description below. Now let's get back to the session.
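Collecting the steps above, the single-article version of the script looks roughly like this (the field access follows the demo page's structure: div.post, an h3 holding the link, and a p holding the summary):

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://localhost:3000").text, "lxml")

# The first (latest) post on the page.
article = soup.find("div", class_="post")

title = article.h3.a.text       # text of the link inside the heading
summary = article.p.text        # text of the paragraph inside the post
link = article.h3.a["href"]     # attribute access works like a dictionary

print(title, summary, link, sep="\n")
```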

"Now, for example, when I'm reading that link, I find that inside it there are some values, a data set — rows and columns — and I want to get that data from the website. How do I read it?" So you're asking: if I have a table of data on a website, how do I extract information from that table — am I understanding correctly? "Yes." It's basically the same process, exactly the same, so let me show you in index.html. Tables are created using the table tag; inside it you create table rows, and inside those, table data cells, and the content of a cell can be anything. Usually, when you look at a website, the table rows and cells have some classes on them so that CSS styling can be applied, and we can use that. Let's say the class on the cell is data; I'll create another row with the same class and some sample data, and it shows up on the website — this is the kind of table you were talking about. Now, if I wish to get the data, I can just say: find a td tag with the class data, and I want its text attribute. Let me run it — "3 protein bars" — so we're getting the data for a single cell. Now suppose there are three rows and I wish to get the data from all of them. Then, just as we're about to do in the main demo, I write: for data in soup.find_all(...) — find_all instead of find, so we get every td with the class data, all three — and then print data.text. As you can see, I've extracted the information from the table. You can do this with any website; the first thing is always to look at the HTML content and figure out how to tackle the problem. It could be that the cells don't have a class; in that case you can just take all the td elements — there are only three here — and use those, but on a larger page they usually do have a class you can target. "Got it, got it, thanks. And once I've read all the table content, can I save it in, say, a CSV file?" Yes, I'll be showing that.

Okay, so the next thing we're going to do is get the information from all the posts. For that, instead of soup.find we use soup.find_all: for article in soup.find_all — all the div elements with the class post — and then indent the extraction code properly under the loop. Let me print the title, the summary, and the link to show you it's working: py scrape.py, and as you can see we're getting the information we need from all the posts. There are six blog posts, and for each one we get the title, then the summary, then the link, so it's working correctly.
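Before moving on to the CSV step, here is the table example from the question above as a standalone sketch; the markup is a small stand-in for the table added to index.html during the demo, with each cell carrying a class of data:

```python
from bs4 import BeautifulSoup

# A small stand-in for the table added to index.html during the demo.
table_html = """
<table>
  <tr><td class="data">3 protein bars</td></tr>
  <tr><td class="data">2 apples</td></tr>
  <tr><td class="data">1 loaf of bread</td></tr>
</table>
"""

soup = BeautifulSoup(table_html, "lxml")

# find() would give only the first cell; find_all() gives every matching cell.
for cell in soup.find_all("td", class_="data"):
    print(cell.text)
```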

The final thing we need to do is store it in a CSV file. Thankfully for us, Python has a built-in library for dealing with CSV files. As you can see, there is no CSV file here yet, so we'll have to create one. First we import csv — this is a built-in Python package, so you don't need to install it. Then you create the CSV file: csv_file = open(...); the name of the file will be data.csv (you can name it anything you wish), and the second thing we pass is the mode. We wish to write to this CSV file, so we use the 'w' flag; if you only want to read, you either pass nothing or pass 'r'. We'll be passing 'w', which means we want to write data. After opening the file, we pass it to csv.writer and name the result writer. You could pass open(...) directly into csv.writer, but there is a slight benefit to creating a variable that stores the file handle and then passing that handle to the writer; I'll show you the benefit in a minute.

When we write to a CSV file, we write in terms of rows. The first thing to write is the headers — think of them as the titles at the top of the CSV file, three columns that tell us what the data inside each column stores. We use writer.writerow(); this method takes an iterable, so you can pass it anything you can loop over, and I'll be passing it a list of items: title, summary, and link. Now all I need to do, inside the loop, is build a row out of the title, the summary, and the link and write it the same way. After writing the rows, we need to close the file; it's always important to close an open resource, because otherwise the handle leaks and the data may never be flushed to disk. This is the benefit I mentioned of storing the file handle in a variable: we can call close() on it later, and that way the information we wrote actually gets stored.

Let me run this again — hopefully I haven't made any mistakes so far — and it works; this is the CSV file, and all the data has been written into it. One thing you'll notice, though, is that there's an empty line between all of the rows. The reason is that both the csv writer and the file object are inserting line endings, so every row ends up followed by a blank line. To fix it, we tell the file not to translate newlines itself by passing newline='' when opening it.
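Pulling the whole script together, a sketch of the finished version described so far might look like this, with the same assumptions as before about the demo page's structure:

```python
import csv
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://localhost:3000").text, "lxml")

# newline='' stops the extra blank line between rows on Windows.
csv_file = open("data.csv", "w", newline="")
writer = csv.writer(csv_file)

# Header row: one column per field we extract.
writer.writerow(["title", "summary", "link"])

# One row per post on the page.
for article in soup.find_all("div", class_="post"):
    title = article.h3.a.text
    summary = article.p.text
    link = article.h3.a["href"]
    writer.writerow([title, summary, link])

# Closing the file flushes everything we wrote to disk.
csv_file.close()
```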

After doing that, we run py scrape.py again; it works, and if we look at the file now, it's correct: this column is the post title, this is the summary, and these are the links. One thing you'll notice, though, is that the links are not fully formed — we're only getting the end of the link, not the whole address like localhost:3000/article_1.html, a link we could just click to go to the page. The reason is that links are usually not written as absolute links but as relative ones: a relative link is appended to the address of the current site. So the href only gives us /article_1.html, and the browser appends it to localhost:3000. If we want the entire link, we can write a little more code — it's basically just normal string formatting: take the base address, mark where the link should be inserted, and use the str.format() method built into Python, something like 'http://localhost:3000{}'.format(link). And always remember to close the CSV file, because if you don't you might get a permission error, as you can see here: the file is still open elsewhere, so we're not allowed to write to it. After closing it and running the script again it works perfectly, and if I open the CSV you can see we're getting the link as we wanted. If I click on it — it won't open from Excel, but it does open from Visual Studio Code — here's the full link. One small problem: it was adding two slashes, because I made a mistake in the format string; let me rectify that, run the scrape again, and there it is — the links are exactly right, so it's working as we expect. I hope that covers it. Any questions?

"At the beginning of this session you showed us installing the lite server. What is the use of that if I'm writing a Python program on my own machine?" Okay, so, lite-server: this is the HTML page I created, called index.html. If I want, I can open it directly, and it will just open locally in the browser. But since I wanted to show you how to get information from a website, I need to send a request to a web server, and that's why I'm using lite-server. Lite-server is nothing but a dummy local server, a server running on your own PC — that's why the web address is localhost with port 3000. If I didn't want to do that, and I just wanted to use the HTML page I created, then instead of fetching it with requests I could simply open the file and hand BeautifulSoup the file object for index.html. I change the code, run py scrape.py, remember to open and close the CSV, and it works. So the reason I used lite-server is that I wanted to show you how to get information from a website, but since I created the web page on my local machine I needed to serve it with a server — a local server, if you will. If you don't wish to use lite-server and you are experimenting on web pages that you have created and have with you, you can just use the code I wrote a moment ago: instead of using the requests library, open the file directly in BeautifulSoup.
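A sketch of that local-file variant — no server and no requests, just the HTML file sitting next to the script:

```python
from bs4 import BeautifulSoup

# Parse index.html straight from disk instead of fetching it over HTTP.
with open("index.html", encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, "lxml")

# The rest of the extraction is unchanged.
for article in soup.find_all("div", class_="post"):
    print(article.h3.a.text, article.h3.a["href"])
```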

Type in the name of the file — index.html, or whatever your file is called — provided the scraping script and the HTML file are in the same folder, and then parse it as normal. I'm using the requests library here so that I can show you how to get information from a server. "Please paste the code in the chat box." You want me to paste the entire code? Sure, I can paste it for you — that's the whole script, and let me also paste the other variant, yes, this one too. Okay, I've pasted the version that uses the requests library and also the version that opens the local HTML page; together they should help you. You can use the requests library with lite-server if you wish, or skip lite-server and open a local HTML page directly inside BeautifulSoup. And lastly, if you are using this on any other website, make sure they haven't asked you not to, because scraping can make the server harder for them to maintain and could also have legal repercussions. Any other questions?

"A quick question: for example, if I'm reading a website like discoverychannel.com, do I need to install lite-server?" You do not. Lite-server is only for serving the web page locally. If you're getting the information from a website that's already online, like discoverychannel.com, you don't need lite-server at all; you just type the name of the website here — in our case it's http://localhost:3000 because I'm using lite-server, but for your case it would be the .com address — then you edit the code according to the structure of the HTML page you get back and the information you need, and after that you run the code and it will work perfectly well without any lite-server. The only reason I used lite-server is that I needed a site to demonstrate how the requests library fetches information from a web server. Any other questions?

"I have a question about production use. Suppose in production I want to scrape my competitor's website, and they have multiple pages — like Amazon — so I need to go to every item page and get the price, so that I can set my price by comparison or do some data science. Do I have to give all the URLs one by one, or can I do it recursively across multiple pages?" There are multiple ways of doing it. As far as I'm aware, Amazon provides an API — I'm not sure, you can look it up — but even if it doesn't, there are options. One way is to scrape the site for links first, store all the links in a CSV file, and then use a Python script that takes the links from that CSV file and scrapes each page. Another way is, instead of storing them in a CSV file, to follow them as you find them: when you find a link on the page, you go to that link and scrape it too, so instead of scraping only the page you started with, you open a request to each link you see and start scraping that as well.
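A rough sketch of the first approach — harvest every link on a page into a CSV so another script (or a later run) can work through them; the starting URL is a placeholder:

```python
import csv
import requests
from bs4 import BeautifulSoup

start_url = "http://localhost:3000"   # placeholder: the page whose links we harvest
soup = BeautifulSoup(requests.get(start_url).text, "lxml")

with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    # href=True keeps only anchor tags that actually carry a link.
    for anchor in soup.find_all("a", href=True):
        writer.writerow([anchor["href"]])
```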

But you do need to be aware that you still have to look at the structure of whatever you scrape. For the website we have here, we know the structure: the information we need is inside a div tag with a class of post, inside that an h3, and inside that an a tag whose text is what we want. If you're just recursively going through arbitrary links, you may not know the structure of each page, and your code will throw errors. To avoid that, make sure you know the structure of the pages you visit. You can do it recursively, or you can store the links in a CSV file and work through them that way, or — I think the better way — you can use Scrapy, because it lets you create a web crawler, which does exactly what you're asking about. Scrapy is a much more full-fledged framework, so if you want to use it, go through its tutorials, as I'm showing right now: they show you how to create a Scrapy project, how to create a spider, and how to run your spider, and a spider is exactly what you're asking about — it scrapes across an entire website. Also make sure that amazon.com, or whatever website you're scraping, doesn't prohibit you from scraping it; look at their user license agreement — that will contain the information — and their robots.txt is there too, so you can check either one. So I'd suggest you use Scrapy for creating crawlers and going through multiple links, or use the technique I just described: collect all the links, store them in a CSV file, and then scrape each of them. Either way you'll get the information, but make sure the site you're crawling doesn't prohibit it and that the structure doesn't trip you up.

"Okay, so you were saying the structure needs to be known beforehand, right? Is there any intelligent algorithm — we've learned so much data science — that could predict it for us, so that I don't have to? Because if I want to scrape a whole major website, it's very difficult to remember, for every web page, which tag holds what, and the code becomes very lengthy. Is there an algorithm that can do it in one go and take care of 90 percent of it?" Not that I'm aware of. But one thing that helps here: if you are scraping Amazon, you're presumably scraping it for the prices of the items they're selling. Say I wish to look at all the shoes under $499 — I get this page, and if I want the price of each shoe, I can take the element I have here; in fact, the price is inside a span with a price-related class or ID, something like a "price block" price element. So I can scrape this listing page, or, if I have the links for all the shoes, I can open each of them, and the price will always be inside that same span with the price ID or class. What you can do is get the links of all the pages, look at the ID or class of the element you want information from, and use that consistently: instead of a div with a class of post, here we want a span with either a price class or a price-block ID, and that way you can get the price from every page. As for an algorithm, I'm not aware of any such intelligent algorithm that guesses the structure of the page and extracts the information for you. You could probably use regular expressions or something like that, but it would be very difficult, and regular expressions are computationally expensive. So either you can create a crawler, or you can use the already-known structure of the website and extract the information from it.

"So for the web crawler we also need to give it the structure, right?" To an extent, yes, but one thing web crawlers do is recursively look at the other links they find as well. For instance, Google has a web crawler: it looks at a website and stores all the information it needs from it — there are meta tags, a title tag, a body tag — and if it finds those tags it stores information about them in its database, and then it recursively follows all the links that website has, to get a better idea of what the site is all about. That is what Scrapy's crawler, or spider, is.

You can create a spider on top of the BeautifulSoup library as well, but it would be more difficult, so I'd suggest you use Scrapy for creating a crawler; if it's just one web page you need to get information from, you can use BeautifulSoup. Okay, one last question: can we pass a username and password as well with requests, if the site needs it? Yeah, so this is, I think, a question similar to the forms question we got earlier. If you need to pass in a username and password, then I would suggest you use Selenium. Selenium is a web automation tool, so it will open the browser, enter the username and password you tell it to enter, and then click the login button (there's a small Selenium sketch a bit later in this session). Now, usually the information you wish to get is obtained either through an API or through a public website. If the information you need is behind a login screen, I think it's safe to assume that the creator of the website does not wish for you to get that information; if it were public information, like the blog posts we just used, you could easily crawl it, but if it's private information, specific to particular users, then they don't allow crawling for that. So getting information by putting in a username and password is not really something you do with a scraper; with a scraper you look at a public web page that's available without providing any credentials, and you extract information from that. So you mean to say, if I need to scrape Facebook with my login user ID and password, that would also not be allowed? That's right, that's not allowed, you can't do that. You can maybe write a script that uses Selenium to open the browser on your behalf, enter the username and password, and then look at the page inside that Selenium instance, but web scraping is not really about that, so I don't think we should do that. Okay, a quick question: for example, if I'm crawling through a website like Amazon, there might be multiple pages or multiple links with the brand name Puma, and I want to know, within the price range of, say, two thousand to five thousand, what categories of shoes are available with the brand name Puma. How do I write code for that, or what are the steps to get the data? Sure, okay. So as far as I understand, you need to filter the data and then extract it. For instance, you're saying Puma is the product name, and you want to know all the shoes with the Puma brand available between 2,000 and 5,000 today, and you want to extract the data for that. So one thing you can do is, firstly, go to the Amazon website yourself and search for Puma shoes, like I have done; I hope you can see it on the screen. After that, since you are looking for something within a particular price range, you can filter accordingly; let's say I want a shoe of size four. Now the URL of this filtered page is what you need: if I type this URL here, I'll get exactly the same filtered results I got previously. So the first step is to get this filtered URL and scrape it.
Then look at the structure. On this page you will only get the shoes labeled as Puma shoes within the price range, or whatever filter you applied (here I've chosen the size as the filtering factor, so size four plus the name Puma), and from there you can begin scraping by looking at the structure of the HTML page. Say I want to get information about the prices: I can just inspect one of them and see that, yes, this is the class; here the price sits inside a span with the a-price class, showing 2,123. You can look at the structure of the page the same way and extract the price, the current deal that's going on, whatever you need. So what you need to do is: first go to the page, enter the filtering information, then copy the website's address.

That website address contains only the products which fall inside the filtering criteria you provided, and then you can scrape that web address. That's how you can do it. Got it, thank you. Yeah, sure, any other questions? We want to get all the navigation paths of the entire website; for example, all the pages are organized on a website under different navigation paths, so at the website level, how do we capture all those details? So this is again crawling recursively. Usually all the links you need are on the front page of, let's say, Amazon or any other site; for instance, this navigation bar is what you wish to get the information from, so you can again, like I said, just look at the HTML content, all the tags with the class of the navigation menu, and inside them we have all the links you need, so you can do it that way. Now if you are asking whether there is some other way to know the structure of the entire website, for instance how many web pages there are, which folders they sit in, what HTML pages exist, and we need to scrape all of them, this is what web crawlers are for, so you can use a crawler for that. Also, some websites provide information about their structure, sometimes in the About section, sometimes in a file they publish (a sitemap, for instance), so you can use that as well. However, if you wish to scrape the entire website and not just information from one web page, I suggest you use Scrapy to create a spider and then crawl the entire website through all the links you can get, or, using BeautifulSoup, get all the links that belong to the website's domain (all the links that start with the Amazon domain if you're scraping Amazon) and then scrape all of those web pages; that works as well. I don't know of an API that lets you just type the name of a website and get its whole structure; maybe there is one, but I haven't used one so far. Okay, any other question? The question here is: do we need to extract the data one by one for each item? So yes, you do need to extract the information like we have done here. The best way, as far as I can tell, is to extract the information about one item first and then do it for all the items present on the page; because the items have a similar structure, we can just use the same code, and that way we get the information for all of them. For instance, that's how we got information about all the posts: we loop over all the posts, get the information about each one, and then store it inside the CSV file.
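Just as a sketch of that pattern: the post class and the title and link selectors below are assumptions standing in for whatever the real repeating element on your page looks like, so adapt them after inspecting the page:

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "https://example-blog.com"   # placeholder for the page being scraped
    soup = BeautifulSoup(requests.get(url).text, "lxml")

    with open("posts.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "link"])
        # Loop over every repeating "post" element and write one CSV row per post
        for post in soup.find_all("div", class_="post"):
            title = post.find("h2")
            link = post.find("a", href=True)
            writer.writerow([
                title.get_text(strip=True) if title else "",
                link["href"] if link else "",
            ])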
That reminds me to ask one more question: whenever we've gone through a particular structure of a website, do I need that many variables to read the data and export it to my CSV file in a similar manner, so I can reuse it? Yeah, correct. So yes, you do, but remember that whenever you're scraping a website you are usually scraping it for information about one particular thing, and a page is, as far as I can tell, not extremely huge; even on an Amazon page there are only going to be ten, fifteen, twenty, maybe a hundred items. So if you need information about the price, all you need to do is call soup.find_all for a span with the a-price class. That will get you the price of all the items on that page, and then you just loop over the results, so you only need to create one variable to collect that information for all the elements on the page.
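As a tiny illustration of that, assuming the filtered results page from earlier and that the price really does live in a span with the a-price class (the search URL and class name are assumptions to verify in the inspector):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder for the filtered search page discussed above
    results_url = "https://www.amazon.com/s?k=puma+shoes"
    html = requests.get(results_url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "lxml")

    # One find_all call collects every price span on the page
    prices = [span.get_text(strip=True) for span in soup.find_all("span", class_="a-price")]
    print(prices)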

Now, your question about whether we need to store information about all the variables: yes, you need to store the information for all the things you need inside a variable. If you don't wish to store all of the information you can write it out directly, but even for that you need to build the row, so I would suggest that when you're scraping a website, try to find the small portion of it that repeats itself. For instance, on our website there are about six posts; we are scraping just one post, and that post structure repeats itself. If there were, say, a million posts, I wouldn't have to change my code in any way, shape or form; it would work just fine, only the CSV file would be a bit longer. So just look at the repeating element, get the information you need about one element, and then run it inside a for loop or a while loop, whatever you wish to choose, and write it out from there. And one other question we have is: if we have Wikipedia, there is no data in there, only an article, so how do we do that? Okay, so let's go to a Wikipedia article. The information you are asking about is: if we have an entire block of text, how do we extract information from it? Well, Wikipedia does actually have an HTML structure. If you look into it, say I wish to extract just the History section, or just the headlines of all the sections of this article, I think all of them have mw-headline; yes, this one also has mw-headline as the class, so you just get all the elements with the mw-headline class and you have all the section titles. What I'm saying is that even if the page looks like one big block of text with no structure, there will be some HTML structure. Now, if you wish to extract just one paragraph from this entire block of text, there are several ways: you can use a regular expression, which extracts text matching a pattern from a block of text, or you can take the text and split it on '\n', which gives you a list of all the paragraphs, and then take the last item of that list, provided the content structure has not changed and the last paragraph is what you want. You can do it that way as well if you're working with Wikipedia or anything like that.
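As a rough sketch of that, assuming the mw-headline class and mw-content-text id are still what Wikipedia uses (worth re-checking in the page source), and using an arbitrary article URL as the example:

    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Web_scraping"   # example article
    soup = BeautifulSoup(requests.get(url).text, "lxml")

    # Section headings are wrapped in spans with the mw-headline class
    for headline in soup.find_all("span", class_="mw-headline"):
        print(headline.get_text(strip=True))

    # Or treat the article body as one block of text and split it into paragraphs
    body = soup.find(id="mw-content-text")
    if body is not None:
        paragraphs = [p for p in body.get_text().split("\n") if p.strip()]
        print(paragraphs[-1])   # last non-empty paragraph, assuming that is the one you want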
Can we do the reverse? For example, if I'm developing my own website and I have, say, 10,000 records to upload to that website, then instead of reading from a website, can I simply write to it so everything gets uploaded in one go and all the data is available on my website? Okay, so as far as I understand, you have a website that you need to enter some information into. Let's say you have a website and a CSV file; the CSV file contains all the information about employees, and there is a form on the website that you need to insert all that information into. Can we use web scraping for that, is that the question? Correct. Okay, so again, like I said, web scraping is not really for that; for that you need to use something called Selenium. Selenium is browser automation, and there is also a Selenium package for Python, so if you're comfortable with Python you can just use Selenium for this. You tell it what you want it to do: for instance, give it the URL of the website, tell it there is a form on the page, have it get a record from the CSV file, enter the name of the employee and whatever else according to the information in that record, press the submit button, wait, and then do it again and again for every record. Selenium is built exactly for that; as you can see, we have a package for Python as well. But web scraping is not about that: web scraping is about getting information from a public web page, not about sending information to one. We work with responses, not with submitting data; this is why we use the requests library to get the response from the website, convert it to text, and then let BeautifulSoup use it. So if you plan to submit information to a form, I'd suggest you use Selenium.
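To give a rough idea of what that looks like, here is a minimal Selenium sketch. The URL, the field names ("employee_name", "submit") and the CSV layout are all made up for illustration, so you would replace them with whatever your actual form and file use:

    import csv
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()   # assumes a Chrome driver is available

    with open("employees.csv", newline="") as f:   # hypothetical input file
        for row in csv.DictReader(f):
            driver.get("https://example.com/employee-form")   # hypothetical form URL
            # Field names below are assumptions; inspect the real form to find them
            driver.find_element(By.NAME, "employee_name").send_keys(row["name"])
            driver.find_element(By.NAME, "submit").click()

    driver.quit()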

I think we do have courses on Selenium as well, so if you are interested, maybe you should refer to those; but web scraping is not really meant for that. Okay, just one more question: how do we deploy this in production? You have this code locally, but how do you deploy it, maybe automate it, maybe have it run once a day or so? How do we deploy this thing in particular? Okay, so the first thing you need to understand is that this is not production-ready code. It works fine, but if you are scraping for production, I suggest you use object-oriented programming, so that you can create a class with methods that allow you to scrape websites. After you have made the code production-ready and you're sure it's going to work fine, what you can do is deploy it on a Linux virtual machine online, using Azure, Amazon Web Services or any other provider, or run it on your own PC. If you want to run it on your own PC, you need to keep the PC running the entire day, or forever, since it's acting as a production server; but if you run it on a server, you can use Linux, Windows or any other server operating system, and put this file of yours, scrape.py or whatever you call it (that's up to you), inside that virtual machine. Inside that virtual machine you can then schedule it: there are Python packages that allow you to schedule the execution of a script, and in Linux there's also something called cron jobs. Cron jobs exist exactly for that; they schedule a command or script on your server to run automatically at a specified time. So, for instance, say you have a web scraper that you wish to run every hour and have it put the information inside a CSV file: you just take this scrape.py file, or whichever files make up the scraper, put them on a Linux server maintained by Azure, Amazon Web Services or any other provider (that doesn't really matter, just make sure it's running all day), and then schedule a cron job. Scheduling a cron job is out of scope for this session, but you can look it up; it's fairly easy, you just open the crontab and add an entry, and it works. So schedule a cron job to run this file every hour, or however often you wish, and that's one way you can do it. So can we install this on a Windows machine as well? Yes, we can use Windows too; also, many cloud providers like Azure give you facilities to run a script regularly, so maybe you can use that. For Windows itself, I'm not really aware whether there is a built-in command that lets you schedule this, so maybe you can look that up. Okay, thank you. Also, you could write a Python script that runs in the background continuously and executes your scraper every hour, so you can do it that way too; but I don't think there's a command inside the Windows operating system, like cron in Linux, that lets you run a job or script repeatedly at a specified interval.
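If you go the background-script route that was just mentioned, a minimal sketch could look like this; scrape.py stands in for whatever your scraper file is actually called, and the one-hour interval is just the example used above:

    import subprocess
    import time

    # Run the scraper script once an hour, forever (the file name is illustrative)
    while True:
        subprocess.run(["python3", "scrape.py"], check=False)
        time.sleep(60 * 60)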
Can this web scraping code be used in model building, say if I am going to create a machine learning model? Yes, this code can feed into any of your models if you wish, but I think the way it works is that you first get the information from the website and store it, like we got it from the website and stored it inside a CSV file, and then you use that information you have gotten from web scraping for training your models. If you wish, you can also take the information and use it directly without storing it, but I would suggest that you store it inside a CSV file, because that way you will know exactly what information you used to train your model. So you can get the information from the website and start using it to train your model without storing it in a file, but in my opinion it is better to store the information inside a CSV file and then use that CSV file to train the model. Okay guys, so this brings us to the end of this session. If you have any queries please comment them down below and we'll reach out to you immediately. Also, guys, do subscribe to our channel to never miss an update from Intellipaat. Thank you.
