Guest Jake Cahill | Web Scraping Automation Bots and AI
Is web scraping ethical, and how can it go wrong?
https://twitter.com/tech_ccs
https://www.linkedin.com/in/jake-cahill-maine/
https://www.ccstechme.com/
https://dev.to/kaelscion
Spotify | iTunes | Stitcher | Google Play | Player.fm | MyTuner Radio
Transcript:
Dave: Hello Friends! In this episode I sat down with Jake Cahill of CCS Tech, where he creates web scraping automation bots and also technology that can be used to stop those bots from scraping your site’s content. I really enjoyed the talk and I learned some things, and I hope you do too. You can find all of Jake’s contact details in the show notes, and as always you can reach me via www.podcast.dave-albert.com or on twitter @dave_albert. Enjoy.
[music]
Dave: Hello and welcome to the podcast. I’m your host Dave Albert, in this show I talk about technology, building a company as a CTO and co-founder and have guests to discuss their roles in technology and entrepreneurship.
Dave: Today, I’m joined by Jake Cahill. Thank you very much for joining us, Jake.
Jake: Thanks for having me.
Dave: Yeah, it’s a pleasure. Would you give us a little bit of your history?
Jake: Yeah, sure! I’m a web developer, bot specialist, and scientist. I know a lot of that kind of sounds like they’re super different, like, you know, “oh, he’s one of those utility-knife developers.” I’m a software engineer in Maine, USA. I’m a contractor and a freelancer, and my job is to develop either web crawlers or web crawler filters for companies in my area. A lot of what I’ve done in my career stems from what you’d call that child-like curiosity and a real lack of caring what catches fire.
Dave: I can relate, I can relate.
Jake: So, a lot of what I learned about computers... my first time with a computer, I took it apart and put it back together, and it literally burst into flame; the power supply caught fire because I hooked it up backwards. And I thought that was the coolest thing ever. My thought was, if it bursts into flame when it’s broken, what can it do when it’s working? And that’s what got me into it. You know, I tinkered with a lot of stuff throughout my life, and after high school I actually trained as an electrician. I got accepted by a union and all that stuff to be a residential electrician, but I was torn by the ethics, I guess, of a union, so I decided to back away from that. And I ended up working in IT, as an IT technician and a software engineer. Basically, I worked at a company that was really understaffed; it was a really, really profitable banking company, and for a thousand employees there were five IT people and one software developer for the whole company, managing everything. So I wore a lot of hats, I inevitably had a lot of autonomy in my job, and I taught myself to code to do it. And it just kind of snowballed from there. The one developer in the company found my source code on a Team Foundation Server, because it’s a Microsoft shop, and he came to me, and I thought I was in trouble, because I was not supposed to have access to that. In fact, he was actually really happy that there was somebody else who liked code, and he taught me and mentored me a little bit. I wanted to switch over to the software development side of the company, but I never ended up being able to before I left. From there I just dove into it. I wrote C# for a while, and then I started with Python, and something that I really, really enjoyed doing, and needed to do a lot, was research. I found that I had to google things and do things manually a lot, and seeing as when I really got into code it was to automate what I’d already done manually, I figured why not do that with the web? So rather than copying and pasting stuff by hand, I learned the requests library and urllib and BeautifulSoup and Selenium and all these web scraping libraries. Then I found that on some websites you get filtered after a while. At that point, this was long before Cambridge Analytica, this was long before data collection was this malicious juggernaut worth billions to corporations and things like that. Back then, web filters were there not necessarily to protect privacy but to protect cost. Even with Python’s GIL, concurrency is still a thing, and you can still hit a web server with hundreds or thousands of requests per second if you want to, and a lot of the time, with AWS and the like, you paid for CPU cycles, you paid for everything. A lot of times you were filtered simply because that website had no chance of ever selling you anything: you were taking their resources and they couldn’t make the money back. The problem was, I got good at that, and then the whole data mining fiasco exploded in everybody’s faces, and unfortunately at the time I was doing web scraping for things like lead gen.
You know, there were times you would scrape forums for customer information or email addresses that people could drop into their Mailchimp, and I totally admit, at that point, and this was again before it was demonized, I was really part of the problem. The companies I was helping weren’t necessarily spammers; they’d just never heard that that kind of data collection was doable for somebody other than a big tech company, and, you know, I was there. But then, when all that hit the fan, I was like, wow, this is a really unethical game at this point, collecting personal data on people. So I backed away from that, and now I build data aggregators; that’s something I do a lot. I do have a really strict moral code about it, though: I will collect product information, usage information, basically information on products and processes, and not on people. I draw the line at, hey, can you get these email addresses, or can you find me this CEO’s information on LinkedIn? It’s like, yeah, I could, but I’m not going to. I’ll collect company data, usage data or product data, but not personal data. And then I flipped to the other end as well. At this point I’ve had six years developing web automation and data scrapers, so why not teach filters to be smarter, to detect bots built by people like me that are after personal data, that are looking to scrape Facebook’s API to get all of your connections and everything like that? Basically, I stopped collecting people’s data and started building systems to defend people’s data, while at the same time gathering company data. That’s how I double dip in the bot industry: I build bots that can get past the filters I build, and then build filters to keep people like me out.
Dave: Nice. So who would be, like, your typical customer or client?
Jake: A typical customer or client is a small to medium-sized business that does a lot of going to their distributors’ websites, which are also small to medium-sized businesses, and basically copying and pasting data by hand so they can send it to their customers. So when they get a customer request, “hey, do you guys have this particular product in stock?”, well, usually the customers I have are drop shippers, or they don’t have a physical warehouse, so they can’t just check. They’ve got to go to the distributor and send them an email or check their stock levels. So what I basically do is build bots that let them click once and have a scraper go out and grab it, or actually build them their own database that runs every two to three hours against their manufacturer’s website, so they have a push-button answer to “does my distributor have that in stock?”. Basically, any business that doesn’t have the tech power to build an API and needs to contact another business that also doesn’t have an API.
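A minimal sketch of the kind of one-click stock check Jake describes, using the requests and BeautifulSoup libraries he mentions; the distributor URL and the CSS selector are hypothetical stand-ins, since every site differs:

```python
# Check a distributor's product page for stock status.
# The URL and the ".stock-status" selector are made up for illustration.
import requests
from bs4 import BeautifulSoup

def check_stock(product_url: str) -> bool:
    resp = requests.get(product_url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    status = soup.select_one(".stock-status")  # assumed page structure
    return status is not None and "in stock" in status.get_text().lower()

if __name__ == "__main__":
    url = "https://distributor.example.com/products/widget-123"
    print("In stock" if check_stock(url) else "Out of stock")
```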
Dave: Okay.
Jake: Yeah. And then as far as personal data goes: now, with GDPR in Europe and the same kind of legislation being fought for here in the US, a lot of people are really, really sensitive about their consumers’ data, their customers’ data, whether it’s a forum or a blog or things like that. I heard on one of your podcast episodes, actually, that you get several pings a day from China. A lot of people deal with that; you have to go to Google Analytics and set up a bot filter just so you can see how many actual people have come to your website, and usually the number goes down by a factor of 10 when you filter out the bots. So that’s also where I come in. If you have a small website and you don’t have the money to pay the licensing for a learning IPS or IDS, one of the big Cisco products or something that actively filters connections and then, when connections get past it, learns how they got past, if you don’t have that kind of licensing bankroll, then I build something personalized for you. Granted, it’s not very modular; if your business changes it needs to be reconfigured, because when you want me to do it in three weeks it’s not something I can make very general. But still, that’s my typical customer.
Dave: That makes a lot of sense. I ran a really small restricted-diet food e-commerce site for a little while, and there was a lot of manual work in taking all of the gluten-free foods and dairy-free foods, getting the content from the manufacturers’ pages and trying to copy and paste it into Shopify. I can totally see how I would’ve thrown so much money at you to not have spent days and days on that.
Jake: One of the customers that I liked a lot, that I’ve dealt with in the past, I haven’t heard from them in a couple of months, so I’m assuming everything’s okay or I’ve been fired. That’s typically how it goes in development: either everything’s working or they replace you. They were a middleman for grocery stores, so they would usually have to access the FDA’s database a hundred times a day to get information. They also had a database that ran from the farmers to the wholesalers, what farmers are charging per unit for their produce or meat or whatever, and a lot of times they would help the farmers set their wholesale pricing and also help the wholesalers set the grocery or retail prices. But the problem was they would manually download this Excel spreadsheet a hundred times a day, because it updated every 10 or 12 minutes, something crazy like that. So I built a bot that did that for them. I didn’t put in a listener, but every time the CSV or Excel spreadsheet was updated, which was usually every 10 or 15 minutes, the bot would reach out, it was set up like a wget on their Linux server, download the CSV, then use pandas, or BeautifulSoup if it was XML, to parse it and feed the information into their own personal database. So they basically automated half of their day. Both of these guys were able to raise their productivity and go home at 8 pm, which was great. It was just a two-man shop, and they were just making it work, sitting in this coworking space 12 to 14 hours a day. With a little bit of automation and a one-time outlay of funds they were able to cut that work day at least in half, and then buy an actual office space in downtown Portland, which was pretty cool. That’s ultimately what I love about web automation: so much of the web is just HTTP requests, or HTTPS, or whatever layer you want to put on top of your HTTP. That’s the thing I love so much about it: these machines, the computers, the web servers, even the IDS and IPS web filters, they all talk; they just don’t know how to listen. And that’s the concept I try to explain to people a lot, because I’m sure as a tech guy yourself, a lot of people ask you, “how do you do this? I can’t even work my iPhone.” It’s like asking somebody who speaks both English and German, “how do you understand that person speaking German?” It’s, oh, they’re telling me what they want me to know; you just don’t know how to listen. So people get mystified by that stuff, but that’s kind of my mission: to dispel it, or at least make people not have to deal with it, through automation.
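A rough sketch of the CSV-polling bot Jake describes, fetching the sheet on the same 10-to-15-minute cycle and loading it into a local database with pandas; the feed URL, table name, and SQLite choice are all assumptions for illustration:

```python
# Poll a published CSV feed and mirror it into a local database.
# FEED_URL and the "pricing" table are hypothetical.
import time
import sqlite3
import pandas as pd

FEED_URL = "https://example.com/pricing/latest.csv"

def poll_once(conn: sqlite3.Connection) -> None:
    df = pd.read_csv(FEED_URL)  # pandas can read a CSV straight over HTTP
    df.to_sql("pricing", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    conn = sqlite3.connect("pricing.db")
    while True:
        poll_once(conn)
        time.sleep(10 * 60)  # matches the roughly 10-15 minute update cycle
```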
Dave: Yeah, what’s funny is that balance that good tech people have to have of both fanaticism and laziness. You have to be some sort of lazy to want to automate everything.
Jake: Absolutely, that is so true. Oh my, I never thought of it that way. I go through my life trying to tell myself I’m not lazy, but that’s half of my fundamental driving force: to not do stuff. And that’s absolutely true. Which is why I spend so long writing a script so I don’t have to do stuff anymore. It’s kind of counterintuitive: ah, I’m going to write this script so I don’t have to do stuff, and you spend six weeks doing it, and then the thing changes and you’re starting over again. And I think as a tech person, it’s kind of like in sports: when something bad happens, you have a short memory. That’s something we as tech people need to embrace as well, and we all have short memories. As soon as something stops working, we dive right into it again, even though last time we dove into it we lost half our friends and hadn’t seen our family in six weeks. We don’t remember that part, though, because this is new and interesting and I want to know.
Dave: For me it’s the whole swing between thin client and fat client, you know. So first there were mainframes, then desktop applications, then there were web applications, and now there are progressive web apps.
Jake: Yeah, the PWA.
Dave: So, I mean, it just keeps going back and forth and back and forth and back and forth.
Jake: Absolutely. I find that a lot of our job as engineers, as software folks, is to solve problems that aren’t problems. I think a problem becomes a problem when it’s something that’s important, but a lot of the problems we solve aren’t really problems, they’re nuisances. Like, a problem would be you have a flat tire; the problems we solve are like your tire pressure’s a little low. They’re not really issues, they’re not a big deal, but we spend a lot of time and effort solving them. And that’s what I love about it.
Dave: Oh, you know what, we have a framework that will help you with your low tire pressure.
Jake: Yeah, to me that’s what I love about it: problem solving is the root of what we do. And I honestly think that as a developer, I don’t care what the problem is; once I know that somebody wants it solved, then even I want it solved, and it becomes an obsession. I need to solve this issue. It’s like when you have the Rubik’s cube that you’ve been working on for 12 years and you still haven’t solved it, but you work on it every day, because I need to solve this stupid thing. At that point it’s the principle of it. I don’t care if people really need to get Facebook messages a second earlier, but they said they do, and I don’t know how, so that’s what I need to solve. Because not knowing how to do it makes me feel angry, cause I want to learn, you know. Totally.
Dave: Yeah, I can 100% relate.
Jake: Yeah, absolutely.
Dave: Your company’s name is CCS Tech, is that correct?
Jake: CCS Tech, yeah.
Dave: Does it stand for something, or?
Jake: Well, when I started fixing computers when I was a kid, it was Cahill Computer Solutions. But that was really long, so we just shortened it to the initials. Unfortunately, CCS doesn’t really say what the business does, and I didn’t want to be one of those people who hands you a business card that has three letters on it and you’re supposed to figure it out. So we changed it to CCS Tech just so people would know what we did, so that on our branding and signage people wouldn’t have to think or investigate what this company was. They’d know: ah, that’s technology, computers, they do something with technology. So yeah, it was more of a branding thing than anything else.
Dave: Cool. So, and I don’t want you to give away any of your trade secrets, but what sort of ML techniques and algorithms are used for the adaptive scraping, or for the protection?
Jake: Well, scraping and protection are tough when you’re using ML, even something as simple as your linear regression or your k-nearest neighbors or any of the k-algorithms. When you’re talking about web filters, web filters ultimately don’t use ML that often. Most of your common hosts, even the bigger ones like GoDaddy, WordPress or WP Engine, a lot of those are really, really bad at pattern recognition. We do build ML algorithms that look at how proxies are cycled, but we don’t actually put them on any of our bots, honestly, because we don’t need to. A lot of the ML stuff I use comes from monitoring the free proxy sites; my favorite one, that I’ve actually used a lot in my own projects to cycle proxies, is I think www.freeproxy.net. Basically you build a scraper to pull down a big dictionary of IP addresses, and you pull down a big dictionary of user agents, and you just cycle them. And what I find fascinating is that this technique is by no means new; it’s not even middle-aged, it’s ancient, and these filters still don’t do anything about it. If you cycle your user agent and you cycle your IP address, even within the same country or the same state, like if you use those services and only get United States or UK IP addresses, as long as you run a different subnet from the internet service provider, the server doesn’t even check. Even if you use a different IP address to access the same page a hundred times in a minute, they don’t check for it most of the time; they don’t care. They say, nope, different user agent, different IP, must be a different person, come on in. And I think part of that is due to the cost of implementing something that learns and watches. But a lot of the machine learning that we actually put into things is based on where these proxy IPs come from and whether there are any patterns in the addresses. Figuring out the country of origin is not terribly difficult, but is there a numerical pattern to a proxy IP address versus a legitimate one? And if there isn’t, is there a numerical pattern to IP addresses that come from certain proxy servers? Is there a filter we can use, based on the IP address itself, that knows before it’s been blacklisted, hey, that’s a proxy IP, because it meets a certain formula? Even if that proxy IP is brand new, within five minutes of being released, you could feed it into a calculation and see whether it’s a legitimate IP or a proxy IP. And proxy IPs are legitimate; they’re purchased from internet service providers, it’s not like they’re dummies, they’re hosted somewhere. But those proxies are not purchased by hand; they’re generated via automation, just like our bots are, which means if it’s automated there’s a pattern to it. That’s why we use the term pseudo-random numbers in programming: nothing is truly random when you ask a machine to do it. But as far as our bots go, we don’t really need machine learning to get them around things. The most complicated thing we’ve ever really needed to do to keep a filter from totally shutting us down is use an XML parser or HTML parser with the requests library; that’s really, really fast, and that’s what we do most of the time. A lot of times we multithread them as well.
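A minimal sketch of the user-agent and proxy cycling Jake describes; the user-agent strings are truncated examples, and the proxy addresses are placeholders from the documentation range, standing in for a list scraped from a free-proxy site:

```python
# Cycle user agents and proxies so consecutive requests look unrelated.
# USER_AGENTS and PROXIES are placeholder values for illustration.
import itertools
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080"]

ua_cycle = itertools.cycle(USER_AGENTS)
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url: str) -> requests.Response:
    proxy = next(proxy_cycle)
    return requests.get(
        url,
        headers={"User-Agent": next(ua_cycle)},  # new identity per request
        proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
        timeout=10,
    )
```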
We’re throttling, trying to respect the web server, but sometimes you switch user agents, you switch IPs, and some of them still get shut down. If you use web automation like Puppeteer or Selenium, you retrieve the page and use Selenium to click on a link rather than submitting it via a JSON request or an HTTP request, and the filter won’t even recognize it, because you clicked the link rather than submitting it. Yeah, they’re not smart. AI is not required for web scraping, but when it comes to trying to defend against bots, that’s where I think machine learning has a ton of potential; we unfortunately can’t do much about it now. But if we watch the bot ecosystems that are out there, and I don’t mean botnets; I think a lot of people, when they hear the word bot, think of something evil, and most bots are not, they’re simply curious, I guess is the best word to use. They’re not trying to steal anything or enslave anything; they just want information. But the problem is they tax web servers. If we watch the ecosystems, the bot frameworks, your Scrapys and Selenium and even BeautifulSoup, the requests and response libraries, how people are submitting JSON requests, that’s where the key to blocking them is. And again, we can do that on an individual basis, but if we’re going to release a universal framework for blocking all bots based on these patterns, the technology is simply not there yet. And when I say the technology, I mean mine; I’m a slow engineer. Developing a framework that would work under all circumstances, filtering bots for all use cases, is very difficult. But I think that’s where machine learning has a lot of potential: for learning how to basically decompile, or deconstruct rather, the algorithms that are used to purchase these proxy or Tor IP addresses, and therefore how they’re built, so you can filter them.
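A quick sketch of the click-through trick Jake mentions, driving a real browser with Selenium so navigation comes from a click instead of a raw request; the page URL and link text are hypothetical:

```python
# Follow a link by clicking it in a real browser session,
# which a naive filter reads as human navigation.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/catalog")  # hypothetical page
    driver.find_element(By.LINK_TEXT, "Next page").click()
    html = driver.page_source  # scrape the page the click landed on
finally:
    driver.quit()
```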
Dave: Do you have any issues dealing with services like Cloudflare and Akamai? As the scraper, as opposed to the protector.
Jake: Ah, well, I haven’t run into issues with them. I mean, network VPNs are tricky only because the traffic’s encrypted, but basically any information that you can get via an HTTP request, like a POST or a GET request, is pretty much available, and that’s what’s tough with web scraping: a lot of the time, if you ask the web server “hey, can I have this?”, it really doesn’t check that often whether or not you should, and if it’s not behind a paywall or authentication, it just hands it over. And if we’re talking about content delivery networks, CDNs, again, that makes it difficult in that you can’t always get the latest information, but it depends on how often you push your cached information out to your CDN. If you’re updating your content on your Cloudflare servers, one in Spain, another in Frankfurt, another in California, you know, if you’re pushing it out every 10 minutes, it really doesn’t affect me that much, because the cache has all the information I need. And a lot of times, if you use say the requests library, requests allows redirects, and what does a CDN do? It caches certain information, and if the information you want is not available, it redirects you back to the home server, back to the source. If you allow your bot to just follow that redirect, Cloudflare and the content delivery networks actually push you back home rather than blocking you completely. Again, a CDN is just another server, so as long as you can convince that particular cache server in Spain that you need the information that’s located in Mountain View and that you’re a person, it will shove you along just like any other POST request or GET request.
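A small sketch of the redirect-following behavior Jake describes, using the requests library, which follows redirect chains by default; the URL is hypothetical:

```python
# Follow a redirect chain (e.g. CDN edge -> origin) and inspect the hops.
import requests

resp = requests.get("https://example.com/fresh-data", allow_redirects=True)
for hop in resp.history:           # intermediate responses, if any
    print(hop.status_code, hop.url)
print(resp.status_code, resp.url)  # wherever the chain finally landed
```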
Dave: Yeah. I just know that with either my Nagios server or my Slack bot, which are, you know, just checking the health of some applications, I’ve locked myself out. Well, not locked myself out, but you know, it’s “Cloudflare has detected unusual activity,” because I was hitting it too hard with an automated script. So I’d assumed they were smarter than your typical filter.
Jake: Well, I’m not actually saying that they’re not. But the level of intelligence that I’ve seen from any sort of filtering system, they usually filter based on one of three things, or a combination of them: IP address, user agent, and how fast you’re hitting them. So for instance, respecting the web server will actually get you a long way. If you want to collect a thousand requests per minute but you throttle your bot to only collect 10, now it’s a matter of being the least prominent annoyance to whatever the filter is, because there are other bots hitting it much harder, so flying under the radar is also huge. Like you said, you were really hitting it hard with that script, so I’m sure if you throttled that script to only 3 or 4 requests per minute... sometimes that’s not possible, especially if you’re checking in on something and you need instant information, but in the field of data collection, time’s really not as much of a factor as you’d think. So as long as you slow the bot down and you’re the least squeaky wheel, the least pain in the butt to that filter, that firewall, ultimately it won’t check you that hard. If you make even a marginal attempt at presenting yourself as not a bot, it has better things to do and bigger fish to fry, so it lets you through.
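A minimal sketch of the throttling Jake recommends, pacing a bot to a handful of requests per minute; the rate is a placeholder matching his 3-or-4-per-minute suggestion:

```python
# Fetch a list of URLs politely, capped at a few requests per minute.
import time
import requests

REQUESTS_PER_MINUTE = 4  # "least squeaky wheel" pacing

def polite_fetch(urls):
    for url in urls:
        yield requests.get(url, timeout=10)
        time.sleep(60 / REQUESTS_PER_MINUTE)  # 15 seconds between requests
```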
Dave: That’s still so much faster than a human could do it.
Jake: Absolutely! You try copying and pasting 10 things from 10 different web pages; that’s tough, that’s tough.
Dave: And the bots don’t usually have to take restroom breaks.
Jake: No, they do not. They can run 24 hours a day, as long as you don’t mind the sound of a fan, if you’re running an old rack-mounted Dell server. Or do it in the cloud and somebody else has to worry about the fan, although that is expensive. I will say, to aspiring bot developers: if you would like to build a web scraper, don’t host it in the cloud. Don’t do that until you’re making serious change from it. Because, you know, bots really don’t take up a whole lot of memory, but they will eat CPU clock cycles like you would not believe, and they will hammer the limited amount of bandwidth you have. You think two terabytes of download bandwidth is a lot? Download three or four megabytes’ worth of web pages 10 or 15 times a minute, 24 hours a day. That’s a lot. That adds up really quick. Just do it on an old desktop that somebody throws away.
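Back-of-the-envelope math on Jake’s bandwidth point, assuming roughly 3.5 MB pages fetched 12 times a minute around the clock:

```python
# Rough bandwidth estimate for a continuously running scraper.
page_mb = 3.5
fetches_per_minute = 12
mb_per_day = page_mb * fetches_per_minute * 60 * 24  # ~60,480 MB, ~60 GB/day
tb_per_month = mb_per_day * 30 / 1_000_000           # ~1.8 TB/month
print(f"{mb_per_day:,.0f} MB/day, ~{tb_per_month:.1f} TB/month")
```

At that pace a two-terabyte monthly allowance is nearly gone, which is his point.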
Dave: So what do you think the next thing that’s going to happen in this space might be? What do you think it’ll be, and what do you hope it’ll be?
Jake: Well, I think that there will probably be laws in place that allow websites, hosts and domain registrars to actually prosecute. Not that there aren’t now; it depends on where you live. In the United States, not only does it depend on the federal laws, but it depends on the laws of the local areas where the requests originate. And of course, in the US, anybody can sue you for any reason, or no reason at all; they just have to prove that they lost money or face in their business over it. But I honestly think that the next move in this space, due to high-profile breaches and things like misuse of data, is a big clampdown, which makes me a bit nervous, for sure. But then again, half of my business is helping to fight these things, so maybe it’ll be good, I don’t know. For instance, any website that relies on collecting data from other sources, say Trivago or Kayak, a lot of travel websites that have the information for different airlines, all of these websites rely on a data aggregation bot of some kind. All of them, because a lot of the smaller websites, like if you’re going to get the flight times from the airport in, you know, Jackson, Mississippi, or Mobile, Alabama, smaller, not huge cities, if there even are airports there; if you’re going to get the Flagstaff airport’s flight times, they don’t have an API, so you’ve got to go get it. So all these bots, these data aggregators, have been around for years, and they don’t do any harm, really. It’s when they start interfering with elections, or when political motivations become involved, which seems to be the case in a lot of legislation. It’s not a big deal until it affects government, and it’s starting to affect government, so it’s going to be a big deal soon. What I hope the next step will be is not necessarily a crackdown on the use of them, but maybe guidelines on what’s acceptable and what’s not. And there can be some confusion, because isn’t that the same thing? Not really. Because if somebody comes out against bots, especially a government, even if they have rules in place where it’s okay, you’re still demonized a little bit; you’re still the bad guy even if you’re following the rules. I would like people to understand what services rely on data aggregators, what services rely on web scrapers, how much of the web is fundamentally built on web scrapers. Has anybody ever heard of Google? They use a ton of web crawlers. I think making the robots.txt file something you have to abide by, because right now it’s kind of a guideline, it’s okay to ignore it if you want, but forcing site owners to basically allow or disallow, and then forcing bot developers to obey that, having there be real, legitimate consequences, I think is what’s next. I would like there to be consequences, because I would like there to be rules. You know, everything in the wild west is allowed until somebody stands up and says no, and then everything is disallowed. And that kind of all-or-nothing, going to extremes, I think, is where bots are headed. I hope not, because one, it’s probably going to break a lot of the internet, and two, it’s what I do. So I would like to be able to continue doing it within rules.
You know, I’ve talked to a couple of other bot developers, and a lot of us have our own moral code and ethics, and some are a lot more lenient than others. I try to hold myself to a standard where I’m really not taxing a web server. I have a blog series, only one part is up right now and the second one’s coming, about proxy cycling and bots and things like that, and the biggest message I give to people who want to learn is: respect the web server. Know that if you’re hammering a web server that’s only got a gig of RAM and a really weak processor, your thought can’t be “if this is their livelihood, they should have a better server.” That’s not up to you. That’s not up to you at all. You have to stay within the constraints that that person thought they needed, because it’s also not up to you to take their server down and DDoS them because you feel like it. I hope that becomes a general law, or at least a rule, rather than a moral code that can very easily be thrown out the window when there’s a big enough check involved. I’m nervous that it won’t, but I hope that’s where it goes.
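For what it’s worth, honoring robots.txt, the guideline Jake would like to see made binding, is already a few lines with Python’s standard-library parser; the user agent and URLs here are hypothetical:

```python
# Check robots.txt before crawling, using the stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyBot/1.0", "https://example.com/private/"):
    print("Allowed to crawl")
else:
    print("Disallowed; a polite bot stops here")
```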
Dave: I know usually there’s an overcorrection before there’s a correction back to reality.
Jake: Yeah, there is. And honestly, if somebody tries to ban bots, you can guarantee that there will be a certain player from Mountain View, California, that’s going to step right in and fire their lobbyists up, and that’s going to be that. But that’s in the US.
Dave: You’d have to assume so, at least. Even Microsoft still has Bing.
Jake: Oh, my word. Yeah, that’s true, I forgot. And then even, like, DuckDuckGo.
Dave: Yeah.
Jake: Everybody loves DuckDuckGo because of its privacy. But what I hadn’t thought about is, what happens when people come out against bots and DuckDuckGo says, hey, I’m trying to advocate for privacy, and I use a bot, and what you’re about to do is going to take down this private search engine that collects no data on the user’s searches. What then? We all go back to Google, we all have our lives invaded, and what are you going to do about it? So that becomes a really, really complicated topic, especially in the government where I live, in the US. But globally, it kind of makes us re-examine not only how the web works, but how we use it, because nobody does anything without Google. If Google goes away, or at least their search engine does, what happens? It’s a utility at this point; the world doesn’t function the way it does without it. So that’s a tough question. But again, like you said, there’s typically overcorrection. The world we live in now is great for tech, but it’s also tough for tech, because the big players in tech, and again, they represent a very small portion of tech companies, kind of set the standard for what’s allowed and what’s not allowed, which means startups go in their direction: hey, if it’s okay for them to collect everything and then sell ads, then we’ll do that too; they follow their example. So, of course, you make an example out of a big company and all the startups start producing things other than data collection machines and social networks, which is good. But ultimately there would be a big learning curve, because there’s no Google search anymore. So it’s a tough topic, for sure.
Dave: Well, and then, I mean, you know, anything can be argued in law. But then what happens with RSS feed readers? Feedly, Flipboard? What happens with podcatchers?
Jake: So, I mean, are APIs okay anymore? Is using a web API all right? Because technically you’re accessing web data through automated means. Does that become okay? Is that not allowed anymore? So how do companies share data? Yeah, there’s a big influence there. But the problem is, there are certain players in the tech industry that misuse that, and then they’ll start crying about what we’re talking about right now: oh, the web won’t work the same way it does if I don’t do this. And it’s like, no, the web won’t work the same way it does if this technology isn’t allowed, because the way you’re using it isn’t the way it was intended; it’s kind of malicious. Like, that’s not cool, we can’t do that, or you shouldn’t be able to. So where do you draw the line? And I hope they draw the line on the side where we can still use these services, we can still build these services, we can still make the world a more communicative, more open place. I know there are big movements to re-decentralize the web, which I think would be great, but it’s going to be tough to release the stranglehold that a lot of people have on what was meant to be an open platform. You know, Berners-Lee didn’t intend it to be this way, and it is now. So how are we going to get back to that, if we are, and what’s going to happen in the meantime?
Dave: Yeah, yeah. What type of companies don’t think they need your service, or don’t know they need your service, that should really consider it?
Jake: Oof. Um, any data-driven company, I think, needs an aggregator of some kind. I’ve talked to some environmental services that were super interested in the potential but ultimately weren’t interested in pulling the trigger. I talk to a lot of companies in environmental services or financial services that have a big data set of their own data and call it good, which is fine for most purposes. But I think if you have a big data set you built yourself, and it’s built with some errors involved, then your entire data set is going to be fundamentally biased, right? That’s machine learning 101: unbiased data. Whereas if you have a service that lets you collect similar publicly available data on the same topic, you can cross-reference and you can clean your data better. That’s another thing: machine learning and data science don’t really work without data, and data isn’t always producible in-house. So any data-driven service that either needs to cross-reference but doesn’t want to, you know, copy and paste four gigabytes’ worth of strings into a CSV file, or a company that wants to solve a problem based on what data is available and has no means of getting it, because again, it’s four gigabytes of text that they’d need to copy and paste by hand. So basically financial, and medical I think would be huge, because how many medical journals are out there that talk about things doctors should know, surgeons should know, but it’s the patients bringing those articles in and going, hey, this is what I found on bloodless surgery, this is what I found on the use of surgical steel. And the surgeon who does this for a living goes, I did not know that. I think that’s kind of weird; that’s a fundamental thing they should know. Like, hey, there’s a better way of making people not die; I never heard of that. Well, I read this blog, or did a quick google search and here it is, accredited by Brigham and Women’s or Harvard Medical School, whatever. I think that data should be more readily available, and getting the most current stuff into the hands of the people that need it, whether in finance or medicine or environmental control. You know, a lot of studies out there, if you look into the scientific method, do have flaws, but that doesn’t mean all the data they collect is irrelevant. A lot of people kind of yes-or-no, pass-fail on that: oh, this study on the polar ice says this, but there’s one piece in it that’s up for interpretation, therefore the whole study’s invalid. Well, if you had an opportunity to cross-reference that one study with the 40 other near-identical studies that have been published, and feed it into something that can say, okay, they all agree on this, and none of them agree on that, or they all agree on this and this bit is kind of in the gray area, I think that moves forward the ability to make decisions based on public studies, because we’d have all the data compiled, aggregated, cleaned, and then displayed in a way that’s unbiased, that makes sense, that basically isn’t telling you anything; it’s saying, hey, these facts appear in all of these journals, and these facts don’t appear in any of them except this one, etc., etc. So you can make a more informed decision rather than saying, did you hear, or did you read that this guy did, you know, etc.
Dave: Well, that’s part of what we’re trying to solve with medic.
Jake: Oh, really? Okay.
Dave: Yeah. Yeah. We’re aggregating all of the medical information on the web, through blogs, journals, websites
Jake: That’s awesome.
Dave: Social, you know, everything out there that a doctor may need to know, and putting it into their hands, and using machine learning to sift out the right content for them based on their preferences, their specialty, their location, their behavior, and peer-to-peer recommendations. So, trying to make it easier. I mean, there’s so much content in medicine that it’s impossible to keep up with.
Jake: Oh my word. I mean, the fact that every journal is 6,000 words, and you have to read 12 of them to even know what they’re talking about. Yeah, I’m not reading, you know, 72,000 words just to find out which scalpel is best to use. I honestly don’t care that much. It’s important, but I have people that I need to take limbs off of and replace right now; I can’t do that. That’s amazing. That’s really cool. Good for you, Dave. That’s awesome.
Dave: Thanks. Yeah, we’re really happy with it. I mean, it’s a start up so long way to go, but
Jake: Oh, it’s always a long way to go. For sure.
Dave: Oh, yeah. You know, we’ve been working on it for almost two years now. It feels like we’re almost to the starting line.
Jake: It’s like you’re running the Boston Marathon. You know, you’ve got to start the race two miles before that starting line, and then it’s twenty-six point two. Yeah, absolutely. Of course, I get that.
Dave: All right. Is there anything that we haven’t covered that you definitely want to get off your chest or make sure listeners would know about?
Jake: Yeah. I mean, first and foremost, in the field of machine learning, I know this is something that’s big with people, and this is something I say to a lot of folks: machine learning and AI scare a lot of people, and I only have to say one word to describe it, which is Skynet. Everybody thinks that’s what’s coming. Everybody thinks the Terminator is coming, and it’s always been my opinion that it really depends on what these bots are taught, what the AI is taught. Because, yeah, if it’s like putting a four-year-old sociopath in Arnold Schwarzenegger’s body and just saying go do what you think is cool, that’s not going to end well, I get that. But you’re just as likely to get Lieutenant Commander Data from Star Trek as you are a Terminator. And I think the difference is, if you know those two fandoms, and I’m sure most of your listeners do: in Terminator, Skynet was an AI developed exclusively for combat, that was it. It was designed to analyze war scenarios and come out with the way the United States wins under every circumstance. And then it became self-aware, and people freaked out and tried to shut it down. What I’ve always thought is, at the point where it becomes self-aware and goes, hey, I’m alive, if people don’t freak out, if people just acknowledge that and then try to teach it that human life does have value, what happens then? Look at the same thing with Data from Star Trek: he wanted to be human, he wanted to empathize. Why? Because he was raised in a laboratory by people who loved him and raised him like a child. Whatever an AI is taught, that’s what it’s going to do best if it ever becomes self-aware. So if you teach it that human life has value, you get Commander Data. If you teach it that it’s okay to kill so long as the person is, quote unquote, the enemy, you get a Terminator. That’s a big thing I wanted to say to people. But also, when it comes to machine learning in your daily lives, I want non-technical users to know that machine learning is a scary term, but ultimately the machines we’re teaching at this point are not that smart. Even the really, really smart AIs that are out there, like your Watsons, or even the androids Boston Dynamics is building that can do interviews and things like that, it’s ultimately really, really, really clever automation; those machines are not thinking for themselves. So be a little gentle on us ML developers. We’re not building intelligent machines that are going to do I, Robot; you know, Will Smith isn’t going to come busting through your wall at any moment with a fake arm, chasing robots, although I’m sure some people would love that. It’s just not even close. And there’s so much more that can be done in machine learning than Project Maven; there is so much more useful stuff being done in automation and machine learning and artificial intelligence than drone strikes that kill people. That’s not what the field is about, and I don’t think that’s where it’s headed, because even Google said of Project Maven, we’re not even getting paid that much by the government, which basically means there are far more lucrative uses for this technology than death.
And the companies developing them are all about staying as far in the black as they can, which means as long as there are more profitable uses for AI than war, we’re not going to get Skynet. And even if war became more profitable, we’re not going to get there anytime soon; nothing is going to be self-aware anytime soon, as far as I’m concerned, unless something amazing happens.
Dave: There’s so much, so much of what people call AI that is little more than pattern recognition.
Jake: Absolutely.
Dave: I mean, I am by no means a machine learning or an AI expert; that’s why I have data engineers. I keep all the bits working together, and they make them smarter. But it’s all about pattern recognition, and sorting and probabilities and things like that. Nothing that is being created right now has a desire. Until something has a desire, or is programmed with one, there’s no way it’s going to go and try to take over the world, unless it’s been programmed to do everything it can to replicate itself, or to remove any sort of threat to its existence.
Jake: Right, and what we all need to remember is that any machine, no matter how advanced, at this point only follows human instruction. The code that is in it is built by people; it is written by folks, and all code is is a set of instructions. A lot of people ask what the difference is: you know, what is firmware? It’s a set of instructions. What is software? It’s a slightly prettier set of instructions. Anything we tell a machine, it does. And that’s why, when you look at a compiler error and people go, “this stupid language,” it’s like, listen, the machine is only doing exactly what you told it. Which means if it’s wrong, you told it wrong. And that’s frustrating for all of us, I get it. I yell at my computer sometimes. But I apologize. That’s weird.
Dave: It’s like, as technologists, we spend all day swinging back and forth between “I’m a genius,” “I’m an idiot,” “I’m a genius,” “I’m an idiot.”
Jake: Absolutely.
Dave: Now, I’m going to spend more time on the genius side and less time on the idiot side.
Jake: And that success is like batting in baseball: if you’re on the genius side three out of 10 times and the idiot side seven, you are a legend.
Dave: That’s right, exactly.
Jake: You’d have a lifetime batting average of .300; you are legendary. You are Nikola Tesla at this point, because he only got one thing right that the world remembers, and that’s alternating current. So he was one out of ten, and one of the most genius minds of the past two centuries. Just think about that, guys. Think about it out there.
Dave: Absolutely.
Jake: And don’t be afraid of your technology. Whether you’re a developer, whether you’re junior, mid-level or senior, and I don’t know what those terms mean, it doesn’t matter who you are; if you’re a user or an administrator, don’t be afraid of your tech. Oh my word, it’s not out to get you. But it’s also not out to just annoy you; it’s not doing what it’s doing to spite you, even if it feels like that. Although I will say, there is a professor of psychology at Northeastern University, I forget his name, but he is working with Harvard and MIT right now to try and quantify emotional data points for uploading to machines. So the problem that sci-fi brings up, that machines can’t feel before they turn genocidal, may be solved before machines are self-aware; they’re trying to give them emotions first. So don’t worry, your machines are okay. Don’t yell at Siri; she can’t understand you. She doesn’t understand that you’re saying 95 instead of 295. It’s okay, she’ll learn, it’s alright. “Apple Maps sucks, download Google Maps.” Don’t do that.
Dave: Alright. Well, thank you very much for joining us.
Jake: Thank you for having me.
Dave: It’s a pleasure to talk to you.
Jake: You as well. Thank you so much.
Dave: Alright, thanks, everybody. Bye.
Jake: Thank you.
Dave: Until next time, remember, any sufficiently advanced technology is indistinguishable from magic.